Examples¶
Get value by row and column¶
This is just an example to show how you can go into low level handling directly datachunks and vectors from the result.
DuckDB C API docs can be useful to understand those concepts.
Performance is not the best here. You should select only
the values that you need instead of using a SELECT * query
to actually retrieve only one from the result,
but the example works to show how is data stored internally.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
use Saturio\DuckDB\Result\DataChunk;
$duckDB = DuckDB::create();
$duckDB->query("CREATE TABLE example_4 (column1 VARCHAR, column2 VARCHAR, column3 VARCHAR);");
$duckDB->query("INSERT INTO example_4 (column1, column2, column3) VALUES ('0x0', '0x1', '0x2'), ('1x0', '1x1', '1x2'), ('2x0', '2x1', '2x2'), ('3x0', '3x1', '3x2');");
$result = $duckDB->query("SELECT * FROM example_4");
$value3x1 = get_value(['row' => 3, 'column' => 1], $result);
printf("%s is the value in position 3,1\n", $value3x1);
function get_value(array $position, \Saturio\DuckDB\Result\ResultSet $resultSet): string
{
$rowsInPreviousChunks = 0;
/** @var DataChunk $chunk */
foreach ($resultSet->chunks() as $chunk) {
$rowCount = $chunk->rowCount();
$columnCount = $chunk->columnCount();
if ($columnCount < $position['column']) {
throw new Exception('Column required is out of range');
}
if ($rowCount + $rowsInPreviousChunks < $position['row']) {
$rowsInPreviousChunks += $rowCount;
continue;
}
$vector = $chunk->getVector($position['column'], rows: $rowCount);
$dataGenerator = $vector->getDataGenerator();
for ($rowIndex = 0; $rowIndex < $rowCount; ++$rowIndex) {
$realRowIndex = $rowsInPreviousChunks + $rowIndex;
if ($realRowIndex === $position['row']) {
return $dataGenerator->current();
}
$dataGenerator->next();
}
}
throw new Exception('Row required is out of range');
}
Remote parquet file¶
Read a remote parquet file and query some aggregates.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
DuckDB::sql(
'SELECT "Reporting Year", avg("Gas Produced, MCF") as "AVG Gas Produced"
FROM "https://github.com/plotly/datasets/raw/refs/heads/master/oil-and-gas.parquet"
WHERE "Reporting Year" BETWEEN 1985 AND 1990
GROUP BY "Reporting Year";'
)->print();
Summarize remote csv¶
Use of SUMMARIZE TABLE over a remote CSV file.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
DuckDB::sql('SUMMARIZE TABLE "https://blobs.duckdb.org/data/Star_Trek-Season_1.csv";')->print();
On the fly schema¶
Create table schema and insert data from a JSON.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
$json = <<<'JSON'
{
"name": "Example event",
"description": "Optional description",
"startDate": 1413384452,
"endDate": 1413394452,
"type": {
"id": "00000000-0000-1100-0000-000000000011"
}
}
JSON;
$tmpJson = tempnam(sys_get_temp_dir(), 'duckdb_test_');
$handle = fopen($tmpJson, "w");
fwrite($handle, $json);
fclose($handle);
$duckDB = DuckDB::create('on-the-fly-schema.db');
$duckDB->query("CREATE TABLE events AS FROM read_json('{$tmpJson}');");
unlink($tmpJson);
$duckDB->query("SELECT * FROM events;")->print();
unset($duckDB);
unlink('on-the-fly-schema.db');
UI plugin¶
Start a local web UI using the DuckDB UI plugin. When the PHP process finish the connection is closed so the UI is also stopped. In this example we sleep for 120 seconds to illustrate the use.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
$duckDB = DuckDB::create('my-ui.db');
$duckDB->query("CALL start_ui();");
sleep(120);
Monitorize local network usage¶
Only works for macOS due to its reliance on macOS-specific APIs for monitoring network usage. It stores network information every 2 seconds and also opens the local UI to query data.
Can be stopped with Cmd+C.
<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;
const CREATE_TABLE_FOR_RAW_DATA_QUERY = "CREATE TABLE network_usage_raw AS FROM read_csv('/tmp/network-use.csv');";
const TRANSFORMATION_QUERY = 'SELECT bucket.col0 as bucket_uuid, current_date() + time as log_created_at, * EXCLUDE(time, col0) FROM VALUES(UUID()) as bucket, network_usage_raw';
const CREATE_TABLE_FOR_ANALYSIS = 'CREATE TABLE network_usage as (' . TRANSFORMATION_QUERY . ');';
const ADD_RAW_DATA_QUERY = "COPY network_usage_raw FROM '/tmp/network-use.csv' (HEADER);";
const ADD_DATA_FOR_ANALYSIS = 'INSERT INTO network_usage (' . TRANSFORMATION_QUERY . ');';
const TRUNCATE_RAW_DATA_QUERY = "TRUNCATE network_usage_raw;";
$duckDB = DuckDB::create('my-network-usage.db');
exec('nettop -PL1 > /tmp/network-use.csv');
$duckDB->query(CREATE_TABLE_FOR_RAW_DATA_QUERY);
unlink('/tmp/network-use.csv');
$duckDB->query(CREATE_TABLE_FOR_ANALYSIS);
$duckDB->query(TRUNCATE_RAW_DATA_QUERY);
$duckDB->query("CALL start_ui();");
while(true) {
exec('nettop -PL1 > /tmp/network-use.csv');
$duckDB->query(ADD_RAW_DATA_QUERY);
unlink('/tmp/network-use.csv');
$duckDB->query(ADD_DATA_FOR_ANALYSIS);
$duckDB->query(TRUNCATE_RAW_DATA_QUERY);
sleep(2);
}
If you wait for a while to store enough data, you can run some of this query using the UI interface.
SELECT
log_time,
ARG_MAX(app, bytes_in) AS top_app_bytes_in,
MAX(bytes_in) AS bytes_in,
ARG_MAX(app, bytes_out) AS top_app_bytes_out,
MAX(bytes_out) AS bytes_out,
FROM (
SELECT date_trunc('minute', log_created_at) AS log_time,
column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
GROUP BY ALL
)
GROUP BY log_time
ORDER BY log_time
SELECT date_trunc('minute', log_created_at) AS log_time,
column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
GROUP BY ALL
SELECT
time_range,
SUM(bytes_in) as total_bytes_in,
SUM(bytes_out) as total_bytes_out
FROM
(SELECT
time_bucket(INTERVAL '5 MINUTES', log_created_at) as time_range,
column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
GROUP BY ALL)
GROUP BY time_range
ORDER BY time_range
SELECT
max(last_update) as last_update,
time_range,
SUM(bytes_in) / 30 as bytes_per_second_download,
SUM(bytes_out) / 30 as bytes_per_second_upload
FROM
(SELECT
time_bucket(INTERVAL '30 SECONDS', log_created_at, INTERVAL '-30 SECONDS') as time_range,
max(log_created_at) as last_update,
column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
GROUP BY ALL)
GROUP BY time_range
ORDER BY time_range DESC
LIMIT 100
SELECT
SUM(bytes_in) / 10 as current_bytes_per_second_download,
SUM(bytes_out) / 10 as current_bytes_per_second_upload
FROM
(SELECT
column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
WHERE log_created_at > (
SELECT
max(log_created_at) - INTERVAL '10 SECONDS'
FROM
network_usage
)
GROUP BY app)