Examples

Get value by row and column

This is just an example to show how you can go into low level handling directly datachunks and vectors from the result.

DuckDB C API docs can be useful to understand those concepts.

Performance is not the best here. You should select only the values that you need instead of using a SELECT * query to actually retrieve only one from the result, but the example works to show how is data stored internally.

<?php
require __DIR__ . '/../vendor/autoload.php';

use Saturio\DuckDB\DuckDB;
use Saturio\DuckDB\Result\DataChunk;

$duckDB = DuckDB::create();

$duckDB->query("CREATE TABLE example_4 (column1 VARCHAR, column2 VARCHAR, column3 VARCHAR);");
$duckDB->query("INSERT INTO example_4 (column1, column2, column3) VALUES ('0x0', '0x1', '0x2'), ('1x0', '1x1', '1x2'), ('2x0', '2x1', '2x2'), ('3x0', '3x1', '3x2');");

$result = $duckDB->query("SELECT * FROM example_4");

$value3x1 = get_value(['row' => 3, 'column' => 1], $result);

printf("%s is the value in position 3,1\n", $value3x1);

function get_value(array $position, \Saturio\DuckDB\Result\ResultSet $resultSet): string
{
    $rowsInPreviousChunks = 0;
    /** @var DataChunk $chunk */
    foreach ($resultSet->chunks() as $chunk) {
        $rowCount = $chunk->rowCount();
        $columnCount = $chunk->columnCount();

        if ($columnCount < $position['column']) {
            throw new Exception('Column required is out of range');
        }

        if ($rowCount + $rowsInPreviousChunks < $position['row']) {
            $rowsInPreviousChunks += $rowCount;
            continue;
        }

        $vector = $chunk->getVector($position['column'], rows: $rowCount);
        $dataGenerator = $vector->getDataGenerator();

        for ($rowIndex = 0; $rowIndex < $rowCount; ++$rowIndex) {
            $realRowIndex = $rowsInPreviousChunks + $rowIndex;
            if ($realRowIndex === $position['row']) {
                return $dataGenerator->current();
            }
            $dataGenerator->next();
        }
    }
    throw new Exception('Row required is out of range');
}

Remote parquet file

Read a remote parquet file and query some aggregates.

<?php


require __DIR__ . '/../vendor/autoload.php';

use Saturio\DuckDB\DuckDB;

DuckDB::sql(
    'SELECT "Reporting Year", avg("Gas Produced, MCF") as "AVG Gas Produced" 
                FROM "https://github.com/plotly/datasets/raw/refs/heads/master/oil-and-gas.parquet" 
                WHERE "Reporting Year" BETWEEN 1985 AND 1990
                GROUP BY "Reporting Year";'
)->print();

Summarize remote csv

Use of SUMMARIZE TABLE over a remote CSV file.

<?php


require __DIR__ . '/../vendor/autoload.php';

use Saturio\DuckDB\DuckDB;

DuckDB::sql('SUMMARIZE TABLE "https://blobs.duckdb.org/data/Star_Trek-Season_1.csv";')->print();

On the fly schema

Create table schema and insert data from a JSON.

<?php

require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;

$json = <<<'JSON'
{
  "name": "Example event",
  "description": "Optional description",
  "startDate": 1413384452,
  "endDate": 1413394452,
  "type": {
    "id": "00000000-0000-1100-0000-000000000011"
  }
}
JSON;

$tmpJson = tempnam(sys_get_temp_dir(), 'duckdb_test_');
$handle = fopen($tmpJson, "w");
fwrite($handle, $json);
fclose($handle);

$duckDB = DuckDB::create('on-the-fly-schema.db');
$duckDB->query("CREATE TABLE events AS FROM read_json('{$tmpJson}');");
unlink($tmpJson);
$duckDB->query("SELECT * FROM events;")->print();
unset($duckDB);
unlink('on-the-fly-schema.db');

UI plugin

Start a local web UI using the DuckDB UI plugin. When the PHP process finish the connection is closed so the UI is also stopped. In this example we sleep for 120 seconds to illustrate the use.

<?php

require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;

$duckDB = DuckDB::create('my-ui.db');
$duckDB->query("CALL start_ui();");
sleep(120);

Monitorize local network usage

Only works for macOS due to its reliance on macOS-specific APIs for monitoring network usage. It stores network information every 2 seconds and also opens the local UI to query data.

Can be stopped with Cmd+C.

<?php
require __DIR__ . '/../vendor/autoload.php';
use Saturio\DuckDB\DuckDB;

const CREATE_TABLE_FOR_RAW_DATA_QUERY = "CREATE TABLE network_usage_raw AS FROM read_csv('/tmp/network-use.csv');";
const TRANSFORMATION_QUERY = 'SELECT bucket.col0 as bucket_uuid, current_date() + time as log_created_at, * EXCLUDE(time, col0) FROM VALUES(UUID()) as bucket, network_usage_raw';
const CREATE_TABLE_FOR_ANALYSIS = 'CREATE TABLE network_usage as (' . TRANSFORMATION_QUERY . ');';
const ADD_RAW_DATA_QUERY = "COPY network_usage_raw FROM '/tmp/network-use.csv' (HEADER);";
const ADD_DATA_FOR_ANALYSIS = 'INSERT INTO network_usage (' . TRANSFORMATION_QUERY . ');';
const TRUNCATE_RAW_DATA_QUERY = "TRUNCATE network_usage_raw;";

$duckDB = DuckDB::create('my-network-usage.db');

exec('nettop -PL1 > /tmp/network-use.csv');
$duckDB->query(CREATE_TABLE_FOR_RAW_DATA_QUERY);
unlink('/tmp/network-use.csv');
$duckDB->query(CREATE_TABLE_FOR_ANALYSIS);
$duckDB->query(TRUNCATE_RAW_DATA_QUERY);
$duckDB->query("CALL start_ui();");

while(true) {
    exec('nettop -PL1 > /tmp/network-use.csv');
    $duckDB->query(ADD_RAW_DATA_QUERY);
    unlink('/tmp/network-use.csv');
    $duckDB->query(ADD_DATA_FOR_ANALYSIS);
    $duckDB->query(TRUNCATE_RAW_DATA_QUERY);
    sleep(2);
}

If you wait for a while to store enough data, you can run some of this query using the UI interface.

SELECT 
    log_time,
    ARG_MAX(app, bytes_in) AS top_app_bytes_in,
    MAX(bytes_in) AS bytes_in,
    ARG_MAX(app, bytes_out) AS top_app_bytes_out,
    MAX(bytes_out) AS bytes_out,
FROM (
    SELECT date_trunc('minute', log_created_at) AS log_time,
      column01 as app,
    max(bytes_in) - min(bytes_in) as bytes_in,
    max(bytes_out) - min(bytes_out) as bytes_out
    FROM network_usage
    GROUP BY ALL
) 
GROUP BY log_time
ORDER BY log_time
SELECT date_trunc('minute', log_created_at) AS log_time,
  column01 as app,
max(bytes_in) - min(bytes_in) as bytes_in,
max(bytes_out) - min(bytes_out) as bytes_out
FROM network_usage
GROUP BY ALL
SELECT
  time_range,
  SUM(bytes_in) as total_bytes_in,
  SUM(bytes_out) as total_bytes_out
  FROM
(SELECT 
  time_bucket(INTERVAL '5 MINUTES', log_created_at) as time_range,
  column01 as app,
  max(bytes_in) - min(bytes_in) as bytes_in,
  max(bytes_out) - min(bytes_out) as bytes_out
  FROM network_usage
GROUP BY ALL)
  GROUP BY time_range
ORDER BY time_range
SELECT
  max(last_update) as last_update,
  time_range,
  SUM(bytes_in) / 30 as bytes_per_second_download,
  SUM(bytes_out) / 30 as bytes_per_second_upload
  FROM
(SELECT 
  time_bucket(INTERVAL '30 SECONDS', log_created_at, INTERVAL '-30 SECONDS') as time_range,
  max(log_created_at) as last_update,
  column01 as app,
  max(bytes_in) - min(bytes_in) as bytes_in,
  max(bytes_out) - min(bytes_out) as bytes_out
  FROM network_usage
GROUP BY ALL)
  GROUP BY time_range
ORDER BY time_range DESC
LIMIT 100
SELECT
  SUM(bytes_in) / 10 as current_bytes_per_second_download,
  SUM(bytes_out) / 10 as current_bytes_per_second_upload
FROM
(SELECT 
  column01 as app,
  max(bytes_in) - min(bytes_in) as bytes_in,
  max(bytes_out) - min(bytes_out) as bytes_out
  FROM network_usage
  WHERE log_created_at > (
    SELECT 
      max(log_created_at) - INTERVAL '10 SECONDS' 
    FROM 
      network_usage
  )
GROUP BY app)