🦆 DuckLake for PHP

Run your own data lake with a SQL database and file/object storage

new DuckLake\Client(
    catalogUrl: 'postgres://user:pass@host:5432/dbname',
    storageUrl: 's3://my-bucket/'
);

Learn more

Note: DuckLake is not considered production-ready at the moment

Installation

Run:

composer require ankane/ducklake

Add scripts to composer.json to download the shared library:

    "scripts": {
        "post-install-cmd": "DuckLake\\Library::check",
        "post-update-cmd": "DuckLake\\Library::check"
    }
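
In context, the relevant part of composer.json might look like this (the version constraint is illustrative):

{
    "require": {
        "ankane/ducklake": "^0.1"
    },
    "scripts": {
        "post-install-cmd": "DuckLake\\Library::check",
        "post-update-cmd": "DuckLake\\Library::check"
    }
}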

And run:

composer install

Getting Started

Create a client - this one stores everything locally

$ducklake = new DuckLake\Client(
    catalogUrl: 'sqlite:///ducklake.sqlite',
    storageUrl: 'data_files/',
    createIfNotExists: true
);

Create a table

$ducklake->sql('CREATE TABLE events (id bigint, name text)');

Load data from a file

$ducklake->sql("COPY events FROM 'data.csv'");

Confirm a new Parquet file was added to the data lake

$ducklake->listFiles('events');

Query the data

$ducklake->sql('SELECT COUNT(*) FROM events')->toArray();
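
For instance, capture the rows and inspect them (the exact row shape depends on the query):

$rows = $ducklake->sql('SELECT COUNT(*) FROM events')->toArray();
print_r($rows);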

Catalog Database

Catalog information can be stored in:

  • Postgres: postgres://user:pass@host:5432/dbname
  • SQLite: sqlite:///path/to/dbname.sqlite
  • DuckDB: duckdb:///path/to/dbname.duckdb

Note: MySQL and MariaDB are not currently supported due to duckdb/ducklake#70 and duckdb/ducklake#210

There are two ways to set up the schema:

  1. Run this script
  2. Configure the client to do it:

new DuckLake\Client(createIfNotExists: true, ...);
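
For example, to let the client create the schema in a DuckDB catalog file (the paths are placeholders):

$ducklake = new DuckLake\Client(
    catalogUrl: 'duckdb:///path/to/dbname.duckdb',
    storageUrl: 'data_files/',
    createIfNotExists: true
);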

Data Storage

Data can be stored in:

  • Local files: data_files/
  • Amazon S3: s3://my-bucket/path/
  • Other providers: todo

Amazon S3

Credentials are detected in the standard AWS SDK locations
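
For example, credentials can be supplied through environment variables, one of the standard AWS SDK locations (values are placeholders):

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION=us-east-1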

IAM permissions

  • Read: s3:ListBucket, s3:GetObject
  • Write: s3:ListBucket, s3:PutObject
  • Maintenance: s3:ListBucket, s3:GetObject, s3:PutObject, s3:DeleteObject
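
As an illustration, an IAM policy for read access might look like this (the bucket name and path are placeholders):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::my-bucket"
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::my-bucket/path/*"
        }
    ]
}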

Operations

Create an empty table

$ducklake->sql('CREATE TABLE events (id bigint, name text)');

Or a table from a file

$ducklake->sql("CREATE TABLE events AS FROM 'data.csv'");

Load data from a file

$ducklake->sql("COPY events FROM 'data.csv'");

You can also load data directly from other data sources

$ducklake->attach('blog', 'postgres://localhost:5432/blog');
$ducklake->sql('INSERT INTO events SELECT * FROM blog.events');

Or register existing data files

$ducklake->addDataFiles('events', 'data.parquet');

Note: This transfers ownership to the data lake, so the file may be deleted as part of maintenance

Update data

$ducklake->sql('UPDATE events SET name = ? WHERE id = ?', ['Test', 1]);

Delete data

$ducklake->sql('DELETE FROM events WHERE id = ?', [1]);

Schema Changes

Update the schema

$ducklake->sql('ALTER TABLE events ADD COLUMN active BOOLEAN');

Set or remove a partitioning key

$ducklake->sql('ALTER TABLE events SET PARTITIONED BY (name)');
// or
$ducklake->sql('ALTER TABLE events RESET PARTITIONED BY');

Views

Create a view

$ducklake->sql('CREATE VIEW events_view AS SELECT * FROM events');

Drop a view

$ducklake->sql('DROP VIEW events_view');

Snapshots

Get snapshots

$ducklake->snapshots();

Query the data at a specific snapshot version or time

$ducklake->sql('SELECT * FROM events AT (VERSION => ?)', [3]);
// or
$ducklake->sql('SELECT * FROM events AT (TIMESTAMP => ?)', [new DateTime()]);

You can also specify a snapshot when creating the client

new DuckLake\Client(snapshotVersion: 3, ...);
// or
new DuckLake\Client(snapshotTime: new DateTime(), ...);

Maintenance

Merge files

$ducklake->mergeAdjacentFiles();

Expire snapshots

$ducklake->expireSnapshots(olderThan: new DateTime());

Clean up old files

$ducklake->cleanupOldFiles(olderThan: new DateTime());

Rewrite files with a certain percentage of deleted rows

$ducklake->rewriteDataFiles(deleteThreshold: 0.5);
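
These operations can be combined into a periodic maintenance job; a minimal sketch (the retention window is arbitrary):

$cutoff = new DateTime('-7 days');

// merge adjacent files and rewrite files with many deleted rows
$ducklake->mergeAdjacentFiles();
$ducklake->rewriteDataFiles(deleteThreshold: 0.5);

// expire old snapshots, then clean up old files
$ducklake->expireSnapshots(olderThan: $cutoff);
$ducklake->cleanupOldFiles(olderThan: $cutoff);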

Configuration

Get options

$ducklake->options();

Set an option globally

$ducklake->setOption('parquet_compression', 'zstd');

Or for a specific table

$ducklake->setOption('parquet_compression', 'zstd', tableName: 'events');

SQL Safety

Use parameterized queries when possible

$ducklake->sql('SELECT * FROM events WHERE id = ?', [1]);

For places that do not support parameters, use quote or quoteIdentifier

$quotedTable = $ducklake->quoteIdentifier('events');
$quotedFile = $ducklake->quote('path/to/data.csv');
$ducklake->sql("COPY $quotedTable FROM $quotedFile");

Reference

Get table info

$ducklake->tableInfo();

Drop a table

$ducklake->dropTable('events');
// or
$ducklake->dropTable('events', ifExists: true);

List files

$ducklake->listFiles('events');

List files at a specific snapshot version or time

$ducklake->listFiles('events', snapshotVersion: 3);
// or
$ducklake->listFiles('events', snapshotTime: new DateTime());

History

View the changelog

Contributing

Everyone is encouraged to help improve this project, whether by reporting bugs, improving documentation, or submitting pull requests.

To get started with development:

git clone https://github.com/ankane/ducklake-php.git
cd ducklake-php
composer install

# Postgres
createdb ducklake_php_test
createdb ducklake_php_test2
CATALOG=postgres composer test

# MySQL and MariaDB
mysqladmin create ducklake_php_test
mysqladmin create ducklake_php_test2
CATALOG=mysql composer test

# SQLite
CATALOG=sqlite composer test

# DuckDB
CATALOG=duckdb composer test