survos/pixie-bundle

Symfony Bundle to easily allow a SQLite Key Value lookup. Optionally uses zip and bunny utilities for sync

Fund package maintenance!
kbond

Installs: 369

Dependents: 0

Suggesters: 0

Security: 0

Stars: 0

Watchers: 3

Forks: 0

Open Issues: 3

Type:symfony-bundle

1.5.378 2024-11-04 17:03 UTC

README

A Symfony bundle that leverages Sqlite to create an indexed structured datastore without an ORM. In particular, it is a solution for analyzing Excel spreadsheets and nest JSON data.

Inspired by https://gist.github.com/sbrl/c3bfbbbb3d1419332e9ece1bac8bb71c and https://dexie.org/ and the Symfony PDO Cache component.

Example

bin/console pixie:import <configCode> <subCode>

bin/console pixie:init walters
# reads walters.config, see that it's github, downloads the repo to /data/git/walters-api
bin/console pixie:import walters 
# reads files from /data/git/walters-api to pixie/git/walters.db.pixie and walters-translation, 
bin/console pixie:trans walters
# reads files from pixie/git/walters.db.pixie and walters-translation

bin/console pixie:import md
# reads walters.config, see that it's system, run init:md and downloads files to data/system/md (raw, etc)

bin/console pixie:import musdig/74
# reads md.config, see that it's musdig, downloads the repo to /data/musdig/74/74.db and 74-translation.db


## Storage

This bundle integrates with survos/bunny-bundle to store directories as zip files.  That way, files processed on one machine can easily move to another.

There are 4 directories that are used within the museado project, but they all do the same thing, zip and upload, or download and unzip.

```bash
bin/console bunny:download data/larco/raw.zip --unzip data/larco/raw

unzip larco-raw.zip -d data/larco/raw/


# after some processing...
bin/console bunny:upload data/larco/json/ --zip data/larco/json.zip

bin/console bunny:download data/glam.zip 
# downloads to data/glam.zip

bin/console bunny:download data/glam.zip --unzip
# downloads to data/glam.zip unless it exists, unzips files to data/glam

bin/console bunny:download data/glam.zip --unzip --force
# downloads to data/glam.zip even if it exists, unzips files to data/glam

bin/console bunny:download data/flickr.pixie.db 
# downloads to directory data 

bin/console bunny:download data/flickr.pixie.db pixie/
# downloads to pixie/flickr.pixie.db

bin/console bunny:download data/flickr.pixie.db x.db
# downloads to x.db

bin/console bunny:download data/flickr.pixie.db pixie/x.db
# downloads to pixie/x.db

bin/console bunny:download data/glam.zip --unzip --force
# downloads to data/glam.zip even if it exists, unzips files to data/glam

bin/console bunny:download data/glam.zip zip/my-filename.zip
# downloads to zip/my-filename.zip

bin/console bunny:download data/glam.zip --unzip ../translation
# downloads to data/glam.zip, unzips files to ../translation (above the project root).

bin/console bunny:download data/adh.zip --unzip data/adh --cleanup

At its core, the idea is to store strings or unstructured data accessible by a key or a filter.

Initially, it was just a string lookup, which could be a JSON string, for example looking up a wikidata object by its QID, or a movie from a csv file by its imdb_id.

You can customize the import process by adding a .conf file with directives that facilitate renaming fields and munging data. You can also listen for events during the import and export process.

Pixie can (will, eventually) use json schema files to defining the fields. https://json-schema.org/ using LiFormBundle

The indexes can be defined via the command line, a listener, or a .conf file. There's a succinct format that is a comma-separated string, and a detailed format with is a hash with the index details as keys.

There is (will be) an API endpoint if api-platform is installed.

Integration with survos/translation-bundle

To import a collection of csv files in a directory to a Pixie database, we'll need to do the following

  • each csv file becomes a table, which we often rename
  • rename the columns
  • map data to specific columns, by default simply extract the json
  • allow arrays and objects to be property encoded, e.g. genre| or tags,
  • add columns that aren't present in the csv, e.g. status or createdAt
  • tweak the data, including casting to type and applying regex rules
  • @todo: map related tables, e.g. artwork.artist -> artist table
  • create indexes for filter queries
  • meili index the pixie file

Pixie-specific naming

When creating the table, we add a json snippet after the property creation to indicate some attributes. To support related tables and translation, there are some specific columns names we use that are (maybe?) prefixed by _ or __:

Or should we do this?

  • _key: the primary key of a table that can be used by a related table
  • _attributes: the JSON blob of everything that's not a property
  • _label: the label that is used when a relation is displayed in another table. Translatable. Definable
  • _description: eventually we'll support defining translatable properties but for now we just support _description

Events / Data Adjustments

During the import, columns can be renamed with the rules key, and data can be tweaked with the formatters key.

Additionally, a RowEvent is called for before and after every row is inserted into the StorageBox, as well as before and after the import process. This allows for creating new fields or tables or cleaning up data that needs services.

    #[AsEventListener(event: RowEvent::class)]
    public function onRowEvent(RowEvent $event): void
    {
        static $mun = []; // the array of municipios
        $row = $event->row;
        switch ($event->type) {
            case $event::POST_LOAD:
                ksort($mun);
                // save this to a new table or JSON file
                break;
            case $event::LOAD:
//                    https://www.facebook.com/MuseoGuadalupePosada?fref=ts -> MuseoGuadalupePosada
                if ($facebook = $row['facebook']??null) {
                    $parseUrl = parse_url($facebook);
                    $row['facebook'] = trim($parseUrl['path'],'/');
                }
                $mun[$row['municipio_id']] = $row['municipio'];
        }
    }

Workflows

Manipulating data that involves web services can be done with workflows, so events can be dispatched async. For example, an 'enhance' event can use the facebook id above to get the latest events, or the latest instagram post.

bin/console survos:make:workflow MexMus new,enhanced enhance --entity-class=Survos\\PixieBundle\\Model\\Item

and then

Setup

All pixie db files have an associated configuration file that describes the mapping and underlying data structure. Generally it shares the same base filename.

curl 
bin/console pixie:init movies --dir=./data/imdb 
# ./pixie/movies.yaml created with 4 tables, configure it and run bin/console pixie:import movies --limit 10  
cat > pixie.movies << 'END'
(full config)
END
bin/console pixie:import movies --limit 10  
symfony open:local --path="/pixie/movies"

Loading a Pixie

As the data directories (where the csv/json files are located) can be anywhere, there are a few way to indicate them.

# explicitly 
bin/console pixie:import /path/to/csv/directory -c directory.yaml
# use source.dir key, absolute, or relative to bundle data_root).  If !exists, use the bundle data_root and append 'code'.
bin/console pixie:import -c code.yaml

There are 3 way to load data into pixie.

"Special" Tables

Since pixie databases offer a convenient way to work with Excel data, there is a special table for handling drawings (embedded images). It is created with bin/console grid:excel-to-csv (in the grid-group bundle? In museado?)

Excel stores embedded images as "Drawings".

Translations are also stored as pixie tables, and have their own section.

Examples

  • Movies (imdb)
  • Schools

CSV Datasets

Reading an Existing Pixie

$pixie = new Pixie::Reader('school.pixie');
$pixie->select('mo')
foreach ($pixie->)
wget https://dummyjson.com/products products.json
// inject the service

$id = 'tt123';
$kv = $pixieService->getStorageBox('dummy.pixie', [
    'products' => 'sku,brand,category' // first key is text primary key by default
]);

$kv->select('products'); // so that we don't have to pass it each time.

$kv->set($data); // because they key is in the data.
assert($kv->get($id));
assert($kv->has($id));
assert(json_decode($kv->get($id)) == $data);
//

The conf file simplifies some of the php calls, but isn't 100% necessary

Suppose we want to filter by category. First, we need to add an index, dexie-style, to the table.

$kv = $pixieService->getStorageBox('app.db', [
    'movies' => 'imdb_id, year|integer, category|string' // first key is text primary key by default
]);

$rows = $kv->where("year < 2000 and category='drama'")->iterate();
// without the index
$rows = $kv->where("json_extract(value, '$.year') < 2000")->iterate();

Often CSV files and other data sources have key names that aren't really compatible with sqlite column names. To map the old names to the new ones, you can create a sequence of regex rules for the column names.

From MOMA Artists.json

{
  "ConstituentID": 1,
  "DisplayName": "Robert Arneson",
  "ArtistBio": "American, 1930–1992",
  "Nationality": "American",
  "Gender": "male",
  "BeginDate": 1930,
  "EndDate": 1992,
  "Wiki QID": null
}
$kv->map([
   '/ConstituentID/' => 'id',
   '/BeginDate/' => 'birthYear',
   '/EndDate/' => 'deathYear'
], [
    'artists'
]);

All fields will be converted to camel_case, even without regex rules.

Note that the JSON is NOT compressed, so the keys are duplicated. So the sqlite file is larger than the CSV by (rowCount * headerRowSize).

Usage

composer require survos/pixie-bundle

Initialize StorageBox with the sqlite database name plus any new tables to be created.

Existing tables can be used, but automatic table creation is disabled.

$kvDb = new StorageBox('translation.db', ['es', 'en'], 'en');
$key = md5('dog');
$kvDb->set($key, 'perro', 'es');
$kvDb->set($key, 'dog'); // defaults to en
$trans = $kvDb->get($key); // dog 
$trans = $kvDb->get($key, 'es'); // perro 
// trans is perro

// cache-like with callcack
$trans = $kvDb->get($key, callback: fn($item) => $this->translator->trans('dog'));

// keys are automatically slugified
$trans = $kvDb->get('My name is', fn($item) => $this->translator->trans($item->key));

// keys are automatically slugified
$trans = $kvDb->get('My name is',   ) => $this->translator->trans($item->key));

Multi-lingual StorageBox

Specifically for kv lookups of text fields that may have translations

$kvDb = new MLSB('property.db', ['label','description']);

$key = 'Q31';
$kvDb->set($key, 'label', 'Belgium', 'en-gb');
$kvDb->set($key, 'description', 'constitutional monarchy in Western Europe', 'en-gb');

$transArray = $kvDb->getFieldsByLocale($key, 'en-gb'); 
// ['label' => 'Belgium'...]

$label = $kvDb->getField($key, 'label', 'en-gb'); 
// 'Belgium'

$transArray = $kvDb->getFields($key); 
// ['en-gb' => ['label' => 'Belgium', 'description'  => ...]]

// ditto for set, needs transactions

Credits

Debug icon from https://www.svgrepo.com/svg/11690/database

Backups

For some reason, the .wal file stays open. We need a utility that does the following

  • execute the query "vacuum into 'xx.backup.db'"
  • zip the backup
  • move the backup to s3 or bunny

To restore, do the opposite

  • Fetch the zipped backup
  • optionally rename the existing database
  • remove .wal, etc.
  • unzip

S3 example

aws s3 sync . s3://voxitour/data/pixie --acl public-read --exclude="" --include="met.zip" aws s3 sync s3://voxitour/data/pixie . --acl public-read --exclude="" --include="met.zip"

aws s3 sync . s3://voxitour/data/pixie --acl public-read --exclude="" --include="x.pixie.db" aws s3 sync s3://voxitour/data/pixie . --acl public-read --exclude="" --include="x.pixie.db"

aws s3 sync s3://voxitour/data/pixie . --acl public-read --exclude="" --include=".zip"

Notes

This may come in handy someday