PHP ETL (Extract / Transform / Load) implementation. Almost no dependency.

3.0-alpha.2 2019-04-08 10:49 UTC

README

Latest Stable Version License Build Status Coverage Status Quality Score Total Downloads

Okay, so you heard about the Extract / Transform / Load pattern and you're looking for a PHP library to do the stuff.

Alright, let's go!

Installation

composer require bentools/etl:^3.0@alpha

Warning: version 3.0 is a complete rewrite and a involves important BC breaks. Don't upgrade from ^2.0 unless you know what you're doing!

Usage

To sum up, you will apply transformations onto an iterable of any things in order to load them in some place. Sometimes your iterable is ready to go, sometimes you just don't need to perform transformations, but anyway you need to load that data somewhere.

Let's start with a really simple example:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Loader\JsonFileLoader;

$data = [
    'foo',
    'bar',
];

$etl = EtlBuilder::init()
    ->loadInto(JsonFileLoader::toFile(__DIR__.'/data.json'))
    ->createEtl();
$etl->process($data);

Basically you just loaded the string ["foo","bar"] into data.json. Yay!

Now let's apply a basic uppercase transformation:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Loader\JsonFileLoader;

$data = [
    'foo',
    'bar',
];

$etl = EtlBuilder::init()
    ->transformWith(new CallableTransformer('strtoupper'))
    ->loadInto(JsonFileLoader::factory())
    ->createEtl();
$etl->process($data, __DIR__.'/data.json'); // You can also set the output file when processing :-)

Didn't you just write the string ["FOO","BAR"] into data.json ? Yes, you did!

Okay, but what if your source data is not an iterable (yet)? It can be a CSV file or a CSV string, for instance. Here's another example:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Extractor\CsvExtractor;
use BenTools\ETL\Loader\JsonFileLoader;

$data = <<<CSV
country_code,country_name,president
US,USA,"Donald Trump"
RU,Russia,"Vladimir Putin"
CSV;

$etl = EtlBuilder::init()
    ->extractFrom(new CsvExtractor())
    ->loadInto(JsonFileLoader::factory(['json_options' => \JSON_PRETTY_PRINT]))
    ->createEtl();
$etl->process($data, __DIR__.'/data.json');

As you guessed, the following content was just written into presidents.json:

[
    {
        "country_code": "US",
        "country_name": "USA",
        "president": "Donald Trump"
    },
    {
        "country_code": "RU",
        "country_name": "Russia",
        "president": "Vladimir Putin"
    }
]

We provide helpful extractors and loaders to manipulate JSON, CSV, text, and you'll also find a DoctrineORMLoader for when your transformer yields Doctrine entities.

Because yes, a transformer must return a \Generator. Why? Because a single extracted item can lead to several output items. Let's take a more sophisticated example:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Extractor\JsonExtractor;

$pdo = new \PDO('mysql:host=localhost;dbname=test');
$input = __DIR__.'/presidents.json';

$etl = EtlBuilder::init()
    ->extractFrom(new JsonExtractor())
    ->transformWith(
        function ($item) use ($pdo) {
            $stmt = $pdo->prepare('SELECT country_code FROM countries WHERE country_code = ?');
            $stmt->bindValue(1, $item['country_code'], \PDO::PARAM_STR);
            $stmt->execute();
            if (0 === $stmt->rowCount()) {
                yield ['INSERT INTO countries (country_code, country_name) VALUES (?, ?)', [$item['country_code'], $item['country_name']]];
            }

            yield ['REPLACE INTO presidents (country_code, president_name) VALUES (?, ?)', [$item['country_code'], $item['president']]];

        }
    )
    ->loadInto(
        $loader = function (\Generator $queries) use ($pdo) {
            foreach ($queries as $query) {
                list($sql, $params) = $query;
                $stmt = $pdo->prepare($sql);
                foreach ($params as $i => $value) {
                    $stmt->bindValue($i + 1, $value);
                }
                $stmt->execute();
            }
        }
    )
    ->createEtl();

$etl->process(__DIR__.'/presidents.json'); // The JsonExtractor will convert that file to a PHP array

As you can see, from a single item, we loaded up to 2 queries.

Your extractors, transformers and loaders can implement ExtractorInterface, TransformerInterface or LoaderInterface as well as being simple callables.

Skipping items

Each extractor / transformer / loader callback gets the current Etl object injected in their arguments.

This allows you to ask the ETL to skip an item, or even to stop the whole process:

use BenTools\ETL\Etl;
use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Transformer\CallableTransformer;

$fruits = [
    'apple',
    'banana',
    'strawberry',
    'pineapple',
    'pear',
];


$storage = [];
$etl = EtlBuilder::init()
    ->transformWith(new CallableTransformer('strtoupper'))
    ->loadInto(
        function ($generated, $key, Etl $etl) use (&$storage) {
            foreach ($generated as $fruit) {
                if ('BANANA' === $fruit) {
                    $etl->skipCurrentItem();
                    break;
                }
                if ('PINEAPPLE' === $fruit) {
                    $etl->stopProcessing();
                    break;
                }
                $storage[] = $fruit;
            }
        })
    ->createEtl();

$etl->process($fruits);

var_dump($storage); // ['APPLE', 'STRAWBERRY']

Events

Now you're wondering how you can hook on the ETL lifecycle, to log things, handle exceptions, ... This library ships with a built-in Event Dispatcher that you can leverage when:

  • The ETL starts
  • An item has been extracted
  • The extraction failed
  • An item has been transformed
  • Transformation failed
  • Loader is initialized (1st item is about to be loaded)
  • An item has been loaded
  • Loading failed
  • An item has been skipped
  • The ETL was stopped
  • A flush operation was completed
  • A rollback operation was completed
  • The ETL completed the whole process.

The ItemEvents (on extract, transform, load) will allow you to mark the current item to be skipped, or even handle runtime exceptions. Let's take another example:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\EventDispatcher\Event\ItemExceptionEvent;

$fruits = [
    'apple',
    new \RuntimeException('Is tomato a fruit?'),
    'banana',
];


$storage = [];
$etl = EtlBuilder::init()
    ->transformWith(
        function ($item, $key) {
            if ($item instanceof \Exception) {
                throw $item;
            }

            yield $key => $item;
        })
    ->loadInto(
        function (iterable $transformed) use (&$storage) {
            foreach ($transformed as $fruit) {
                $storage[] = $fruit;
            }
        })
    ->onTransformException(
        function (ItemExceptionEvent $event) {
            echo $event->getException()->getMessage(); // Is tomato a fruit?
            $event->ignoreException();
        })
    ->createEtl();

$etl->process($fruits);

var_dump($storage); // ['apple', 'banana']

Here, we intentionnally threw an exception during the transform operation. But thanks to the event dispatcher, we could tell the ETL this exception can be safely ignored and it can pursue the rest of the process.

You can attach as many event listeners as you wish, and sort them by priority.

Recipes

A recipe is an ETL pattern that can be reused through different tasks. If you want to log everything that goes through an ETL for example, use our built-in Logger recipe:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\Recipe\LoggerRecipe;

$etl = EtlBuilder::init()
    ->useRecipe(new LoggerRecipe($logger))
    ->createEtl();

You can also create your own recipes:

use BenTools\ETL\EtlBuilder;
use BenTools\ETL\EventDispatcher\Event\ItemEvent;
use BenTools\ETL\Extractor\JsonExtractor;
use BenTools\ETL\Loader\CsvFileLoader;
use BenTools\ETL\Recipe\Recipe;


class JSONToCSVRecipe extends Recipe
{
    /**
     * @inheritDoc
     */
    public function updateBuilder(EtlBuilder $builder): EtlBuilder
    {
        return $builder
            ->extractFrom(new JsonExtractor())
            ->loadInto($loader = CsvFileLoader::factory(['delimiter' => ';']))
            ->onLoaderInit(
                function (ItemEvent $event) use ($loader) {
                    $loader::factory(['keys' => array_keys($event->getItem())], $loader);
                })
            ;
    }

}

$builder = EtlBuilder::init()->useRecipe(new JSONToCSVRecipe());
$etl = $builder->createEtl();
$etl->process(__DIR__.'/presidents.json', __DIR__.'/presidents.csv');

The above example will result in presidents.csv containing:

country_code;country_name;president
US;USA;"Donald Trump"
RU;Russia;"Vladimir Putin"

To sum up, a recipe is a kind of an ETLBuilder factory, but keep in mind that a recipe will only add event listeners to the existing builder but can also replace the builder's extractor, transformer and/or loader.

Tests

./vendor/bin/phpunit

License

MIT