rkr / data-diff
A handy tool for comparing structured data quickly in a key-value manner
Installs: 1 307
Dependents: 0
Suggesters: 0
Security: 0
Stars: 4
Watchers: 2
Forks: 0
Open Issues: 0
Requires
- php: >= 8.1
- ext-json: *
- ext-mbstring: *
- ext-pdo: *
- ext-pdo_sqlite: *
Requires (Dev)
- phpstan/phpstan: >= 0.1
- phpunit/phpunit: ^10.0
Suggests
- rkr/data-diff-helpers: Some Helpers for rkr/data-diff
- dev-master
- 0.3
- 0.2.3
- 0.2.2
- 0.2.1
- 0.2
- 0.1.18.1
- 0.1.18
- 0.1.17.3
- 0.1.17.2
- 0.1.17.1
- 0.1.17
- 0.1.16
- 0.1.15
- 0.1.14
- 0.1.13
- 0.1.12.3
- 0.1.12.2
- 0.1.12.1
- 0.1.12
- 0.1.11.7
- 0.1.11.6
- 0.1.11.5
- 0.1.11.4
- 0.1.11.3
- 0.1.11.2
- 0.1.11.1
- 0.1.11
- 0.1.10.4
- 0.1.10.3
- 0.1.10.2
- 0.1.10.1
- 0.1.10
- 0.1.9
- 0.1.8.2
- 0.1.8.1
- 0.1.8
- 0.1.7
- 0.1.6.1
- 0.1.6
- 0.1.5.1
- 0.1.5
- 0.1.4.1
- 0.1.4
- 0.1.3.3
- 0.1.3.2
- 0.1.3.1
- 0.1.3
- 0.1.2
- 0.1.1
- 0.1
This package is auto-updated.
Last update: 2025-04-01 15:01:24 UTC
README
A handy tool for comparing structured data quickly in a key-value manner
composer
Support for PHPStan
Add the following to your phpstan.neon
file:
includes: - vendor/rkr/data-diff/extension.neon
WTF
This component is useful if you have a large amount of structured data to import into a local database and you want to identify changes without overwriting everything on each run. Instead, you can determine what has actually changed and take appropriate actions.
Usage
Initially, you have two two-dimensional data lists that you want to compare. Typically, some columns in such a data list indicate the actual differences in terms of new and missing rows. Other columns may indicate changes in existing rows. Additionally, some columns may not trigger any actions but their data could be necessary for subsequent processing.
For example, consider having some article metadata from an external data source that you would like to import into a local database. The external data should be imported into the local database, and you want to take action whenever a dataset is added, removed, or changed (e.g., logging).
External Data:
name;reference;price;stock;last-change
Some Notebook;B0001;1499.90;1254;2016-04-01T10:00:00+02:00
A Hairdryer;C0001;49.95;66;2016-04-01T10:00:00+02:00
A Pencil;D0001;2.9499;2481;2016-04-01T10:00:00+02:00
Local data:
name;reference;price;stock
A shiny Smartphone;A0001;519.99;213
A Hairdryer;C0001;49.95;12
A Pencil;D0001;2.95;2481
Each list contains three data rows. Both lists have a row that is not present in the other list, and the only common rows (A Hairdryer;C0001
and A Pencil;D0001
) exhibit differences in the price
and stock
columns, while the name
column remains identical. The current-datetime
column should not be compared, but it should be present in case of an insertion or update. The primary objective is to synchronize all changes from the external data source to the local database. Although it might be important to track changes in the current-datetime
column while other columns remain unchanged, this example demonstrates how to handle a scenario where this is not a priority.
The comparison result is derived by comparing two distinct key-value lists. The comparison involves three methods to identify added keys, missing keys, and changed data where keys are equal. To achieve this, it is essential to determine whether a particular row was added, removed, or changed. This task can be complex and depends on the specific data. In this example, certain rules are established, which may vary in different scenarios.
In this example, only the reference
column is used to determine if a row is new or has been removed. For instance, the local database contains a reference to an article A0001
that is not present in the external data, necessitating its removal from the local data. Conversely, B0001
is absent in the local data and should be added. The Hairdryer has a different stock, and the Pencil has a slightly different price. Since prices are stored locally with a decimal precision of two, the two pencil prices are considered equal, and the comparison should not report a change for the row D0001
.
First, it is necessary to define what constitutes a key and a value for the Storage
to understand the key-value list schema. The data is already in the correct format, so no transformation is required.
So, let's give some meaning to the columns:
- The
reference
column indicates whether a particular row is present or not. This serves as the unique identifier for each row. A row may have more than one identifier column (such asreference
andenvironment-id
), but in this case, there is only one identifier. - The
name
column should only be considered when a row is already present in the other list. - The
price
column should only be considered when a row is already present in the other list. - The
stock
column should only be considered when a row is already present in the other list. - The
last-change
column should not be checked at all.
Therefore, when constructing a key-value array for comparison, the key part is composed of the reference
column, and the value part is represented by the name
, price
, and stock
columns.
The key-value array of the first list would then appear as follows:
'B0001' => ['Some Notebook', 1499.90, 1254]
'C0001' => ['A Hairdryer', 49.95, 66]
'D0001' => ['A Pencil', 2.9499, 2481]
The key-value-array of the second-list would look like this:
'A0001' => ['A shiny Smartphone', 519.99, 213]
'C0001' => ['A Hairdryer', 49.95, 12]
'D0001' => ['A Pencil', 2.95, 2481]
Now, let's compare those arrays in three distinct ways:
What rows are present in the first list, but not in the second:
'B0001' => ['Some Notebook', 1499.90, 1254]
What rows are present in the second list, but not in the first:
'A0001' => ['A shiny Smartphone', 519.99, 213]
What rows are present in the first list, but have changed values compared to the second list?
'C0001' => ['A Hairdryer', 49.95, 66]
'D0001' => ['A Pencil', 2.9499, 2481]
You now have all the necessary information to identify the differences between the two lists.
Consider a special case: the pencil has a price of 2.9499
in the first list. However, since we only compare prices with a decimal precision of two, the prices are effectively identical, as the computed price for D0001
is 2.95
in both cases. This is where the Schema
component becomes relevant.
When defining a MemoryDiffStorage
, you specify two schemas: one for the key part and one for the value part:
<?php use DataDiff\MemoryDiffStorageBuilderFactory; use DataDiff\MemoryDiffStorage; $factory = new MemoryDiffStorageBuilderFactory(); $ds = $factory->createBuilder() ->addStringKey('reference') ->addStringValue('name') ->addMoneyValue('price') ->addIntValue('stock') ->build();
A MemoryDiffStorage
consists of two stores: StoreA
and StoreB
. You can insert as many rows with as many columns into each store as you want, provided the rows contain at least the columns defined in the schema. The columns must have appropriate names since these names are not translated automatically. However, you can specify a translation when adding rows using the second parameter of addRow
and addRows
. This means that if your columns have different names in the database and the other source, you must normalize those keys before inserting the data into each store.
Here is a example:
<?php use DataDiff\MemoryDiffStorageBuilderFactory; use DataDiff\MemoryDiffStorage; require 'vendor/autoload.php'; $factory = new MemoryDiffStorageBuilderFactory(); $ds = $factory->createBuilder() ->addStringKey('reference') ->addStringValue('name') ->addMoneyValue('price') ->addIntValue('stock') ->build(); $ds->storeA()->addRow(['name' => 'Some Notebook', 'reference' => 'B0001', 'price' => '1499.90', 'stock' => '1254', 'last-change' => '2016-04-01T10:00:00+02:00']); $ds->storeA()->addRow(['name' => 'A Hairdryer', 'reference' => 'C0001', 'price' => '49.95', 'stock' => '66', 'last-change' => '2016-04-01T10:00:00+02:00']); $ds->storeA()->addRow(['name' => 'A Pencil', 'reference' => 'D0001', 'price' => '2.9499', 'stock' => '2481', 'last-change' => '2016-04-01T10:00:00+02:00']); $ds->storeB()->addRow(['name' => 'A shiny Smartphone', 'reference' => 'A0001', 'price' => '519.99', 'stock' => '213']); $ds->storeB()->addRow(['name' => 'A Hairdryer', 'reference' => 'C0001', 'price' => '49.95', 'stock' => '12']); $ds->storeB()->addRow(['name' => 'A Pencil', 'reference' => 'D0001', 'price' => '2.95', 'stock' => '2481']);
A good rule of thumb is to use store a
for the data, you already have and to use store b
for the data to compare to (e.g. the data to import from an external data-source).
Next, we can query one of the stores to find differences in the lists. Since store a
holds our local data, we use store b
to query the differences:
Get all data-sets that are present in store b
but not in store a
:
foreach($ds->storeB()->getNew() as $row) { $data = $row->getData(); printf("This row is not present in store a: %s\n", $data['reference']); }
The result is This row is not present in store b: B0001
.
Get all data-sets that are present in store a
but not in store b
:
foreach($ds->storeB()->getMissing() as $row) { $data = $row->getForeignData(); printf("This row is not present in store a: %s\n", $data['reference']); }
The result is This row is not present in store a: A0001
.
Get all changed data-sets:
foreach($ds->storeB()->getChanged() as $row) { printf("This row is not present in store a: %s\n", $row->getDiffFormatted()); }
The result is This row is not present in store a: stock: 12 -> 66, last-change: -> 2016-04-01T10:00:00+02:00
.
Note that D0001
is absent from the result set. This is because the schema has normalized the decimal precision of the price
column, resulting in no detected differences.
Additionally, you can access the data divided into keys and values as defined in each schema. This is useful for constructing SQL statements, where keys can be used as WHERE
conditions in an UPDATE
statement, and values can represent the data to be changed (SET
).
print_r($row->getLocal()->getKeyData());
print_r($row->getLocal()->getValueData());
Example
<?php use DataDiff\MemoryDiffStorageBuilderFactory; use DataDiff\MemoryDiffStorage; require 'vendor/autoload.php'; $factory = new MemoryDiffStorageBuilderFactory(); $ds = $factory->createBuilder() ->addIntKey('client_id') ->addStringValue('description') ->addMoneyValue('total') ->build(); for($i=2; $i <= 501; $i++) { $row = ['client_id' => $i, 'description' => 'This is a test', 'total' => $i === 50 ? 60 : 59.98999, 'test' => $i % 2]; $ds->storeA()->addRow($row); } for($i=1; $i <= 500; $i++) { $row = ['client_id' => $i, 'description' => 'This is a test', 'total' => 59.98999, 'test' => $i % 3]; $ds->storeB()->addRow($row); } $res = $ds->storeA()->getNew(); foreach($res as $key => $value) { printf("Added : %s\n", $value['client_id']); } $res = $ds->storeA()->getChanged(); foreach($res as $key => $value) { printf("Changed: %s\n", $value['client_id']); } $res = $ds->storeA()->getMissing(); foreach($res as $key => $value) { printf("Removed: %s\n", $value['client_id']); } echo "\n"; $res = $ds->storeA()->getChanged(); foreach($res as $key => $value) { print_r($value->getDiff()); }
Output:
Added : 501
Changed: 50
Removed: 1
Array
(
[total] => Array
(
[local] => 60
[foreign] => 59.98999
)
[test] => Array
(
[local] => 0
[foreign] => 2
)
)