arteq/tmx-utils

Read and write TMX files, simple element manipulation

v1.4.3 2020-12-15 12:41 UTC


Last update: 2024-05-15 20:14:21 UTC


README

Description

A simple library for handling Translation Memory eXchange (TMX) files used in the translation industry. It can read and write TMX files and split large files into smaller ones.

More on the TMX standard: http://xml.coverpages.org/tmxSpec971212.html

Installation

Run Composer to fetch the package from Packagist:

$ composer require arteq/tmx-utils

Optionally, you can run the tests (PHPUnit, CS-Fixer, PHPStan):

$ make

Note: the code quality tests require PHP 7.x. If you run PHP 5.6, you can use the simplified composer_php56.json file instead and run only the PHPUnit tests:

$ env COMPOSER=composer_php56.json composer install
$ make phpunit

The library itself will run just fine on both PHP 5.x and 7.x.

Reader

The Reader class extracts translation units and returns them as a multidimensional array. The first-level array keys are translation unit ids, the second-level keys are segment language codes, and the values are the segment texts themselves. All additional properties and element attributes are ignored.

Usage

$reader = new ArteQ\Tmx\Reader('file.tmx');

// get all translation units
$units = $reader->get();
var_dump($units);

// get single translation unit by its id
$unit = $reader->get('tu-123');
var_dump($unit);

// get all translation units for given language code
$unitsLang = $reader->getLang('en_UK');
var_dump($unitsLang);
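
The array shape described above can be illustrated with a minimal sketch built on PHP's SimpleXML extension. This is not the library's implementation: tmxToArray() is a hypothetical helper, the sample document is made up, and the sketch assumes PHP 7.0+ and that unit ids are stored in the tuid attribute.

```php
<?php
// Hypothetical helper (not part of arteq/tmx-utils): builds the
// [tuid => [language code => segment text]] shape from a TMX string.
function tmxToArray(string $xml): array
{
    $doc = new SimpleXMLElement($xml);
    $units = [];
    foreach ($doc->body->tu as $tu) {
        $tuid = (string) $tu['tuid'];
        foreach ($tu->tuv as $tuv) {
            // xml:lang lives in the reserved "xml" namespace prefix
            $lang = (string) $tuv->attributes('xml', true)->lang;
            $units[$tuid][$lang] = (string) $tuv->seg;
        }
    }
    return $units;
}

// Made-up sample document with a single translation unit
$xml = <<<XML
<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu tuid="tu-123" creationid="user-1">
      <tuv xml:lang="en"><seg>English text</seg></tuv>
      <tuv xml:lang="pl"><seg>Tekst polski</seg></tuv>
    </tu>
  </body>
</tmx>
XML;

$units = tmxToArray($xml);
// $units is now:
// ['tu-123' => ['en' => 'English text', 'pl' => 'Tekst polski']]
// The creationid attribute is dropped, mirroring the "attributes are ignored" rule.
print_r($units);
```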

Writer

The Writer class is used to create a TMX file. Translation units can be added to the internal in-memory data store one by one using the set($tuid, $xmlLang, $value = '') method, or all at once from an array using setArray(array $data). Keeping the data in memory makes further manipulation possible, e.g. fetching a translation unit by its id, changing it, or deleting it. To save the data to a file, use write().

Usage - simple

$tmx = new ArteQ\Tmx\Writer('file.tmx');

// set two segments for one translation unit identified by id 'tuid-123'
$tmx->set('tuid-123', 'pl_PL', 'Tekst polski');
$tmx->set('tuid-123', 'en_EN', 'English text');

// add an additional attribute to the same translation unit
$tmx->setAttribute('tuid-123', 'creationid', 'user-123');

// add an additional property to the same translation unit
$tmx->setProperty('tuid-123', 'client', 'ACME Ltd.');

// save data to file
$tmx->write();

For very large files with many translation units that you only want to save to a TMX file without any data manipulation, use the streamed approach. Each translation unit is added one by one, and the output is flushed to disk every 1000 units. This keeps memory usage low because the whole dataset does not have to be held in memory.

Usage - streamed

// expected data format
$data = [
	'tuid-1' => [
		'pl' => 'tekst polski',
		'en' => 'english text',
		'_attributes' => [
			'attr1' => 'value1',
			'attr2' => 'value2',
		],
	],
	'tuid-2' => [
		'pl' => 'inny tekst',
		'_properties' => [
			'type1' => 'value1',
			'type2' => 'value2',
		],
	],
	'tuid-3' => [
		'pl' => 'kolejny tekst',
		'en' => 'another text',
	]
];

$tmx = new ArteQ\Tmx\Writer('file.tmx');

$tmx->writeStart();
foreach ($data as $tuid => $tuvs)
{
	$tmx->writeTu($tuid, $tuvs);
}
$tmx->writeEnd();
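
To show why periodic flushing keeps memory usage low, here is a minimal sketch of the same idea written directly against PHP's XMLWriter. It is not the library's code: streamTmx() and its $flushEvery parameter are hypothetical, the header is reduced to a bare minimum, and the handling of the _attributes and _properties keys is an assumption based on the data format shown above. It requires PHP 7.1+.

```php
<?php
// Hypothetical sketch (not the library's Writer): stream units to disk,
// flushing XMLWriter's buffer every $flushEvery translation units.
function streamTmx(string $path, iterable $data, int $flushEvery = 1000): int
{
    $w = new XMLWriter();
    $w->openUri($path);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('tmx');
    $w->writeAttribute('version', '1.4');
    $w->startElement('header');          // minimal header, for illustration only
    $w->writeAttribute('srclang', '*all*');
    $w->endElement();
    $w->startElement('body');

    $count = 0;
    foreach ($data as $tuid => $tuvs) {
        $w->startElement('tu');
        $w->writeAttribute('tuid', (string) $tuid);
        // Attributes must be written before any child element
        foreach ($tuvs['_attributes'] ?? [] as $name => $value) {
            $w->writeAttribute($name, $value);
        }
        foreach ($tuvs['_properties'] ?? [] as $type => $value) {
            $w->startElement('prop');
            $w->writeAttribute('type', $type);
            $w->text($value);
            $w->endElement();
        }
        foreach ($tuvs as $lang => $text) {
            if ($lang === '_attributes' || $lang === '_properties') {
                continue;
            }
            $w->startElement('tuv');
            $w->writeAttribute('xml:lang', (string) $lang);
            $w->writeElement('seg', $text);
            $w->endElement();
        }
        $w->endElement(); // tu

        if (++$count % $flushEvery === 0) {
            $w->flush(); // push buffered XML to disk and free the buffer
        }
    }

    $w->endElement(); // body
    $w->endElement(); // tmx
    $w->endDocument();
    $w->flush();
    return $count;
}

// A generator yields one unit at a time, so the dataset is never fully in memory.
$units = (function () {
    yield 'tuid-1' => ['pl' => 'tekst polski', 'en' => 'english text',
                       '_attributes' => ['creationid' => 'user-1']];
    yield 'tuid-2' => ['pl' => 'inny tekst',
                       '_properties' => ['client' => 'ACME Ltd.']];
})();

$streamPath = sys_get_temp_dir() . '/streamed_' . uniqid() . '.tmx';
$written = streamTmx($streamPath, $units, 1);
```

Because the source is any iterable, the same function works with a plain array or with rows fetched lazily from a database cursor.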

See tests/WriterTest.php for more examples.

Credits

The initial version of the TMX reader/writer was based on a project by Maxime Maupeu, which can be found on his GitHub.

Splitter

Splitter can be used to read a large TMX file (many GB in size) and save it as a series of smaller files that are easier to handle in third-party software. It uses the streaming XMLReader and XMLWriter classes, so memory usage stays very low because the entire file never has to be read into memory. The TMX header element of the input file is copied into every output file unchanged.

The number of translation units (<tu>) saved in each chunk is set with the setLimit() method.

Usage

$splitter = new ArteQ\Tmx\Splitter('large.tmx');

$splitter->setLimit(1000);
$splitter->split();
var_dump($splitter->getStats());
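
The streaming approach can be sketched with XMLReader and XMLWriter alone. The code below is an illustration under assumptions, not the library's Splitter: splitTmx() is a hypothetical function, the output naming follows the file_000.tmx pattern, and the sample input is generated on the fly. It requires PHP 7.0+.

```php
<?php
// Hypothetical sketch (not the library's Splitter): copy <tu> elements
// from one large TMX into chunk files of at most $limit units each.
function splitTmx(string $input, int $limit): array
{
    if ($limit < 1) {
        throw new InvalidArgumentException('Limit must be at least 1');
    }
    $reader = new XMLReader();
    if (!$reader->open($input)) {
        throw new RuntimeException("Cannot open $input");
    }

    // Walk forward to the first <tu>, remembering the header on the way.
    $version = '1.4';
    $header = '';
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        if ($reader->name === 'tmx') {
            $version = $reader->getAttribute('version') ?: $version;
        } elseif ($reader->name === 'header') {
            $header = $reader->readOuterXml();
        } elseif ($reader->name === 'tu') {
            break;
        }
    }

    $finish = function (XMLWriter $w) {
        $w->endElement(); // body
        $w->endElement(); // tmx
        $w->endDocument();
        $w->flush();
    };

    $base = preg_replace('/\.tmx$/i', '', $input);
    $files = [];
    $writer = null;
    $count = 0;

    while ($reader->name === 'tu') {
        if ($count % $limit === 0) {
            if ($writer !== null) {
                $finish($writer);
            }
            $path = sprintf('%s_%03d.tmx', $base, count($files));
            $files[] = $path;
            $writer = new XMLWriter();
            $writer->openUri($path);
            $writer->startDocument('1.0', 'UTF-8');
            $writer->startElement('tmx');
            $writer->writeAttribute('version', $version);
            $writer->writeRaw($header); // header copied unchanged
            $writer->startElement('body');
        }
        // Only the current <tu> subtree is ever held in memory.
        $writer->writeRaw($reader->readOuterXml());
        $count++;
        $reader->next('tu'); // jump to the next <tu>, skipping this subtree
    }
    if ($writer !== null) {
        $finish($writer);
    }
    $reader->close();
    return $files;
}

// Build a small made-up input with five units, then split it two per file.
$splitInput = sys_get_temp_dir() . '/split_' . uniqid() . '.tmx';
$body = '';
for ($i = 1; $i <= 5; $i++) {
    $body .= "<tu tuid=\"tu-$i\"><tuv xml:lang=\"en\"><seg>text $i</seg></tuv></tu>";
}
file_put_contents(
    $splitInput,
    "<tmx version=\"1.4\"><header srclang=\"en\"/><body>$body</body></tmx>"
);
$chunks = splitTmx($splitInput, 2);
```

Writing each unit with writeRaw() keeps the sketch short, at the cost of not re-indenting the copied XML.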

Notes

  1. UTF-8 encoding is assumed for both input and output files.
  2. Output files have a number appended to their name to indicate their order (e.g. file_000.tmx, file_001.tmx, etc.).
  3. Output files may not have perfectly indented XML elements, because the writeRaw() function is used for simplicity.