ducks-project / encoding-repair
A robust, immutable, and extensible PHP library to handle charset conversion, detection, and repair (Double Encoding) with safe JSON wrappers. Optimized for Legacy ISO-8859-1 to UTF-8 migrations.
pkg:composer/ducks-project/encoding-repair
Requires
- php: >=7.4
- ext-json: *
- ext-mbstring: *
- psr/simple-cache: ^1.0
Requires (Dev)
- ducks-project/encoding-repair-subpackage-replace: 1.0.0
- ergebnis/composer-normalize: ^2.48
- friendsofphp/php-cs-fixer: ^3.0
- phpbench/phpbench: ^1.2
- phpmd/phpmd: ^1.5
- phpstan/phpstan: ^1.10
- phpstan/phpstan-phpunit: ^1.4
- phpunit/phpunit: ^9.5 || ^10.0
- squizlabs/php_codesniffer: ^3.10
- vimeo/psalm: ^4.30 || ^5.0
Suggests
- ext-fileinfo: For advanced encoding detection
- ext-iconv: For iconv conversion with transliteration support
- ext-intl: For UConverter support (best performance and precision [30% faster])
README
Advanced charset encoding converter with Chain of Responsibility pattern, auto-detection, double-encoding repair, and JSON safety.
What's New in v1.2
Type Interpreter System
New optimized type-specific processing with Strategy + Visitor pattern:

    // Custom property mapper for selective processing (60% faster!)
    use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;

    class UserMapper implements PropertyMapperInterface
    {
        public function map(object $object, callable $transcoder, array $options): object
        {
            $copy = clone $object;
            $copy->name = $transcoder($object->name);
            $copy->email = $transcoder($object->email);
            // password NOT transcoded (security)
            return $copy;
        }
    }

    $processor = new CharsetProcessor();
    $processor->registerPropertyMapper(User::class, new UserMapper());

Batch Processing API
New optimized batch processing methods for high-performance array conversion:

    // Batch conversion with single encoding detection (40-60% faster!)
    $rows = $db->query("SELECT * FROM users")->fetchAll();
    $utf8Rows = CharsetHelper::toCharsetBatch($rows, 'UTF-8', CharsetHelper::AUTO);

    // Detect encoding from array
    $encoding = CharsetHelper::detectBatch($items);

Service-Based Architecture
CharsetHelper now uses a service-based architecture following SOLID principles:
- CharsetProcessor: Instantiable service with fluent API
- CharsetProcessorInterface: Service contract for dependency injection
- Multiple instances: Different configurations for different contexts
- 100% backward compatible: Existing code works unchanged

    // New way: Service instance
    $processor = new CharsetProcessor();
    $processor->addEncodings('SHIFT_JIS')->resetDetectors();
    $utf8 = $processor->toUtf8($data);

    // Old way: Static facade (still works)
    $utf8 = CharsetHelper::toUtf8($data);

PSR-16 Cache Support
Optional external cache integration for improved performance:

    // Use built-in InternalArrayCache (default, optimized)
    use Ducks\Component\EncodingRepair\Detector\CachedDetector;
    use Ducks\Component\EncodingRepair\Detector\MbStringDetector;

    $detector = new CachedDetector(new MbStringDetector());
    // InternalArrayCache used automatically (no TTL overhead)

    // Or use ArrayCache for TTL support
    use Ducks\Component\EncodingRepair\Cache\ArrayCache;

    $cache = new ArrayCache();
    $detector = new CachedDetector(new MbStringDetector(), $cache, 3600);

    // Or use any PSR-16 implementation (Redis, Memcached, APCu)
    // $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
    // $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

Why CharsetHelper?
Unlike existing libraries, CharsetHelper provides:
- ✅ Extensible architecture with Chain of Responsibility pattern
- ✅ PSR-16 cache support for Redis, Memcached, APCu (NEW in v1.2)
- ✅ Type-specific interpreters for optimized processing (NEW in v1.2)
- ✅ Custom property mappers for selective object conversion (NEW in v1.2)
- ✅ Multiple fallback strategies (UConverter → iconv → mbstring)
- ✅ Smart auto-detection with multiple detection methods
- ✅ Double-encoding repair for corrupted legacy data
- ✅ Recursive conversion for arrays AND objects (not just arrays!)
- ✅ Safe JSON encoding/decoding with automatic charset handling
- ✅ Modern PHP with strict typing (PHP 7.4+)
- ✅ Minimal dependencies (only PSR-16 interface for optional caching)
Features
- Robust Transcoding: Implements a Chain of Responsibility pattern trying best providers first (Intl/UConverter -> Iconv -> MbString).
- PSR-16 Cache Support: Optional external cache (Redis, Memcached, APCu) for detection results (NEW in v1.2).
- Type-Specific Interpreters: Optimized processing strategies per data type (NEW in v1.2).
- Custom Property Mappers: Selective object property conversion for security and performance (NEW in v1.2).
- Double-Encoding Repair: Automatically detects and fixes strings like Ã©tÃ© back to été.
- Recursive Processing: Handles string, array, and object recursively.
- Immutable: Objects are cloned before modification to prevent side effects.
- Safe JSON Wrappers: Prevents json_encode from returning false on bad charsets.
- Secure: Whitelisted encodings to prevent injection.
- Extensible: Register your own transcoders, detectors, interpreters, or cache providers without modifying the core.
- Modern Standards: PSR-12 compliant, strictly typed, SOLID architecture.
Requirements
- PHP: 7.4, 8.0, 8.1, 8.2, or 8.3
- Extensions (required): ext-mbstring, ext-json
- Extensions (recommended): ext-intl
Installation

    composer require ducks-project/encoding-repair

Optional Extensions (for better performance)

    # Ubuntu/Debian
    sudo apt-get install php-intl php-iconv

    # macOS (via Homebrew)
    brew install php@8.2  # Extensions are included by default

    # Windows
    # Enable in php.ini:
    extension=intl
    extension=iconv

Quick Start

    <?php

    use Ducks\Component\EncodingRepair\CharsetHelper;

    // Simple UTF-8 conversion
    $utf8String = CharsetHelper::toUtf8($latinString);

    // Automatic encoding detection
    $data = CharsetHelper::toCharset($mixedData, 'UTF-8', CharsetHelper::AUTO);

    // Repair double-encoded strings
    $fixed = CharsetHelper::repair($corruptedString);

    // Safe JSON with encoding handling
    $json = CharsetHelper::safeJsonEncode($data);

Usage
1. Basic Conversion
Convert between different character encodings:

    use Ducks\Component\EncodingRepair\CharsetHelper;

    $data = [
        'name' => 'Gérard', // ISO-8859-1 string
        'meta' => ['desc' => 'Ca coûte 10€'] // Nested array with Euro sign
    ];

    // Convert to UTF-8
    $utf8 = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);

    // Convert to ISO-8859-1 (Windows-1252)
    $iso = CharsetHelper::toIso($data, CharsetHelper::ENCODING_UTF8);

    // Convert to any encoding
    $result = CharsetHelper::toCharset(
        $data,
        CharsetHelper::ENCODING_UTF16,
        CharsetHelper::ENCODING_UTF8
    );

Note: We use Windows-1252 instead of strict ISO-8859-1 by default because it includes common characters such as €, Œ, and ™, which are missing from standard ISO-8859-1.
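The difference is visible at byte level. The following is a minimal sketch that uses only `mb_convert_encoding()` from ext-mbstring, not the library's own API: byte 0x80 is the Euro sign in Windows-1252 but a C1 control character in strict ISO-8859-1.

```php
<?php
// Byte 0x80 means "€" in Windows-1252 but a C1 control character in ISO-8859-1.
$byte = "\x80";

$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromCp1252), PHP_EOL; // e282ac (U+20AC, the Euro sign)
echo bin2hex($fromLatin1), PHP_EOL; // c280   (U+0080, a control character)
```

Treating legacy "Latin-1" data as Windows-1252 therefore keeps the Euro sign intact instead of turning it into an invisible control character.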
Supported Encodings:
- UTF-8
- UTF-16
- UTF-32
- ISO-8859-1
- Windows-1252 (CP1252)
- ASCII
- AUTO (automatic detection)
2. Automatic Encoding Detection
Let CharsetHelper detect the source encoding:

    // Automatic detection
    $result = CharsetHelper::toCharset(
        $unknownData,
        CharsetHelper::ENCODING_UTF8,
        CharsetHelper::AUTO // Will auto-detect source encoding
    );

    // Manual detection
    $encoding = CharsetHelper::detect($string);
    echo $encoding; // "UTF-8", "ISO-8859-1", etc.

    // Batch detection from array (faster for large datasets)
    $encoding = CharsetHelper::detectBatch($items);

    // With custom encoding list
    $encoding = CharsetHelper::detect($string, [
        'encodings' => ['UTF-8', 'Shift_JIS', 'EUC-JP']
    ]);

3. Batch Processing (New in v1.2)
Optimized for processing large arrays with single encoding detection:

    // Database migration with batch processing
    $rows = $db->query("SELECT * FROM users")->fetchAll(); // 10,000 rows

    // Slow: Detects encoding for each row (10,000 detections)
    $utf8Rows = array_map(
        fn($row) => CharsetHelper::toUtf8($row, CharsetHelper::AUTO),
        $rows
    );

    // Fast: Detects encoding once (1 detection, 40-60% faster!)
    $utf8Rows = CharsetHelper::toCharsetBatch(
        $rows,
        CharsetHelper::ENCODING_UTF8,
        CharsetHelper::AUTO
    );

    // CSV import example
    $csvData = array_map('str_getcsv', file('data.csv'));
    $utf8Csv = CharsetHelper::toCharsetBatch($csvData, 'UTF-8', CharsetHelper::AUTO);

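To illustrate why a single detection pass is cheaper, here is a rough sketch of the idea using plain mbstring functions. The function name `convertBatchSketch` and the candidate list are assumptions made for this example; the library's actual batch implementation may work differently.

```php
<?php
/**
 * Hypothetical sketch: detect the encoding once from all strings in the
 * batch, then convert every string with that single result.
 */
function convertBatchSketch(array $rows, array $candidates = ['UTF-8', 'ISO-8859-1']): array
{
    // Collect every string value, however deeply nested.
    $strings = [];
    array_walk_recursive($rows, function ($value) use (&$strings) {
        if (is_string($value)) {
            $strings[] = $value;
        }
    });

    // One detection for the whole batch (strict mode rejects invalid candidates).
    $detected = mb_detect_encoding(implode("\n", $strings), $candidates, true);
    if ($detected === false || $detected === 'UTF-8') {
        return $rows; // nothing to convert, or encoding unknown
    }

    // Convert each string using the encoding detected above.
    array_walk_recursive($rows, function (&$value) use ($detected) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, 'UTF-8', $detected);
        }
    });

    return $rows;
}

$rows = [['name' => "Caf\xE9"], ['name' => "Jos\xE9"]]; // ISO-8859-1 bytes
$utf8Rows = convertBatchSketch($rows);
```

The trade-off is that the whole batch is assumed to share one encoding; mixed-encoding datasets still need per-item detection.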
4. Recursive Conversion (Arrays & Objects)
Convert nested data structures:

    // Array conversion
    $data = [
        'name' => 'Café',
        'city' => 'São Paulo',
        'items' => [
            'entrée' => 'Crème brûlée',
            'plat' => 'Bœuf bourguignon'
        ]
    ];
    $utf8Data = CharsetHelper::toUtf8($data, CharsetHelper::WINDOWS_1252);

    // Object conversion
    class User
    {
        public $name;
        public $email;
    }

    $user = new User();
    $user->name = 'José';
    $user->email = 'josé@example.com';

    $utf8User = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
    // Returns a cloned object with converted properties

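The clone-before-modify behavior can be pictured with a small standalone sketch. `convertRecursiveSketch` is a hypothetical helper written for this illustration (it handles only public properties and does not convert array keys); it is not the library's interpreter code.

```php
<?php
/**
 * Hypothetical helper: applies $convert to every string found in $data.
 * Arrays are rebuilt and objects are cloned, so the input is never modified.
 */
function convertRecursiveSketch($data, callable $convert)
{
    if (is_string($data)) {
        return $convert($data);
    }
    if (is_array($data)) {
        $result = [];
        foreach ($data as $key => $value) {
            $result[$key] = convertRecursiveSketch($value, $convert);
        }
        return $result;
    }
    if (is_object($data)) {
        $copy = clone $data;
        // get_object_vars() called from outside the class only sees public properties.
        foreach (get_object_vars($copy) as $name => $value) {
            $copy->$name = convertRecursiveSketch($value, $convert);
        }
        return $copy;
    }
    return $data; // ints, floats, bools and null pass through untouched
}

$toUtf8 = function (string $s): string {
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
};

$user = new stdClass();
$user->name = "Jos\xE9";          // ISO-8859-1 bytes
$user->tags = ["caf\xE9", 42];

$converted = convertRecursiveSketch($user, $toUtf8);
// $user still holds the original bytes; $converted is a separate object.
```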
5. Double-Encoding Repair
Fix strings that have been encoded multiple times (common with legacy databases):

    // Example: "CafÃ©" (UTF-8 interpreted as ISO, then re-encoded as UTF-8)
    $corrupted = "CafÃ©";
    $fixed = CharsetHelper::repair($corrupted);
    echo $fixed; // "Café"

    // With custom max depth
    $fixed = CharsetHelper::repair(
        $corrupted,
        CharsetHelper::ENCODING_UTF8,
        CharsetHelper::ENCODING_ISO,
        ['maxDepth' => 10] // Try to peel up to 10 encoding layers
    );

How it works:
- Detects valid UTF-8 strings
- Attempts to reverse-convert (UTF-8 → source encoding)
- Repeats until no more layers found or max depth reached
- Returns the cleaned string
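The steps above can be sketched with mbstring alone. This is an illustration under stated assumptions (source encoding fixed to ISO-8859-1, a hypothetical function name `repairSketch`, and a round-trip check added to reject lossy conversions); the library's real implementation may differ.

```php
<?php
/**
 * Hypothetical sketch of layer peeling; not the library's implementation.
 */
function repairSketch(string $value, int $maxDepth = 5): string
{
    for ($depth = 0; $depth < $maxDepth; $depth++) {
        // Step 1: only valid UTF-8 can be a double-encoded candidate.
        if (!mb_check_encoding($value, 'UTF-8')) {
            break;
        }

        // Step 2: reverse-convert UTF-8 -> ISO-8859-1.
        $peeled = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');

        // Stop when nothing changed (pure ASCII), when the result is not valid
        // UTF-8 (the input was already clean), or when the conversion was lossy
        // (a character had no ISO-8859-1 equivalent, so the round trip differs).
        if ($peeled === $value
            || !mb_check_encoding($peeled, 'UTF-8')
            || mb_convert_encoding($peeled, 'UTF-8', 'ISO-8859-1') !== $value
        ) {
            break;
        }

        // Step 3: one layer removed, try again.
        $value = $peeled;
    }

    // Step 4: return the cleaned string.
    return $value;
}

$clean  = "Caf\xC3\xA9";                                        // "Café" in UTF-8
$double = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1');   // one extra layer
$triple = mb_convert_encoding($double, 'UTF-8', 'ISO-8859-1');  // two extra layers

$fixedDouble = repairSketch($double);
$fixedTriple = repairSketch($triple);
```

Both results equal `$clean`, and a string that is already clean is returned unchanged because peeling it would produce invalid UTF-8.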
6. Safe JSON Operations
Prevent JSON encoding/decoding errors caused by invalid UTF-8:

    // Safe encoding (auto-repairs before encoding)
    $json = CharsetHelper::safeJsonEncode($data);

    // Safe decoding with charset conversion
    $data = CharsetHelper::safeJsonDecode(
        $json,
        true, // associative array
        512, // depth
        0, // flags
        CharsetHelper::ENCODING_UTF8, // target encoding
        CharsetHelper::WINDOWS_1252 // source encoding for repair
    );

    // Throws RuntimeException on error with clear message
    try {
        $json = CharsetHelper::safeJsonEncode($invalidData);
    } catch (RuntimeException $e) {
        echo $e->getMessage(); // "JSON Encode Error: Malformed UTF-8 characters"
    }

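For context, plain `json_encode()` returns `false` when a string contains bytes that are not valid UTF-8. The sketch below shows one way a safe wrapper could behave; `safeJsonEncodeSketch` is a hypothetical name, it assumes invalid strings are Windows-1252, and it only fixes array values (not keys or objects), so it should not be read as the library's implementation.

```php
<?php
/**
 * Hypothetical wrapper, not the library's code: converts invalid UTF-8 string
 * values (assumed to be Windows-1252) before encoding, and throws on failure.
 */
function safeJsonEncodeSketch($data, int $flags = 0): string
{
    $fix = function ($value) use (&$fix) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            return mb_convert_encoding($value, 'UTF-8', 'Windows-1252');
        }
        if (is_array($value)) {
            return array_map($fix, $value); // keys are preserved for a single array
        }
        return $value;
    };

    try {
        return json_encode($fix($data), $flags | JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        throw new RuntimeException('JSON Encode Error: ' . $e->getMessage(), 0, $e);
    }
}

$row = ['name' => "Caf\xE9"];             // ISO-8859-1 / Windows-1252 bytes

var_dump(json_encode($row));              // bool(false): malformed UTF-8
echo safeJsonEncodeSketch($row), PHP_EOL; // {"name":"Caf\u00e9"}
```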
7. Conversion Options
Fine-tune the conversion behavior:

    $result = CharsetHelper::toCharset($data, 'UTF-8', 'ISO-8859-1', [
        'normalize' => true, // Apply Unicode NFC normalization (default: true)
        'translit' => true, // Transliterate unavailable chars (default: true)
        'ignore' => true, // Ignore invalid sequences (default: true)
        'encodings' => ['UTF-8', 'ISO-8859-1', 'Shift_JIS'] // For detection
    ]);

Options explained:
- normalize: Applies Unicode NFC normalization to UTF-8 output (combines accents)
- translit: Converts unmappable characters to similar ones (é → e)
- ignore: Skips invalid byte sequences instead of failing
- encodings: List of encodings to try during auto-detection
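As an illustration of what "combines accents" means for the normalize option, the following sketch calls PHP's `Normalizer` class from ext-intl directly. It is independent of the library and falls back to the unmodified input when ext-intl is not installed.

```php
<?php
// "e" followed by U+0301 (combining acute accent): two code points, one glyph.
$decomposed = "e\u{0301}";

// Normalizer ships with ext-intl; fall back to the input when it is missing.
$composed = class_exists('Normalizer')
    ? Normalizer::normalize($decomposed, Normalizer::FORM_C)
    : $decomposed;

echo bin2hex($decomposed), PHP_EOL; // 65cc81
echo bin2hex($composed), PHP_EOL;   // c3a9 when ext-intl is available (single code point U+00E9)
```

Both forms render identically, but only the composed form compares equal to an "é" typed on most keyboards, which is why NFC output is usually the safer default for storage and comparison.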
Advanced Usage
Using CharsetProcessor Service (New in v1.1)
For better testability and flexibility, use the CharsetProcessor service directly:

    use Ducks\Component\EncodingRepair\CharsetProcessor;

    // Create a processor instance
    $processor = new CharsetProcessor();

    // Fluent API for configuration
    $processor
        ->addEncodings('SHIFT_JIS', 'EUC-JP')
        ->queueTranscoders(new MyCustomTranscoder())
        ->resetDetectors();

    // Use the processor
    $utf8 = $processor->toUtf8($data);

Multiple Processor Instances

    // Production processor with strict encodings
    $prodProcessor = new CharsetProcessor();
    $prodProcessor->resetEncodings()->addEncodings('UTF-8', 'ISO-8859-1');

    // Import processor with permissive encodings
    $importProcessor = new CharsetProcessor();
    $importProcessor->addEncodings('SHIFT_JIS', 'EUC-JP', 'GB2312');

    // Both are independent
    $prodResult = $prodProcessor->toUtf8($data);
    $importResult = $importProcessor->toUtf8($legacyData);

Dependency Injection

    use Ducks\Component\EncodingRepair\CharsetProcessorInterface;

    class MyService
    {
        private CharsetProcessorInterface $processor;

        public function __construct(CharsetProcessorInterface $processor)
        {
            $this->processor = $processor;
        }

        public function process($data)
        {
            return $this->processor->toUtf8($data);
        }
    }

    // Easy to mock in tests
    $mock = $this->createMock(CharsetProcessorInterface::class);
    $service = new MyService($mock);

Custom Property Mappers (New in v1.2)
Optimize object processing by converting only specific properties:

    use Ducks\Component\EncodingRepair\Interpreter\PropertyMapperInterface;

    class UserMapper implements PropertyMapperInterface
    {
        public function map(object $object, callable $transcoder, array $options): object
        {
            $copy = clone $object;
            $copy->name = $transcoder($object->name);
            $copy->email = $transcoder($object->email);
            // password is NOT transcoded (security)
            // avatar_binary is NOT transcoded (performance)
            return $copy;
        }
    }

    $processor = new CharsetProcessor();
    $processor->registerPropertyMapper(User::class, new UserMapper());

    $user = new User();
    $user->name = 'José';
    $user->password = 'secret123'; // Will NOT be converted

    $utf8User = $processor->toUtf8($user);
    // Performance: 60% faster for objects with 50+ properties

Custom Type Interpreters (New in v1.2)
Add support for custom data types:

    use Ducks\Component\EncodingRepair\Interpreter\TypeInterpreterInterface;

    class ResourceInterpreter implements TypeInterpreterInterface
    {
        public function supports($data): bool
        {
            return \is_resource($data);
        }

        public function interpret($data, callable $transcoder, array $options)
        {
            $content = \stream_get_contents($data);
            $converted = $transcoder($content);

            $newResource = \fopen('php://memory', 'r+');
            \fwrite($newResource, $converted);
            \rewind($newResource);

            return $newResource;
        }

        public function getPriority(): int
        {
            return 80;
        }
    }

    $processor->registerInterpreter(new ResourceInterpreter(), 80);

    $resource = fopen('data.txt', 'r');
    $convertedResource = $processor->toUtf8($resource);

Registering Custom Transcoders
Extend CharsetHelper with your own conversion strategies using the TranscoderInterface:

    use Ducks\Component\EncodingRepair\Transcoder\TranscoderInterface;

    class MyCustomTranscoder implements TranscoderInterface
    {
        public function transcode(string $data, string $to, string $from, array $options): ?string
        {
            if ($from === 'MY-CUSTOM-ENCODING') {
                return myCustomConversion($data, $to);
            }
            // Return null to try next transcoder in chain
            return null;
        }

        public function getPriority(): int
        {
            return 75; // Between iconv (50) and UConverter (100)
        }

        public function isAvailable(): bool
        {
            return extension_loaded('my_extension');
        }
    }

    // Register with default priority
    CharsetHelper::registerTranscoder(new MyCustomTranscoder());

    // Register with custom priority
    CharsetHelper::registerTranscoder(new MyCustomTranscoder(), 150);

    // Legacy: Register a callable
    CharsetHelper::registerTranscoder(
        function (string $data, string $to, string $from, array $options): ?string {
            if ($from === 'MY-CUSTOM-ENCODING') {
                return myCustomConversion($data, $to);
            }
            return null;
        },
        150 // Priority
    );

Registering Custom Detectors
Add custom encoding detection methods using the DetectorInterface:

    use Ducks\Component\EncodingRepair\Detector\DetectorInterface;

    class MyCustomDetector implements DetectorInterface
    {
        public function detect(string $string, array $options): ?string
        {
            // Check for UTF-16LE BOM
            if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
                return 'UTF-16LE';
            }
            // Return null to try next detector
            return null;
        }

        public function getPriority(): int
        {
            return 150; // Higher than MbStringDetector (100)
        }

        public function isAvailable(): bool
        {
            return true;
        }
    }

    // Register with default priority
    CharsetHelper::registerDetector(new MyCustomDetector());

    // Register with custom priority
    CharsetHelper::registerDetector(new MyCustomDetector(), 200);

    // Legacy: Register a callable
    CharsetHelper::registerDetector(
        function (string $string, array $options): ?string {
            if (strlen($string) >= 2 && ord($string[0]) === 0xFF && ord($string[1]) === 0xFE) {
                return 'UTF-16LE';
            }
            return null;
        },
        200 // Priority
    );

Chain of Responsibility Pattern
The class uses a Chain of Responsibility pattern for both detection and transcoding.
CharsetHelper uses multiple strategies with automatic fallback:

    UConverter (intl) → iconv → mbstring
         ↓ (fails)     ↓ (fails)   ↓ (always works)

Transcoder priorities:
- UConverter (priority: 100, requires ext-intl): Best precision, supports many encodings
- iconv (priority: 50): Good performance, supports transliteration
- mbstring (priority: 10): Universal fallback, most permissive
Custom transcoders can be registered with any priority value. Higher values execute first.
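The "higher priority first, null means try the next one" rule can be pictured with a miniature chain. The class `ChainSketch` and its methods are hypothetical names invented for this example; they only mirror the behavior described above and are not the library's classes.

```php
<?php
/**
 * Hypothetical miniature of a priority-ordered chain; class and method names
 * are illustrative and do not come from the library.
 */
final class ChainSketch
{
    /** @var array<int, array{priority: int, handler: callable}> */
    private $entries = [];

    public function register(callable $handler, int $priority): self
    {
        $this->entries[] = ['priority' => $priority, 'handler' => $handler];
        // Keep the list sorted with the highest priority first.
        usort($this->entries, function (array $a, array $b): int {
            return $b['priority'] <=> $a['priority'];
        });
        return $this;
    }

    public function transcode(string $data, string $to, string $from): ?string
    {
        foreach ($this->entries as $entry) {
            $result = ($entry['handler'])($data, $to, $from);
            if ($result !== null) {
                return $result; // first handler that succeeds wins
            }
        }
        return null; // every handler declined
    }
}

$chain = (new ChainSketch())
    ->register(function (string $d, string $to, string $from): ?string {
        return mb_convert_encoding($d, $to, $from); // universal fallback
    }, 10)
    ->register(function (string $d, string $to, string $from): ?string {
        // Stand-in for a custom conversion; declines every other encoding.
        return $from === 'MY-CUSTOM-ENCODING' ? strtoupper($d) : null;
    }, 75);

$utf8   = $chain->transcode("Caf\xE9", 'UTF-8', 'ISO-8859-1');      // handled by priority 10
$custom = $chain->transcode('abc', 'UTF-8', 'MY-CUSTOM-ENCODING');  // handled by priority 75
```

Registration order does not matter here: the priority 75 handler always runs before the priority 10 fallback, even though it was registered second.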
Detector priorities:
- CachedDetector (priority: 200, wraps MbStringDetector): Caches detection results
- MbStringDetector (priority: 100, requires ext-mbstring): Fast and reliable using mb_detect_encoding
- FileInfoDetector (priority: 50, requires ext-fileinfo): Fallback using finfo class
Custom detectors can be registered with any priority value. Higher values execute first.
Cache Support (New in v1.2):
CachedDetector supports PSR-16 cache for persistent detection results:

    // Default: InternalArrayCache (optimized, no TTL overhead)
    $detector = new CachedDetector(new MbStringDetector());

    // With TTL: ArrayCache
    $cache = new ArrayCache();
    $detector = new CachedDetector(new MbStringDetector(), $cache, 3600);

    // External: Redis, Memcached, APCu, etc.
    // $redis = new \Symfony\Component\Cache\Psr16Cache($redisAdapter);
    // $detector = new CachedDetector(new MbStringDetector(), $redis, 7200);

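Conceptually, a caching detector memoizes the result per input string. The sketch below uses a plain in-memory array and a hypothetical class name (`MemoizedDetectorSketch`); a PSR-16 backed version would replace the array with the cache's `get()` and `set()` calls. It is not the library's CachedDetector.

```php
<?php
/**
 * Hypothetical sketch of detection caching with a plain in-memory array.
 */
final class MemoizedDetectorSketch
{
    /** @var callable */
    private $inner;

    /** @var array<string, string|null> */
    private $cache = [];

    /** @var int Number of times the wrapped detector actually ran. */
    public $misses = 0;

    public function __construct(callable $inner)
    {
        $this->inner = $inner;
    }

    public function detect(string $string): ?string
    {
        // A hash gives a fixed-size key that is safe for any byte content.
        $key = hash('sha256', $string);
        if (!array_key_exists($key, $this->cache)) {
            $this->misses++;
            $this->cache[$key] = ($this->inner)($string);
        }
        return $this->cache[$key];
    }
}

$detector = new MemoizedDetectorSketch(function (string $s): ?string {
    $found = mb_detect_encoding($s, ['UTF-8', 'ISO-8859-1'], true);
    return $found === false ? null : $found;
});

$first  = $detector->detect("Caf\xE9"); // runs mb_detect_encoding()
$second = $detector->detect("Caf\xE9"); // served from the array, no second detection
```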
Performance
Benchmarks on 10,000 conversions (PHP 8.2, i7-12700K):
| Operation | Time | Memory |
|---|---|---|
| Simple UTF-8 conversion | 45ms | 2MB |
| Array (100 items) | 180ms | 5MB |
| Auto-detection + conversion | 92ms | 3MB |
| Double-encoding repair | 125ms | 4MB |
| Safe JSON encode | 67ms | 3MB |
| Batch conversion (1000 items) | ~60% faster | Same |
| Object with custom mapper (50 props) | ~60% faster | Same |
Tips for performance:
- Install ext-intl for best performance (UConverter is fastest)
- Use specific encodings instead of AUTO when possible
- Use batch methods (toCharsetBatch()) for arrays > 100 items with AUTO detection
- Cache detection results for repeated operations
Comparison with Alternatives
| Feature | CharsetHelper | ForceUTF8 | Symfony String | Portable UTF-8 |
|---|---|---|---|---|
| Multiple fallback strategies | β | β | β | β |
| Extensible (CoR pattern) | β | β | β | β |
| Object recursion | β | β | β | β |
| Double-encoding repair | β | β | β | β οΈ |
| Safe JSON helpers | β | β | β | β |
| Multi-encoding support | β (7+) | β οΈ (2) | β οΈ | β οΈ (3) |
| Modern PHP (7.4+, strict types) | β | β | β | β οΈ |
| Zero dependencies | β | β | β | β |
Use Cases
1. Database Migration (Latin1 → UTF-8)

    // Migrate user table
    $users = $db->query("SELECT * FROM users")->fetchAll();

    foreach ($users as $user) {
        $user = CharsetHelper::toUtf8($user, CharsetHelper::ENCODING_ISO);
        $db->update('users', $user, ['id' => $user['id']]);
    }

2. CSV Import with Unknown Encoding
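
    $csv = file_get_contents('data.csv');

    // Auto-detect and convert
    $utf8Csv = CharsetHelper::toCharset(
        $csv,
        CharsetHelper::ENCODING_UTF8,
        CharsetHelper::AUTO
    );

    // Parse as UTF-8, one record per line (str_getcsv() parses a single record)
    $data = array_map('str_getcsv', preg_split('/\r\n|\r|\n/', trim($utf8Csv)));
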
3. API Response Sanitization

    // Ensure API responses are always valid UTF-8
    class ApiController
    {
        public function jsonResponse($data): JsonResponse
        {
            $json = CharsetHelper::safeJsonEncode($data);

            return new JsonResponse($json, 200, [], true);
        }
    }

4. Web Scraping

    $html = file_get_contents('https://example.com');

    // Detect encoding from HTML meta tags or auto-detect
    $encoding = CharsetHelper::detect($html);

    // Convert to UTF-8 for processing
    $utf8Html = CharsetHelper::toCharset(
        $html,
        CharsetHelper::ENCODING_UTF8,
        $encoding
    );

    $dom = new DOMDocument();
    $dom->loadHTML($utf8Html);

5. Legacy System Integration

    // Fix double-encoded data from old system
    $legacyData = $oldSystem->getData();

    // Repair corruption
    $clean = CharsetHelper::repair(
        $legacyData,
        CharsetHelper::ENCODING_UTF8,
        CharsetHelper::ENCODING_ISO
    );

    // Process clean data
    processData($clean);

Testing

    # Run tests
    composer test

    # Run tests with coverage
    composer unittest -- --coverage-html coverage

    # Static analysis
    composer phpstan

    # Auto-fix code style
    composer phpcsfixer-check

Glossary
- Changelog
- How To
- About Middleware Pattern
- Type Interpreter System
- CharsetHelper
- CharsetProcessor
- CharsetProcessorInterface
- PrioritizedHandlerInterface
- TypeInterpreterInterface
- PropertyMapperInterface
- InterpreterChain
- StringInterpreter
- ArrayInterpreter
- ObjectInterpreter
- TranscoderInterface
- CallableTranscoder
- IconvTranscoder
- MbStringTranscoder
- UConverterTranscoder
- DetectorInterface
- CallableDetector
- MbStringDetector
- FileInfoDetector
- CallableAdapterTrait
- ChainOfResponsibilityTrait
- CachedDetector
- InternalArrayCache
- ArrayCache
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Write tests for your changes
- Ensure tests pass (composer test)
- Run static analysis (composer analyse)
- Fix code style (composer cs-fix)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Setup

    git clone https://github.com/ducks-project/encoding-repair.git
    cd encoding-repair
    composer install

    # Run full CI checks locally
    composer ci

Code Quality Standards
- PSR-12 / PER Coding Style
- PHPStan level 8
- 100% type coverage
- Minimum 90% code coverage
License
This project is licensed under the MIT License. See the LICENSE file for details.
Credits
- Inspired by ForceUTF8 (simplified approach)
- Uses design patterns from Symfony (extensibility)
- Fallback strategies similar to Portable UTF-8
Links
- Documentation: https://github.com/ducks-project/encoding-repair/wiki
- Issue Tracker: https://github.com/ducks-project/encoding-repair/issues
- Changelog: CHANGELOG.md
- Packagist: https://packagist.org/packages/ducks-project/encoding-repair
Support
- Email: adrien.loyant@gmail.com
- Discussions: https://github.com/ducks-project/encoding-repair/discussions
- Issues: https://github.com/ducks-project/encoding-repair/issues
Star History
If this project helped you, please consider giving it a ⭐ on GitHub!
Made with ❤️ by the Duck Project Team