masakielastic / icu4x
ICU4X PHP extension for Unicode text segmentation and East Asian Width, built with ext-php-rs
Package info
github.com/masakielastic/php-ext-icu4x
Type: php-ext
Ext name: ext-icu4x
pkg:composer/masakielastic/icu4x
Requires
- php: >=8.1
A PHP extension for Unicode text segmentation using ICU4X, built with ext-php-rs.
Features
- Grapheme Cluster Segmentation: Proper handling of Unicode text including emojis and complex scripts
- East Asian Width Support: Calculate display width of Unicode characters for proper text layout
- Multiple APIs: Both object-oriented class API and functional API
- SPL Interface Support: Full integration with PHP's Standard PHP Library interfaces
- ICU4X 2.0: Built on version 2.0 of the ICU4X Unicode library
- Memory Efficient: Rust-based implementation with zero-copy optimizations
Installation
Via PIE (Recommended)
PIE (PHP Installer for Extensions) allows you to install this extension directly from packagist.org.
pie install masakielastic/icu4x
Then add the extension to your php.ini:
extension=icu4x
Manual Build
Prerequisites
- Rust 1.70+
- PHP 8.1+
- ext-php-rs 0.14.0
1. Clone the repository:
git clone https://github.com/masakielastic/php-ext-icu4x.git
cd php-ext-icu4x
2. Build and install the extension:
cargo build --release
sudo cp target/release/libicu4x.so "$(php-config --extension-dir)/icu4x.so"
3. Add to php.ini (the file is usually root-owned, so append through sudo tee rather than a plain redirect):
echo "extension=icu4x" | sudo tee -a "$(php-config --ini-path)/php.ini"
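After either installation route, it can help to confirm that PHP actually picked up the extension. The following is a minimal sketch, not part of the project: the helper name `icu4x_install_status` is hypothetical, and the module name `icu4x` is an assumption inferred from the `extension=icu4x` line. The function and class names it checks are the ones documented in the API Reference.

```php
<?php
// Hypothetical helper: reports whether the extension and its documented
// entry points are visible to the current PHP process.
function icu4x_install_status(): array
{
    return [
        // Assumption: the module registers under the name "icu4x".
        'extension_loaded' => extension_loaded('icu4x'),
        'icu4x_segmenter'  => function_exists('icu4x_segmenter'),
        'icu4x_eaw_width'  => function_exists('icu4x_eaw_width'),
        'ICU4X\\Segmenter' => class_exists('ICU4X\\Segmenter'),
    ];
}

foreach (icu4x_install_status() as $name => $available) {
    echo str_pad($name, 18), $available ? 'yes' : 'no', PHP_EOL;
}
```

If every row prints `no`, the php.ini that was edited is probably not the one the current SAPI reads; `php --ini` lists the files in use.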
Usage
Function API (Recommended)
<?php
// Basic usage
$iterator = icu4x_segmenter("Hello World");
foreach ($iterator as $segment) {
    echo $segment . "\n";
}

// With parameters
$iterator = icu4x_segmenter("こんにちは👋世界", "grapheme", null);
echo "Total segments: " . count($iterator) . "\n";

// Complex Unicode text: a flag, a rainbow flag, and a family emoji
// (the last two are ZWJ sequences)
$text = "🇺🇸🏳️‍🌈👨‍👩‍👧‍👦";
$segments = icu4x_segmenter($text);
foreach ($segments as $i => $segment) {
    echo "[$i] => '$segment'\n";
}
Class API
<?php
// Create segmenter instance
$segmenter = new ICU4X\Segmenter('grapheme', null);

// Segment text
$iterator = $segmenter->segment("Hello World");

// Iterate over segments
foreach ($iterator as $segment) {
    echo $segment . "\n";
}

// Use SPL interfaces
echo "Count: " . count($iterator) . "\n";
echo "Is countable: " . ($iterator instanceof Countable ? "Yes" : "No") . "\n";
echo "Is iterable: " . ($iterator instanceof IteratorAggregate ? "Yes" : "No") . "\n";
API Reference
Function API
icu4x_segmenter(string $text, string $mode = 'grapheme', ?string $locale = null): ICU4X\SegmentIterator
Segments the input text into grapheme clusters.
Parameters:
- $text (string): The text to segment
- $mode (string, optional): Segmentation mode; currently only 'grapheme' is supported
- $locale (string|null, optional): Locale for segmentation rules
Returns: ICU4X\SegmentIterator - An iterator over text segments
icu4x_eaw_width(string $char, ?string $locale = null): int
Calculate the display width of a Unicode character based on its East Asian Width property.
Parameters:
- $char (string): The character to calculate the width for (only the first character is used)
- $locale (string|null, optional): Locale for ambiguous character handling
Returns: int - Display width (1 or 2), or -1 on error
Examples:
echo icu4x_eaw_width('A'); // 1 (Narrow)
echo icu4x_eaw_width('あ'); // 2 (Wide)
echo icu4x_eaw_width('ｱ'); // 1 (Halfwidth, U+FF71)
echo icu4x_eaw_width('Ａ'); // 2 (Fullwidth, U+FF21)
echo icu4x_eaw_width('§'); // 1 (Ambiguous, default)
echo icu4x_eaw_width('§', 'ja'); // 2 (Ambiguous, Japanese locale)
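Because `icu4x_eaw_width` signals failure with `-1`, a caller that adds up widths can silently end up with a total that is too small. One way to guard against that is a thin wrapper that turns the error value into an exception. This is only a sketch: `eaw_width_or_fail` is a hypothetical name, the README does not say which inputs count as errors, and the `$measure` parameter exists purely so the wrapper can be exercised without the extension loaded.

```php
<?php
// Hypothetical wrapper: converts the documented -1 error value into an exception.
// $measure defaults to icu4x_eaw_width and is injectable for dry runs.
function eaw_width_or_fail(string $char, ?string $locale = null, ?callable $measure = null): int
{
    $measure ??= 'icu4x_eaw_width';
    $width = $measure($char, $locale);
    if ($width === -1) {
        throw new InvalidArgumentException('Could not determine the display width of the given character');
    }
    return $width;
}

// Only call the real function when the extension is actually available.
if (function_exists('icu4x_eaw_width')) {
    echo eaw_width_or_fail('あ'), PHP_EOL;
}
```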
Class API
ICU4X\Segmenter
Main segmenter class for text segmentation.
Constructor:
new ICU4X\Segmenter(string $mode = 'grapheme', ?string $locale = null)
Methods:
- segment(string $text): ICU4X\SegmentIterator - Segment the input text
- getMode(): string - Get the current segmentation mode
- getLocale(): ?string - Get the current locale
ICU4X\SegmentIterator
Iterator class for accessing segmentation results.
Implements: IteratorAggregate, Countable
Methods:
- count(): int - Get the number of segments
- getIterator(): ICU4X\InternalIterator - Get internal iterator
- toArray(): array - Convert to array
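Since a `SegmentIterator` can be iterated or converted with `toArray()`, its segments can be fed to ordinary array code. As an illustration, the sketch below truncates text on grapheme boundaries so that an emoji sequence is never cut in half. `truncate_segments` is a hypothetical helper, not part of the extension; it accepts any iterable of strings so it can also be tried with a plain array.

```php
<?php
// Hypothetical helper: keeps at most $max segments and appends $ellipsis
// when something was cut off. Works with arrays and Traversable objects,
// e.g. the result of icu4x_segmenter($text) or its toArray() output.
function truncate_segments(iterable $segments, int $max, string $ellipsis = '…'): string
{
    $max = max(0, $max);
    $parts = is_array($segments) ? $segments : iterator_to_array($segments, false);
    if (count($parts) <= $max) {
        return implode('', $parts);
    }
    return implode('', array_slice($parts, 0, $max)) . $ellipsis;
}

// Dry run with a plain array, no extension required.
echo truncate_segments(['H', 'e', 'l', 'l', 'o'], 3), PHP_EOL; // Hel…
```

With the extension loaded, the same helper would be called as `truncate_segments(icu4x_segmenter($text), 10)`.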
Examples
Basic Text Segmentation
// English text
$text = "Hello, world!";
$segments = icu4x_segmenter($text);
// Output: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']
Unicode and Emoji Support
// Japanese text with emoji
$text = "こんにちは👋世界";
$segments = icu4x_segmenter($text);
// Output: ['こ', 'ん', 'に', 'ち', 'は', '👋', '世', '界']

// Complex emoji sequences (four emoji joined by U+200D ZERO WIDTH JOINER)
$text = "👨‍👩‍👧‍👦";
$segments = icu4x_segmenter($text);
// Output: ['👨‍👩‍👧‍👦'] (single family emoji)
Working with SPL Interfaces
$iterator = icu4x_segmenter("Hello");

// Countable interface
echo count($iterator); // 5

// IteratorAggregate interface
foreach ($iterator as $index => $segment) {
    echo "Position $index: $segment\n";
}

// Check interface implementation
var_dump($iterator instanceof Countable); // true
var_dump($iterator instanceof IteratorAggregate); // true
East Asian Width Examples
// Basic width calculation
$text = "Hello世界";
$width = 0;
for ($i = 0; $i < mb_strlen($text); $i++) {
    $char = mb_substr($text, $i, 1);
    $width += icu4x_eaw_width($char);
}
echo "Display width: $width\n"; // 9

// Locale-specific handling of ambiguous characters
$ambiguous = "§±×÷";
echo "Default: ";
for ($i = 0; $i < mb_strlen($ambiguous); $i++) {
    $char = mb_substr($ambiguous, $i, 1);
    echo icu4x_eaw_width($char);
}
echo "\n"; // 1111

echo "Japanese: ";
for ($i = 0; $i < mb_strlen($ambiguous); $i++) {
    $char = mb_substr($ambiguous, $i, 1);
    echo icu4x_eaw_width($char, 'ja');
}
echo "\n"; // 2222
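The two APIs can also be combined for column layout: segment the text into grapheme clusters, then measure each cluster once instead of each code point. The sketch below pads a string to a target number of columns. It is an approximation under stated assumptions: `pad_to_columns` is a hypothetical helper, each cluster is measured by its first character (which is all `icu4x_eaw_width` looks at), a `-1` result is counted as zero, and the result may not match how a particular terminal renders flags or other multi-code-point emoji. The `$segment` and `$measure` parameters default to the extension functions and are injectable so the logic can be tried without the extension.

```php
<?php
// Hypothetical helper: right-pads $text with spaces up to $columns display columns.
function pad_to_columns(string $text, int $columns, ?callable $segment = null, ?callable $measure = null): string
{
    $segment ??= 'icu4x_segmenter';
    $measure ??= 'icu4x_eaw_width';

    $width = 0;
    foreach ($segment($text) as $cluster) {
        // One measurement per grapheme cluster; the -1 error value adds nothing.
        $width += max(0, $measure($cluster));
    }
    // Text that is already wider than $columns is returned unchanged.
    return $text . str_repeat(' ', max(0, $columns - $width));
}

// Only call the real functions when the extension is actually available.
if (function_exists('icu4x_segmenter') && function_exists('icu4x_eaw_width')) {
    echo '|', pad_to_columns('Hello', 8), '|', PHP_EOL;
    echo '|', pad_to_columns('世界', 8), '|', PHP_EOL;
}
```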
Testing
Run the test suite:
# Build in debug mode
cargo build
# Basic functionality test
php -d extension=target/debug/libicu4x.so tests/basic_test.php
# Function API test
php -d extension=target/debug/libicu4x.so tests/function_test.php
# East Asian Width test
php -d extension=target/debug/libicu4x.so tests/eaw_width_test.php
For release builds:
# Build in release mode
cargo build --release
# Run tests with release build
php -d extension=target/release/libicu4x.so tests/basic_test.php
Performance
The extension is built on Rust and ICU4X, providing:
- High Performance: Rust's zero-cost abstractions and ICU4X optimizations
- Memory Efficiency: Minimal memory overhead with proper resource management
- Unicode Compliance: Full Unicode 15.0+ support with correct grapheme cluster handling
Supported Unicode Features
- ✅ Grapheme cluster segmentation
- ✅ East Asian Width property calculation
- ✅ Emoji sequences (including ZWJ sequences)
- ✅ Complex scripts (Arabic, Devanagari, etc.)
- ✅ Regional indicator sequences (flag emojis)
- ✅ Modifier sequences
- ✅ Locale-aware ambiguous character handling
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
Development Setup
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Build
cargo build
# Run tests
php -d extension=target/debug/libicu4x.so tests/basic_test.php
Commit Message Convention
This project follows Conventional Commits:
- feat: New features
- fix: Bug fixes
- docs: Documentation changes
- chore: Maintenance tasks
- test: Test additions or modifications
License
[Add your license information here]
Acknowledgments
- ICU4X - Unicode components for Rust
- ext-php-rs - PHP extension framework for Rust
- Unicode Consortium - Unicode standards