README

A PHP extension for Unicode text segmentation using ICU4X, built with ext-php-rs.

Features

Grapheme Cluster Segmentation: Proper handling of Unicode text including emojis and complex scripts
East Asian Width Support: Calculate display width of Unicode characters for proper text layout
Multiple APIs: Both object-oriented class API and functional API
SPL Interface Support: Segment iterators implement Countable and IteratorAggregate, so count() and foreach work directly
ICU4X 2.0: Built on the ICU4X 2.0 Unicode library
Memory Efficient: Rust-based implementation with zero-copy optimizations

Installation

Via PIE (Recommended)

PIE (PHP Installer for Extensions) allows you to install this extension directly from packagist.org.

pie install masakielastic/icu4x

Then add the extension to your php.ini:

extension=icu4x

Manual Build

Prerequisites

Rust 1.70+
PHP 8.1+
ext-php-rs 0.14.0

Clone the repository:

git clone https://github.com/masakielastic/php-ext-icu4x.git
cd php-ext-icu4x

Build and install the extension:

cargo build --release
sudo cp target/release/libicu4x.so $(php-config --extension-dir)/icu4x.so

Add to php.ini:

echo "extension=icu4x" | sudo tee -a "$(php-config --ini-path)/php.ini"
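
To confirm that PHP picked up the extension after php.ini was edited, a quick check along these lines can help. It is a minimal sketch that assumes the module registers under the name icu4x (matching the extension=icu4x line); icu4x_available() is a hypothetical helper, not part of the extension.

```php
<?php

// Hypothetical helper (not part of the extension): reports whether the
// module is loaded. Assumes it registers under the name "icu4x".
function icu4x_available(): bool
{
    return extension_loaded('icu4x') && function_exists('icu4x_segmenter');
}

if (icu4x_available()) {
    echo "icu4x extension is loaded\n";
} else {
    // `php --ini` lists the configuration files PHP actually reads.
    echo "icu4x extension is not loaded; run `php --ini` to find the active php.ini\n";
}
```

If the check fails, the CLI and the web server SAPI may be reading different php.ini files, so it is worth running the check in both environments.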

Usage

Function API (Recommended)

<?php

// Basic usage
$iterator = icu4x_segmenter("Hello World");
foreach ($iterator as $segment) {
    echo $segment . "\n";
}

// With parameters
$iterator = icu4x_segmenter("こんにちは👋世界", "grapheme", null);
echo "Total segments: " . count($iterator) . "\n";

// Complex Unicode text
$text = "🇺🇸🏳️‍🌈👨‍👩‍👧‍👦";
$segments = icu4x_segmenter($text);
foreach ($segments as $i => $segment) {
    echo "[$i] => '$segment'\n";
}

Class API

<?php

// Create segmenter instance
$segmenter = new ICU4X\Segmenter('grapheme', null);

// Segment text
$iterator = $segmenter->segment("Hello World");

// Iterate over segments
foreach ($iterator as $segment) {
    echo $segment . "\n";
}

// Use SPL interfaces
echo "Count: " . count($iterator) . "\n";
echo "Is countable: " . ($iterator instanceof Countable ? "Yes" : "No") . "\n";
echo "Is iterable: " . ($iterator instanceof IteratorAggregate ? "Yes" : "No") . "\n";

API Reference

Function API

`icu4x_segmenter(string $text, string $mode = 'grapheme', ?string $locale = null): ICU4X\SegmentIterator`

Segments the input text into grapheme clusters.

Parameters:

$text (string): The text to segment
$mode (string, optional): Segmentation mode; currently only 'grapheme' is supported
$locale (string|null, optional): Locale for segmentation rules

Returns: ICU4X\SegmentIterator - An iterator over text segments

`icu4x_eaw_width(string $char, ?string $locale = null): int`

Calculate the display width of a Unicode character based on its East Asian Width property.

Parameters:

$char (string): The character to calculate the width for (only the first character is used)
$locale (string|null, optional): Locale for ambiguous character handling

Returns: int - Display width (1 or 2), or -1 on error

Examples:

echo icu4x_eaw_width('A');        // 1 (Narrow)
echo icu4x_eaw_width('あ');       // 2 (Wide)
echo icu4x_eaw_width('ｱ');        // 1 (Halfwidth)
echo icu4x_eaw_width('Ａ');       // 2 (Fullwidth)
echo icu4x_eaw_width('§');        // 1 (Ambiguous, default)
echo icu4x_eaw_width('§', 'ja');  // 2 (Ambiguous, Japanese locale)

Class API

`ICU4X\Segmenter`

Main segmenter class for text segmentation.

Constructor:

new ICU4X\Segmenter(string $mode = 'grapheme', ?string $locale = null)

Methods:

segment(string $text): ICU4X\SegmentIterator - Segment the input text
getMode(): string - Get the current segmentation mode
getLocale(): ?string - Get the current locale

`ICU4X\SegmentIterator`

Iterator class for accessing segmentation results.

Implements: IteratorAggregate, Countable

Methods:

count(): int - Get the number of segments
getIterator(): ICU4X\InternalIterator - Get internal iterator
toArray(): array - Convert to array
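
As an illustration of where toArray() is useful, the sketch below truncates a string by grapheme clusters rather than by bytes or code points, so an emoji sequence is never cut in half. truncate_clusters() is a hypothetical helper, and the code assumes toArray() returns a flat list of cluster strings in order.

```php
<?php

/**
 * Hypothetical helper: keep at most $limit grapheme clusters and append an
 * ellipsis when something was cut. $segments is a list of cluster strings,
 * for example the result of icu4x_segmenter($text)->toArray().
 */
function truncate_clusters(array $segments, int $limit, string $ellipsis = '…'): string
{
    if (count($segments) <= $limit) {
        return implode('', $segments);
    }

    return implode('', array_slice($segments, 0, $limit)) . $ellipsis;
}

// Only runs when the extension is loaded.
if (function_exists('icu4x_segmenter')) {
    $segments = icu4x_segmenter("こんにちは👋世界")->toArray();
    echo truncate_clusters($segments, 6), "\n";
}
```

Keeping the helper independent of the iterator class means it can be exercised in unit tests with plain arrays.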

Examples

Basic Text Segmentation

// English text
$text = "Hello, world!";
$segments = icu4x_segmenter($text);
// Output: ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!']

Unicode and Emoji Support

// Japanese text with emoji
$text = "こんにちは👋世界";
$segments = icu4x_segmenter($text);
// Output: ['こ', 'ん', 'に', 'ち', 'は', '👋', '世', '界']

// Complex emoji sequences
$text = "👨‍👩‍👧‍👦";
$segments = icu4x_segmenter($text);
// Output: ['👨‍👩‍👧‍👦'] (single family emoji)

Working with SPL Interfaces

$iterator = icu4x_segmenter("Hello");

// Countable interface
echo count($iterator); // 5

// IteratorAggregate interface
foreach ($iterator as $index => $segment) {
    echo "Position $index: $segment\n";
}

// Check interface implementation
var_dump($iterator instanceof Countable);        // true
var_dump($iterator instanceof IteratorAggregate); // true

East Asian Width Examples

// Basic width calculation
$text = "Hello世界";
$width = 0;
for ($i = 0; $i < mb_strlen($text); $i++) {
    $char = mb_substr($text, $i, 1);
    $width += icu4x_eaw_width($char);
}
echo "Display width: $width\n"; // 9

// Locale-specific handling of ambiguous characters
$ambiguous = "§±×÷";
echo "Default: ";
for ($i = 0; $i < mb_strlen($ambiguous); $i++) {
    $char = mb_substr($ambiguous, $i, 1);
    echo icu4x_eaw_width($char);
}
echo "\n"; // 1111

echo "Japanese: ";
for ($i = 0; $i < mb_strlen($ambiguous); $i++) {
    $char = mb_substr($ambiguous, $i, 1);
    echo icu4x_eaw_width($char, 'ja');
}
echo "\n"; // 2222
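
The loops above walk the string code point by code point. For text that contains multi-code-point clusters, one option is to combine both functions: segment first, then measure each cluster. The following is a sketch under assumptions, not the extension's own API: display_width() and pad_to_width() are hypothetical helpers, and because icu4x_eaw_width() only inspects the first character of its argument, each cluster is measured by its first code point.

```php
<?php

/**
 * Hypothetical helper: sum the display width of each grapheme cluster.
 * $widthOf is any callable returning a width for a cluster string, for
 * example 'icu4x_eaw_width'. Non-positive results (such as the documented
 * -1 error value) are skipped.
 */
function display_width(iterable $clusters, callable $widthOf): int
{
    $total = 0;
    foreach ($clusters as $cluster) {
        $width = $widthOf($cluster);
        if ($width > 0) {
            $total += $width;
        }
    }

    return $total;
}

/** Hypothetical helper: right-pad $text with spaces up to $target columns. */
function pad_to_width(string $text, int $current, int $target): string
{
    return $text . str_repeat(' ', max(0, $target - $current));
}

// Only runs when the extension is loaded.
if (function_exists('icu4x_segmenter') && function_exists('icu4x_eaw_width')) {
    foreach (["Hello", "世界"] as $label) {
        $width = display_width(icu4x_segmenter($label), 'icu4x_eaw_width');
        echo pad_to_width($label, $width, 8), "|\n";
    }
}
```

Passing the width function as a callable keeps the helper usable with a locale-specific wrapper, for instance a closure that calls icu4x_eaw_width($cluster, 'ja').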

Testing

Run the test suite:

# Build in debug mode
cargo build

# Basic functionality test
php -d extension=target/debug/libicu4x.so tests/basic_test.php

# Function API test
php -d extension=target/debug/libicu4x.so tests/function_test.php

# East Asian Width test
php -d extension=target/debug/libicu4x.so tests/eaw_width_test.php

For release builds:

# Build in release mode
cargo build --release

# Run tests with release build
php -d extension=target/release/libicu4x.so tests/basic_test.php

Performance

The extension is built on Rust and ICU4X, providing:

High Performance: Rust's zero-cost abstractions and ICU4X optimizations
Memory Efficiency: Minimal memory overhead with proper resource management
Unicode Compliance: Full Unicode 15.0+ support with correct grapheme cluster handling

Supported Unicode Features

✅ Grapheme cluster segmentation
✅ East Asian Width property calculation
✅ Emoji sequences (including ZWJ sequences)
✅ Complex scripts (Arabic, Devanagari, etc.)
✅ Regional indicator sequences (flag emojis)
✅ Modifier sequences
✅ Locale-aware ambiguous character handling

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Ensure all tests pass
Submit a pull request

Development Setup

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build
cargo build

# Run tests
php -d extension=target/debug/libicu4x.so tests/basic_test.php

Commit Message Convention

This project follows Conventional Commits:

`feat:` - New features
`fix:` - Bug fixes
`docs:` - Documentation changes
`chore:` - Maintenance tasks
`test:` - Test additions or modifications

License

[Add your license information here]

Acknowledgments

ICU4X - Unicode components for Rust
ext-php-rs - PHP extension framework for Rust
Unicode Consortium - Unicode standards
