masakielastic / striter
PHP extension for string iteration with grapheme, codepoint, and byte modes
Package info
github.com/masakielastic/php-ext-striter
Language: C
Type: php-ext
Ext name: ext-striter
pkg: composer/masakielastic/striter
Requires
- php: >=8.1
README
A PHP extension that provides advanced string iteration capabilities for UTF-8 strings with support for grapheme clusters, Unicode codepoints, and byte-level iteration.
Features
- Grapheme Cluster Iteration: Iterate over grapheme clusters (user-perceived characters) using PCRE2
- Unicode Codepoint Iteration: Iterate over individual Unicode codepoints
- Byte-level Iteration: Iterate over individual bytes for low-level string processing
- UTF-8 Safe: Proper handling of multibyte UTF-8 characters
- Standard PHP Interfaces: Implements the IteratorAggregate and Countable interfaces, so iterators work directly with foreach and count()
Installation
Requirements
- PHP 8.1 or higher
- PCRE2 library (libpcre2-dev)
Using PIE (Recommended)
PIE (PHP Installer for Extensions) is the recommended way to install this extension.
# Install PIE if you haven't already
composer global require php/pie

# Install the extension
pie install masakielastic/striter
PIE automatically handles building and enabling the extension.
Build from Source
# Install dependencies (Ubuntu/Debian)
sudo apt-get install libpcre2-dev

# Build extension
cd ext
phpize
./configure --enable-striter
make
sudo make install
Enable Extension
Add to your php.ini:
extension=striter.so
Usage
Basic Usage
<?php
// Create a string iterator
$iterator = str_iter("Hello World");

// Iterate using foreach
foreach ($iterator as $index => $char) {
    echo "[$index] => '$char'\n";
}
Iteration Modes
Grapheme Mode (Default)
Iterates over grapheme clusters (user-perceived characters):
<?php
$text = "Hello👋";
$iterator = str_iter($text, "grapheme");

foreach ($iterator as $index => $char) {
    echo "[$index] => '$char'\n";
}
// Output:
// [0] => 'H'
// [1] => 'e'
// [2] => 'l'
// [3] => 'l'
// [4] => 'o'
// [5] => '👋'
Codepoint Mode
Iterates over individual Unicode codepoints:
<?php
$text = "Hello👋";
$iterator = str_iter($text, "codepoint");

foreach ($iterator as $index => $char) {
    echo "[$index] => '$char'\n";
}
Byte Mode
Iterates over individual bytes:
<?php
$text = "Hello";
$iterator = str_iter($text, "byte");

foreach ($iterator as $index => $byte) {
    echo "[$index] => '" . ord($byte) . "'\n";
}
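To make the difference between the three modes concrete, the sketch below approximates each one with built-in PHP functions only. It does not call the extension, and `split_by_mode()` is a hypothetical helper name used for illustration, so treat it as a rough model of the behaviour described above rather than the extension's implementation.

```php
<?php
// Approximates the three iteration modes without ext-striter.
// split_by_mode() is a hypothetical helper, not part of the extension's API.
function split_by_mode(string $text, string $mode): array
{
    switch ($mode) {
        case 'grapheme':
            // \X matches one extended grapheme cluster; /u enables UTF-8 mode.
            preg_match_all('/\X/u', $text, $matches);
            return $matches[0];
        case 'codepoint':
            // An empty pattern in UTF-8 mode splits between codepoints.
            return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        case 'byte':
            return $text === '' ? [] : str_split($text);
        default:
            throw new InvalidArgumentException("Unknown mode: $mode");
    }
}

$text = "e\u{0301}!"; // "e" + COMBINING ACUTE ACCENT + "!"
foreach (['grapheme', 'codepoint', 'byte'] as $mode) {
    echo $mode, ': ', count(split_by_mode($text, $mode)), "\n";
}
// Prints:
// grapheme: 2
// codepoint: 3
// byte: 4
```

The accented letter is one user-perceived character, two codepoints, and three bytes, which is why the counts diverge as the mode becomes more low-level.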
Using Countable Interface
<?php
$text = "Hello👋";
$iterator = str_iter($text, "grapheme");

echo "Total characters: " . count($iterator) . "\n";
// Output: Total characters: 6
Using IteratorAggregate Interface
<?php
$text = "ABC";
$iterator = str_iter($text);

// Get inner iterator for advanced operations
$innerIterator = $iterator->getIterator();

foreach ($innerIterator as $key => $value) {
    echo "[$key] => '$value'\n";
}
API Reference
Functions
str_iter(string $str, string $mode = "grapheme")
Creates a new string iterator.
Parameters:
- $str (string): The string to iterate over
- $mode (string, optional): Iteration mode: "grapheme" (default), "codepoint", or "byte"
Returns: _StrIterIterator object
Iterator Methods
The returned iterator implements PHP's IteratorAggregate and Countable interfaces:
IteratorAggregate Methods:
getIterator(): Returns the iterator itself for nested iteration
Countable Methods:
count(): Returns the total number of elements in the iterator
Examples
Working with Emoji and Complex Characters
<?php
// Complex emoji with skin tone modifiers
$text = "👨‍👩‍👧‍👦👋🏽";
$iterator = str_iter($text, "grapheme");

foreach ($iterator as $index => $char) {
    echo "Grapheme $index: '$char'\n";
}
Processing Japanese Text
<?php
$text = "こんにちは世界";
$iterator = str_iter($text, "grapheme");

foreach ($iterator as $index => $char) {
    echo "Character $index: '$char'\n";
}
Binary Data Processing
<?php
$data = "\x48\x65\x6C\x6C\x6F"; // "Hello" in hex
$iterator = str_iter($data, "byte");

foreach ($iterator as $index => $byte) {
    echo "Byte $index: 0x" . dechex(ord($byte)) . "\n";
}
Technical Details
Grapheme Cluster Detection
The extension uses PCRE2's \X pattern to detect grapheme clusters, which properly handles:
- Base characters with combining marks
- Emoji sequences
- Regional indicator sequences
- Hangul syllable sequences
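The effect of `\X` can be observed from userland PHP, whose `preg_*` functions are also backed by PCRE2. The following is only an illustration of the pattern, not the extension's C code, and `grapheme_clusters()` is a hypothetical helper name. Emoji ZWJ sequences are left out because their segmentation may depend on the Unicode version of the PCRE2 build in use.

```php
<?php
// Illustration only: segments a string with PCRE2's \X via preg_match_all().
// grapheme_clusters() is a hypothetical helper, not part of ext-striter.
function grapheme_clusters(string $text): array
{
    // /u puts PCRE2 in UTF-8 mode; the call fails on malformed input.
    if (preg_match_all('/\X/u', $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $matches[0];
}

$samples = [
    'combining mark'     => "e\u{0301}",                // e + COMBINING ACUTE ACCENT
    'regional indicator' => "\u{1F1EF}\u{1F1F5}",       // regional indicators J + P
    'Hangul jamo'        => "\u{1112}\u{1161}\u{11AB}", // leading + vowel + trailing jamo
];

foreach ($samples as $label => $text) {
    $codepoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    printf(
        "%s: %d codepoint(s), %d grapheme cluster(s)\n",
        $label,
        count($codepoints),
        count(grapheme_clusters($text))
    );
}
```

Each sample consists of several codepoints but is reported as a single grapheme cluster.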
UTF-8 Validation
The extension includes proper UTF-8 validation and handles invalid sequences gracefully by treating them as individual bytes.
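One way to picture this fallback is a scanner that consumes a well-formed UTF-8 sequence when one starts at the current offset and otherwise emits a single byte. The sketch below is an assumption about the general approach, written in plain PHP; the constant `UTF8_CHAR` and the function `codepoints_or_bytes()` are hypothetical names, and the extension's actual C logic may differ.

```php
<?php
// Matches exactly one well-formed UTF-8 sequence at the search offset.
// The pattern runs in byte mode (no /u), so classes match raw bytes.
const UTF8_CHAR = '/\G(?:[\x00-\x7F]'
    . '|[\xC2-\xDF][\x80-\xBF]'
    . '|\xE0[\xA0-\xBF][\x80-\xBF]'
    . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
    . '|\xED[\x80-\x9F][\x80-\xBF]'
    . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
    . '|[\xF1-\xF3][\x80-\xBF]{3}'
    . '|\xF4[\x80-\x8F][\x80-\xBF]{2})/';

// Hypothetical helper: splits into codepoints, falling back to single bytes.
function codepoints_or_bytes(string $text): array
{
    $out = [];
    $len = strlen($text);
    for ($i = 0; $i < $len; ) {
        if (preg_match(UTF8_CHAR, $text, $m, 0, $i) === 1) {
            $out[] = $m[0];          // valid sequence: keep it whole
            $i += strlen($m[0]);
        } else {
            $out[] = $text[$i];      // invalid byte: emit it on its own
            $i++;
        }
    }
    return $out;
}

// "a", a stray 0xFF, "é", then a truncated three-byte sequence.
$parts = codepoints_or_bytes("a\xFF\u{E9}\xE3\x81");
echo implode(' ', array_map('bin2hex', $parts)), "\n";
// Prints: 61 ff c3a9 e3 81
```

Valid characters stay intact, while the stray and truncated bytes each become their own element, so no input byte is dropped.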
Memory Management
The extension properly manages memory for string copies and PCRE2 objects, preventing memory leaks.
Testing
Run the included test files:
php tests/test_basic.php
php tests/test_grapheme.php
php tests/test_byte_mode.php
php tests/test_emoji_bug.php
php tests/test_invalid_utf8.php
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
This project is open source. Please refer to the project's license file for details.
Changelog
Version 0.1.0
- Initial release
- Support for grapheme, codepoint, and byte iteration modes
- PCRE2 integration for proper grapheme cluster detection
- Full Iterator interface implementation