sorelvi / stream-reader
Stream reader with support for single-byte and multi-byte encodings
Requires
- php: >=8.2
Requires (Dev)
- infection/infection: ^0.29.8
- mikey179/vfsstream: ^1.6
- php-mock/php-mock-phpunit: ^2.15
- phpmd/phpmd: ^2.15
- phpstan/phpstan: ^2.1
- phpstan/phpstan-phpunit: ^2.0
- phpunit/phpunit: ^10.0
- slevomat/coding-standard: ^8.27
- squizlabs/php_codesniffer: ^4.0
This package is auto-updated.
Last update: 2026-05-08 19:03:00 UTC
README
Industrial-grade, memory-efficient streaming reader for PHP.
Reading gigabytes of text data (logs, SQL dumps, CSVs) in PHP can be tricky. Standard fread breaks multi-byte characters (UTF-8, UTF-16) if a chunk boundary splits a character in half. Sorelvi\StreamReader solves this problem by strictly analyzing byte sequences and ensuring every yielded chunk contains only valid, complete characters.
Key Features
- Memory Efficient: Uses Generators (
yield) to process files of any size (even terabytes) with constant O(1) memory usage. - Multibyte Safe: Guaranteed character integrity. Never splits a UTF-8 emoji or a UTF-16 surrogate pair across chunks.
- Zero Dependencies: Works on pure PHP (no
mbstringrequired). - Polyglot: Built-in support via
Presetenums:- UTF-8 (optimized bitwise scanning)
- UTF-16 (LE/BE auto-detection via BOM or manual config)
- UTF-32
- Fixed-byte encodings (ASCII, Windows-1251, ISO-8859-1 via
Preset::BYTE1) - Fixed-width protocols (
Preset::BYTE2throughPreset::BYTE10)
- Network/Pipe Ready: Built-in retry logic handles slow or non-blocking network streams.
- Resumable: Architecture supports pausing and resuming reading via
HandleContext. - Extensible: Full support for custom
StreamInterfaceandEstimatorInterfaceimplementations.
Requirements
- PHP 8.2 or higher
Installation
composer require sorelvi/stream-reader
Quick Start
Reading a Large File
The easiest way to read a file is using the static factory method with a Preset.
use Sorelvi\StreamReader\Reader; use Sorelvi\StreamReader\Enum\Preset; use Sorelvi\StreamReader\Exception\StreamReaderException; try { // 1. Create a reader for a UTF-8 file $reader = Reader::createForFile('/path/to/huge_dump.sql'); // Optional: Adjust chunk size (default is 64KB) $reader->setChunkLength(8192); // 8KB // 2. Iterate cleanly foreach ($reader->readChunk() as $chunk) { // $chunk is guaranteed to end on a valid character boundary // e.g. send to database or parse CSV line process_data($chunk); } } catch (StreamReaderException $e) { // Handles IO errors, corrupted streams, and encoding violations error_log('[' . $e->getCode() . '] ' . $e->getMessage()); }
Reading from String
use Sorelvi\StreamReader\Reader; $sql = "INSERT INTO ... ๐"; // String with emojis $reader = Reader::createForString($sql); foreach ($reader->readChunk() as $chunk) { echo $chunk; }
Per-call Chunk Size
The readChunk() method accepts an optional chunk size that overrides the instance default for that single call:
$reader = Reader::createForString('Hello World'); $reader->setChunkLength(2); // default: 2 bytes $first = $reader->readChunk(1)->current(); // 'H' โ one-off override $second = $reader->readChunk()->current(); // 'el' โ uses default 2
Integrity vs. Validation
Important: This library focuses on Stream Integrity, not Content Validation.
- What it DOES: It ensures that chunks are not split in the middle of a byte sequence defined by the encoding (e.g., it won't cut a 4-byte Emoji in half).
- What it does NOT: It does not validate if the bytes actually represent valid characters in that encoding. If you read a binary file as UTF-8, the reader will still process it without error, ensuring "structural" completeness based on UTF-8 bit-mask rules, even if the resulting output is garbage.
Garbage In โ Complete Garbage Out (not Half-Garbage).
Advanced Usage
Dependency Injection (Manual Construction)
For strict control over the estimator or context, you can instantiate the class manually.
use Sorelvi\StreamReader\Reader; use Sorelvi\StreamReader\StreamCompletenessEstimatorFactory; use Sorelvi\StreamReader\Enum\Preset; use Sorelvi\StreamReader\HandleContext; // 1. Prepare resources $inputFile = fopen('data.csv', 'rb'); $context = new HandleContext(); // 2. Create the specific estimator $estimator = StreamCompletenessEstimatorFactory::create(Preset::UTF16LE); // 3. Instantiate Reader with explicit dependencies $reader = new Reader($inputFile, $estimator, $context); foreach ($reader->readChunk() as $chunk) { // ... }
Resume Reading (Context Management)
The HandleContext tracks the number of bytes read. You can persist this state to resume reading later (e.g., between HTTP requests or background jobs).
use Sorelvi\StreamReader\Reader; use Sorelvi\StreamReader\HandleContext; use Sorelvi\StreamReader\Enum\Preset; // Step 1: Read some data $context = new HandleContext(); $reader = Reader::createForFile('large.log', $context, Preset::UTF16); foreach ($reader->readChunk() as $chunk) { process($chunk); if (should_pause()) { break; } } // Step 2: Save context $state = $context->toArray(); save_to_db($state); // --- Later --- // Step 3: Resume $newContext = HandleContext::fromArray(load_from_db()); // The reader will automatically seek to the correct position $reader = Reader::createForFile('large.log', $newContext, Preset::UTF16);
HandleContext API
| Method | Description |
|---|---|
getTotalReadBytes(): int |
Returns total bytes consumed so far. |
addTotalReadBytes(int): void |
Increments the byte counter. |
setTotalReadBytes(int): void |
Overrides the byte counter. |
resetTotalReadBytes(): void |
Resets counter to zero. |
toArray(): array |
Serializes context to a plain array (for persistence). |
fromArray(array): self |
Restores context from a previously serialized array. |
setParam(string $part, string $key, scalar|null): self |
Stores arbitrary state (used internally by estimators). |
getParam(string $part, string $key, $default): scalar|null |
Retrieves stored state. |
hasParam(string $part, string $key): bool |
Checks if a parameter is stored. |
Supporting UTF-16
The reader intelligently handles UTF-16 endianness:
- Uses the explicitly configured endianness if
Preset::UTF16LEorPreset::UTF16BEis specified. - For
Preset::UTF16(auto mode), detects endianness from the Byte Order Mark (BOM):0xFF 0xFEโ LE, anything else โ BE (per RFC 2781). - Correctly handles Surrogate Pairs (4-byte characters in UTF-16) to ensure they are not split across chunks.
- The detected endianness is stored in
HandleContextand reused on subsequent calls, ensuring consistency across resumed sessions.
// Auto-detect BOM $reader = Reader::createForFile('export_utf16.csv', null, Preset::UTF16); // Or specify explicitly $reader = Reader::createForFile('export_utf16le.csv', null, Preset::UTF16LE);
Network and Pipe Streams
When working with non-blocking or slow network streams, fread may return an empty string without reaching EOF. The Stream class handles this transparently with configurable retry logic.
use Sorelvi\StreamReader\Stream; use Sorelvi\StreamReader\Reader; use Sorelvi\StreamReader\StreamCompletenessEstimatorFactory; use Sorelvi\StreamReader\Enum\Preset; $socket = fsockopen('tcp://example.com', 9000); $stream = new Stream($socket); $stream->setMaxEmptyAttempts(50); // max retries on empty reads (default: 20) $stream->setAttemptsDelayMicroseconds(5000); // delay between retries in ยตs (default: 10000) $estimator = StreamCompletenessEstimatorFactory::create(Preset::UTF8); $reader = new Reader($stream, $estimator); foreach ($reader->readChunk() as $chunk) { // ... }
If the stream returns empty data more times than maxEmptyAttempts, a TooManyEmptyAttempts exception is thrown.
Custom Estimator
You can implement EstimatorInterface to support any custom encoding or fixed-width binary protocol.
use Sorelvi\StreamReader\EstimatorInterface; use Sorelvi\StreamReader\HandleContext; use Sorelvi\StreamReader\Reader; class MyProtocolEstimator implements EstimatorInterface { public function handle(string $buffer, HandleContext $context): int { // Return 0 if the buffer ends on a complete unit, // or N > 0 to request N more bytes before yielding. $remainder = strlen($buffer) % 7; // 7-byte fixed records return $remainder ? 7 - $remainder : 0; } public function getMaxAddReadBytes(): int { return 6; // max bytes ever requested in one handle() call } } $reader = Reader::createForFile('records.bin', null, new MyProtocolEstimator());
Custom Stream
You can implement StreamInterface to wrap any data source (e.g., in-memory buffers, S3 streams, database BLOBs).
use Sorelvi\StreamReader\StreamInterface; class MyCustomStream implements StreamInterface { public function isEndOfStream(): bool { /* ... */ } public function read(int $length): string { /* ... */ } public function skip(int $offset): void { /* ... */ } public function getCurrentPosition(): int { /* ... */ } } $reader = new Reader(new MyCustomStream(), $estimator);
Supported Presets
Use the Sorelvi\StreamReader\Enum\Preset enum to select your encoding:
| Preset | Description |
|---|---|
Preset::UTF8 |
Variable-width (1โ4 bytes), bitwise scan. |
Preset::UTF16 |
Variable-width (2 or 4 bytes), BOM auto-detect. |
Preset::UTF16LE |
UTF-16 Little Endian, explicit. |
Preset::UTF16BE |
UTF-16 Big Endian, explicit. |
Preset::UTF32 |
Fixed 4-byte width. Equivalent to Preset::BYTE4. |
Preset::BYTE1 |
Fixed 1-byte (ASCII, Windows-1251, ISO-8859-*, KOI8-R, etc.). |
Preset::BYTE2 |
Fixed 2-byte width. |
Preset::BYTE3 |
Fixed 3-byte width. |
Preset::BYTE4 |
Fixed 4-byte width. |
Preset::BYTE5โPreset::BYTE10 |
Fixed 5โ10 byte widths for custom binary protocols. |
Exception Reference
All library exceptions extend Sorelvi\StreamReader\Exception\StreamReaderException.
| Exception | Code | Thrown When |
|---|---|---|
FileNotAccessible |
303โ305 | File does not exist, is a directory, or is not readable. |
IsNotStream |
โ | Constructor receives a non-resource argument. |
CanNotCreateStream |
โ | fopen() fails internally. |
CanNotReadZeroBytes |
โ | Stream::read() called with $length < 1. |
CanNotReadZeroChunk |
โ | Reader::readChunk() called with $chunkLength < 1. |
ChunkLengthMustBePositive |
โ | Reader::setChunkLength() called with value < 1. |
ErrorReadingFromStream |
301, 302 | Stream became invalid or fread() returned false. |
TooManyEmptyAttempts |
โ | Empty-read retry limit exceeded (network/pipe streams). |
StreamDamaged |
101โ107 | Invalid or incomplete byte sequence detected in stream. |
CanNotRestoreReadingStream |
306 | Context byte offset cannot be seeked to in the stream. |
MaxAddReadMustBeZeroOrPositive |
โ | Custom estimator returns a negative value from getMaxAddReadBytes(). |
ErrorCode Enum
Sorelvi\StreamReader\Exception\ErrorCode provides integer codes for programmatic error handling:
use Sorelvi\StreamReader\Exception\StreamDamaged; use Sorelvi\StreamReader\Exception\ErrorCode; try { foreach ($reader->readChunk() as $chunk) { /* ... */ } } catch (StreamDamaged $e) { match ($e->getCode()) { ErrorCode::INCOMPLETE_SEQUENCE_AT_EOF->value => handleTruncated(), ErrorCode::ORPHAN_CONTINUATION_BYTE->value => handleCorrupted(), ErrorCode::INVALID_START_BYTE_DETECTED->value => handleInvalid(), default => throw $e, }; }
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please ensure to run static analysis tools (composer check) before submitting.
See CONTRIBUTING.md for full development workflow.
License
Licensed under the Apache License, Version 2.0.
Copyright 2026 Sorelvi.