ecourty/text-chunker

github.com/EdouardCourty/PHPTextChunker

pkg:composer/ecourty/text-chunker


A framework-agnostic PHP library for splitting text and files into meaningful chunks, using pluggable strategies and a composable post-processing pipeline.


Installation

composer require ecourty/text-chunker

Requirements: PHP >= 8.3

Core Features

  • 9 built-in strategies: paragraph, sentence, fixed-size, dialogue, markdown, word count, regex, line, recursive
  • 8 built-in post-processors: overlap, token limit, metadata enrichment, filtering, chunk merger, text normalization, deduplication, regex replace
  • Streaming architecture: processes large files in 8KB buffers — minimal memory usage
  • Works with files and strings: setFile() or setText()
  • Fully extensible: implement your own strategies and post-processors
  • Zero framework dependencies

Quick Start

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setFile('document.txt')->chunk(new ParagraphChunkingStrategy()) as $chunk) {
    echo $chunk->getText();       // chunk content
    echo $chunk->getPosition();   // index in the sequence
    print_r($chunk->getMetadata()); // strategy, length, etc.
}

Chunk from a string:

use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setText($myText)->chunk(new SentenceChunkingStrategy()) as $chunk) {
    // ...
}

Chunking Strategies

Strategy | Splits on | Key options
ParagraphChunkingStrategy | Double newlines (\n\n) | (none)
SentenceChunkingStrategy | Sentence-ending punctuation (. ! ?) | (none)
FixedSizeChunkingStrategy | Fixed character count | chunkSize (default: 1000)
DialogueChunkingStrategy | Dialogue lines, context-aware grouping | targetChunkSize, minChunkSize
MarkdownChunkingStrategy | Markdown headers (# to ######) | minHeadingLevel, maxHeadingLevel
WordCountChunkingStrategy | Fixed word count, respects word boundaries | wordCount (default: 200)
RegexChunkingStrategy | Configurable regex pattern | pattern, delimiterPosition (None / Prefix / Suffix)
LineChunkingStrategy | N consecutive lines per chunk | linesPerChunk (default: 10)
RecursiveChunkingStrategy | Cascade of strategies with a size limit | strategies[], maxChunkSize

RecursiveChunkingStrategy applies strategies[0] to the stream, and immediately re-splits any chunk exceeding maxChunkSize using strategies[1], then strategies[2], etc. Streaming-safe — never buffers more than one chunk at a time.
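
A minimal sketch of such a cascade: split on paragraphs first, fall back to sentences, then to a fixed size. The named arguments (strategies, maxChunkSize, chunkSize) mirror the option names in the table above and are assumptions about the exact constructor signatures; check the classes before relying on them.

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\RecursiveChunkingStrategy;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;
use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;
use Ecourty\TextChunker\Strategy\FixedSizeChunkingStrategy;

// Split on paragraphs; any chunk over 1000 characters is re-split on
// sentences, and as a last resort into 500-character pieces.
$strategy = new RecursiveChunkingStrategy(
    strategies: [
        new ParagraphChunkingStrategy(),
        new SentenceChunkingStrategy(),
        new FixedSizeChunkingStrategy(chunkSize: 500),
    ],
    maxChunkSize: 1000,
);

foreach ((new TextChunker())->setFile('document.txt')->chunk($strategy) as $chunk) {
    echo $chunk->getText();
}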

Post-Processors

Post-processors are applied in sequence after chunking. Chain them with withPostProcessor().

Post-processor | Description | Key options
OverlappingChunkPostProcessor | Prepends the tail of the previous chunk for context continuity | overlapSize (default: 200)
TokenLimitPostProcessor | Splits chunks exceeding a token budget | maxTokens, charactersPerToken
MetadataEnricherPostProcessor | Adds chunk_index, total_chunks, word_count, char_count, source | (none)
ChunkFilterPostProcessor | Removes empty or too-short chunks | minLength, removeEmpty
ChunkMergerPostProcessor | Merges consecutive small chunks until minChunkSize is reached | minChunkSize (default: 200), separator
TextNormalizationPostProcessor | Collapses whitespace, trims lines, strips control characters | collapseWhitespace, trimLines, stripControlChars
DeduplicationPostProcessor | Removes duplicate chunks by md5 content hash; adds content_hash metadata | (none)
RegexReplacePostProcessor | Applies ordered [pattern => replacement] substitutions to each chunk's text | replacements[]
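
For example, a pipeline that normalizes whitespace, drops very short chunks, and adds overlap for context continuity. This is a sketch: the PostProcessor namespace, the named constructor options, and the assumption that processors run in the order they are chained are taken from the table and prose above, not verified against the source.

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;
use Ecourty\TextChunker\PostProcessor\TextNormalizationPostProcessor;
use Ecourty\TextChunker\PostProcessor\ChunkFilterPostProcessor;
use Ecourty\TextChunker\PostProcessor\OverlappingChunkPostProcessor;

$chunker = (new TextChunker())
    ->setFile('document.txt')
    ->withPostProcessor(new TextNormalizationPostProcessor())
    ->withPostProcessor(new ChunkFilterPostProcessor(minLength: 20))
    ->withPostProcessor(new OverlappingChunkPostProcessor(overlapSize: 100));

foreach ($chunker->chunk(new SentenceChunkingStrategy()) as $chunk) {
    echo $chunk->getText();
}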

Configuration Reference

TextChunker

Method | Description
setFile(string $path) | Set source file (streamed)
setText(string $text) | Set source string
withMetadata(array $meta) | Attach global metadata to every chunk
withPostProcessor(...) | Add a post-processor to the pipeline
withPostProcessors(...) | Add multiple post-processors at once (variadic)
withReader(ReaderInterface) | Inject a custom reader (see below)
chunk(ChunkingStrategyInterface) | Returns a Generator<Chunk>
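
A sketch combining these methods in a single fluent call (the post-processor namespace and constructor options are assumptions, as noted above):

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\MarkdownChunkingStrategy;
use Ecourty\TextChunker\PostProcessor\MetadataEnricherPostProcessor;
use Ecourty\TextChunker\PostProcessor\DeduplicationPostProcessor;

$chunks = (new TextChunker())
    ->setFile('handbook.md')
    ->withMetadata(['source' => 'handbook.md', 'lang' => 'en']) // attached to every chunk
    ->withPostProcessors(
        new MetadataEnricherPostProcessor(),
        new DeduplicationPostProcessor(),
    )
    ->chunk(new MarkdownChunkingStrategy());

foreach ($chunks as $chunk) {
    print_r($chunk->getMetadata());
}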

Chunk

Method | Returns
getText() | string — the chunk content
getPosition() | int — index in the sequence
getMetadata() | array — associated metadata
getLength() | int — character count
withMetadata(array) | New Chunk with merged metadata
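
For example (a short sketch; the metadata key added here is purely illustrative):

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

foreach ((new TextChunker())->setText($myText)->chunk(new ParagraphChunkingStrategy()) as $chunk) {
    if ($chunk->getLength() < 50) {
        continue; // skip very short chunks
    }

    // withMetadata() returns a new Chunk with the extra keys merged
    // into the existing metadata.
    $reviewed = $chunk->withMetadata(['reviewed' => false]);
    print_r($reviewed->getMetadata());
}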

Custom Readers

By default, setFile() reads from the local filesystem via LocalFileReader. To read from a remote source (S3, Azure Blob, SFTP, etc.), implement ReaderInterface and inject it via withReader().

ReaderInterface has a single method: readChunks(string $path, int $bufferSize): \Generator<string>. Yield string chunks of arbitrary size — the chunking strategies handle the rest. The $path passed to readChunks() is whatever string you gave to setFile(), so it can be an S3 key, a URI, or any identifier your reader understands.

Example with Flysystem (works with S3, Azure, SFTP, GCS, and more):

use League\Flysystem\Filesystem;
use Ecourty\TextChunker\Contract\ReaderInterface;
use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

class FlysystemReader implements ReaderInterface
{
    public function __construct(private Filesystem $filesystem) {}

    public function readChunks(string $path, int $bufferSize): \Generator
    {
        // Open a read stream on the remote source and yield it in
        // $bufferSize slices, so the file is never fully loaded in memory.
        $stream = $this->filesystem->readStream($path);

        try {
            while (!feof($stream)) {
                $data = fread($stream, $bufferSize);
                if ($data === false) {
                    break;
                }
                yield $data;
            }
        } finally {
            fclose($stream);
        }
    }
}

// S3 example
$adapter = new \League\Flysystem\AwsS3V3\AwsS3V3Adapter($s3Client, 'my-bucket');
$filesystem = new Filesystem($adapter);

foreach (
    (new TextChunker())
        ->withReader(new FlysystemReader($filesystem))
        ->setFile('documents/report.txt')  // S3 key
        ->chunk(new ParagraphChunkingStrategy())
    as $chunk
) {
    echo $chunk->getText();
}

Performance

Benchmarked with PHPBench on real-world datasets (Bible KJV, Les Misérables, Encyclopaedia Britannica 11th Ed.). See BENCHMARKS.md for the full results.

Strategy throughput (Bible KJV, 4.26 MB):

Strategy | Time | Throughput
SentenceChunkingStrategy | 43 ms | ~98 MB/s
FixedSizeChunkingStrategy | 44 ms | ~98 MB/s
LineChunkingStrategy | 46 ms | ~93 MB/s
ParagraphChunkingStrategy | 293 ms | ~15 MB/s
WordCountChunkingStrategy | 377 ms | ~11 MB/s

Post-processor overhead (50 KB excerpt): all 8 processors run in < 3 ms. Chain freely.

The library is streaming-first — most strategies hold only ~2 MB in memory regardless of input file size.

Datasets

The datasets/ directory contains large text corpora used for benchmarking chunking strategies. All texts are public domain sourced from Project Gutenberg.

File | Source | Size | Notes
bible_kjv.txt | King James Bible (PG #10) | ~4.5 MB | Great for sentence and paragraph benchmarks
les_miserables.txt | Les Misérables by Victor Hugo (PG #17489–17496) | ~2.6 MB | All 5 tomes in French, ideal for paragraph chunking
britannica/ | Encyclopaedia Britannica, 11th Edition | ~118 MB | 92 volumes of dense encyclopaedic text

Headers and Project Gutenberg license preambles can be stripped before benchmarking so that only the clean text is measured.
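
One way to do that before chunking, using only PHP built-ins (a rough sketch; the exact marker wording varies between Project Gutenberg releases):

use Ecourty\TextChunker\TextChunker;

$raw = file_get_contents('datasets/bible_kjv.txt');

// Keep only the text between the "*** START OF ... ***" and "*** END OF ..."
// markers that wrap Project Gutenberg releases.
if (preg_match('/\*{3} START OF.*?\*{3}(.*)\*{3} END OF/s', $raw, $matches)) {
    $raw = trim($matches[1]);
}

$chunker = (new TextChunker())->setText($raw);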

Development

# Install dependencies
composer install

# Run tests
composer test

# Run PHPStan (level max)
composer phpstan

# Run CS fixer
composer cs-fix

# Run all checks
composer qa

Extending the library

Implement ChunkingStrategyInterface to create a custom strategy, or ChunkPostProcessorInterface for a custom post-processor. See AGENTS.md for detailed guidelines.