ecourty/text-chunker

github.com/EdouardCourty/PHPTextChunker

pkg:composer/ecourty/text-chunker


A framework-agnostic PHP library for splitting text and files into meaningful chunks, using pluggable strategies and a composable post-processing pipeline.


Installation

composer require ecourty/text-chunker

Requirements: PHP >= 8.3

Core Features

  • 9 built-in strategies: paragraph, sentence, fixed-size, dialogue, markdown, word count, regex, line, recursive
  • 8 built-in post-processors: overlap, token limit, metadata enrichment, filtering, chunk merger, text normalization, deduplication, regex replace
  • Streaming architecture: processes large files in 8KB buffers — minimal memory usage
  • Works with files and strings: setFile() or setText()
  • Fully extensible: implement your own strategies and post-processors
  • Zero framework dependencies

Quick Start

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setFile('document.txt')->chunk(new ParagraphChunkingStrategy()) as $chunk) {
    echo $chunk->getText();       // chunk content
    echo $chunk->getPosition();   // index in the sequence
    print_r($chunk->getMetadata()); // strategy, length, etc.
}

Chunk from a string:

use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setText($myText)->chunk(new SentenceChunkingStrategy()) as $chunk) {
    // ...
}

Chunking Strategies

Strategy | Splits on | Key options
ParagraphChunkingStrategy | Double newlines (\n\n) | (none)
SentenceChunkingStrategy | Sentence-ending punctuation (. ! ?) | (none)
FixedSizeChunkingStrategy | Fixed character count | chunkSize (default: 1000)
DialogueChunkingStrategy | Dialogue lines, context-aware grouping | targetChunkSize, minChunkSize
MarkdownChunkingStrategy | Markdown headers (# to ######) | minHeadingLevel, maxHeadingLevel
WordCountChunkingStrategy | Fixed word count, respects word boundaries | wordCount (default: 200)
RegexChunkingStrategy | Configurable regex pattern | pattern, delimiterPosition (None / Prefix / Suffix)
LineChunkingStrategy | N consecutive lines per chunk | linesPerChunk (default: 10)
RecursiveChunkingStrategy | Cascade of strategies with a size limit | strategies[], maxChunkSize

RecursiveChunkingStrategy applies strategies[0] to the stream, and immediately re-splits any chunk exceeding maxChunkSize using strategies[1], then strategies[2], etc. Streaming-safe — never buffers more than one chunk at a time.
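
A minimal sketch of such a cascade: split on paragraphs first, fall back to sentences, then to a fixed size. The named arguments (strategies, maxChunkSize, chunkSize) mirror the option names in the table above and are assumptions about the exact constructor signatures; check the classes before relying on them.

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\RecursiveChunkingStrategy;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;
use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;
use Ecourty\TextChunker\Strategy\FixedSizeChunkingStrategy;

// Split on paragraphs; any chunk over 1000 characters is re-split on
// sentences, and as a last resort into 500-character pieces.
$strategy = new RecursiveChunkingStrategy(
    strategies: [
        new ParagraphChunkingStrategy(),
        new SentenceChunkingStrategy(),
        new FixedSizeChunkingStrategy(chunkSize: 500),
    ],
    maxChunkSize: 1000,
);

foreach ((new TextChunker())->setFile('document.txt')->chunk($strategy) as $chunk) {
    echo $chunk->getText();
}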

Post-Processors

Post-processors are applied in sequence after chunking. Chain them with withPostProcessor().

Post-processor | Description | Key options
OverlappingChunkPostProcessor | Prepends the tail of the previous chunk for context continuity | overlapSize (default: 200)
TokenLimitPostProcessor | Splits chunks exceeding a token budget | maxTokens, charactersPerToken
MetadataEnricherPostProcessor | Adds chunk_index, total_chunks, word_count, char_count, source | (none)
ChunkFilterPostProcessor | Removes empty or too-short chunks | minLength, removeEmpty
ChunkMergerPostProcessor | Merges consecutive small chunks until minChunkSize is reached | minChunkSize (default: 200), separator
TextNormalizationPostProcessor | Collapses whitespace, trims lines, strips control characters | collapseWhitespace, trimLines, stripControlChars
DeduplicationPostProcessor | Removes duplicate chunks by md5 content hash; adds content_hash metadata | (none)
RegexReplacePostProcessor | Applies ordered [pattern => replacement] substitutions to each chunk's text | replacements[]
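
For example, a pipeline that normalizes whitespace, drops very short chunks, and adds overlap for context continuity. This is a sketch: the PostProcessor namespace, the named constructor options, and the assumption that processors run in the order they are chained are taken from the table and prose above, not verified against the source.

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;
use Ecourty\TextChunker\PostProcessor\TextNormalizationPostProcessor;
use Ecourty\TextChunker\PostProcessor\ChunkFilterPostProcessor;
use Ecourty\TextChunker\PostProcessor\OverlappingChunkPostProcessor;

$chunker = (new TextChunker())
    ->setFile('document.txt')
    ->withPostProcessor(new TextNormalizationPostProcessor())
    ->withPostProcessor(new ChunkFilterPostProcessor(minLength: 20))
    ->withPostProcessor(new OverlappingChunkPostProcessor(overlapSize: 100));

foreach ($chunker->chunk(new SentenceChunkingStrategy()) as $chunk) {
    echo $chunk->getText();
}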

Configuration Reference

TextChunker

Method | Description
setFile(string $path) | Set source file (streamed)
setText(string $text) | Set source string
withMetadata(array $meta) | Attach global metadata to every chunk
withPostProcessor(...) | Add a post-processor to the pipeline
withPostProcessors(...) | Add multiple post-processors at once (variadic)
withReader(ReaderInterface) | Inject a custom reader (see below)
chunk(ChunkingStrategyInterface) | Returns a Generator<Chunk>
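
A sketch combining these methods in a single fluent call (the post-processor namespace and constructor options are assumptions, as noted above):

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\MarkdownChunkingStrategy;
use Ecourty\TextChunker\PostProcessor\MetadataEnricherPostProcessor;
use Ecourty\TextChunker\PostProcessor\DeduplicationPostProcessor;

$chunks = (new TextChunker())
    ->setFile('handbook.md')
    ->withMetadata(['source' => 'handbook.md', 'lang' => 'en']) // attached to every chunk
    ->withPostProcessors(
        new MetadataEnricherPostProcessor(),
        new DeduplicationPostProcessor(),
    )
    ->chunk(new MarkdownChunkingStrategy());

foreach ($chunks as $chunk) {
    print_r($chunk->getMetadata());
}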

Chunk

Method | Returns
getText() | string — the chunk content
getPosition() | int — index in the sequence
getMetadata() | array — associated metadata
getLength() | int — character count
withMetadata(array) | New Chunk with merged metadata
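
For example (a short sketch; the metadata key added here is purely illustrative):

use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

foreach ((new TextChunker())->setText($myText)->chunk(new ParagraphChunkingStrategy()) as $chunk) {
    if ($chunk->getLength() < 50) {
        continue; // skip very short chunks
    }

    // withMetadata() returns a new Chunk with the extra keys merged
    // into the existing metadata.
    $reviewed = $chunk->withMetadata(['reviewed' => false]);
    print_r($reviewed->getMetadata());
}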

Custom Readers

By default, setFile() reads from the local filesystem via LocalFileReader. To read from a remote source (S3, Azure Blob, SFTP, etc.), implement ReaderInterface and inject it via withReader().

ReaderInterface has a single method: readChunks(string $path, int $bufferSize): \Generator<string>. Yield string chunks of arbitrary size — the chunking strategies handle the rest. The $path passed to readChunks() is whatever string you gave to setFile(), so it can be an S3 key, a URI, or any identifier your reader understands.

Example with Flysystem (works with S3, Azure, SFTP, GCS, and more):

use League\Flysystem\Filesystem;
use Ecourty\TextChunker\Contract\ReaderInterface;
use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

class FlysystemReader implements ReaderInterface
{
    public function __construct(private Filesystem $filesystem) {}

    public function readChunks(string $path, int $bufferSize): \Generator
    {
        // Open a read stream on the remote source and yield it in
        // $bufferSize slices, so the file is never fully loaded in memory.
        $stream = $this->filesystem->readStream($path);

        try {
            while (!feof($stream)) {
                $data = fread($stream, $bufferSize);
                if ($data === false) {
                    break;
                }
                yield $data;
            }
        } finally {
            fclose($stream);
        }
    }
}

// S3 example
$adapter = new \League\Flysystem\AwsS3V3\AwsS3V3Adapter($s3Client, 'my-bucket');
$filesystem = new Filesystem($adapter);

foreach (
    (new TextChunker())
        ->withReader(new FlysystemReader($filesystem))
        ->setFile('documents/report.txt')  // S3 key
        ->chunk(new ParagraphChunkingStrategy())
    as $chunk
) {
    echo $chunk->getText();
}

Performance

Benchmarked with PHPBench on real-world datasets (Bible KJV, Les Misérables, Encyclopaedia Britannica 11th Ed.). See BENCHMARKS.md for the full results.

Strategy throughput (Bible KJV, 4.26 MB):

Strategy | Time | Throughput
SentenceChunkingStrategy | 43 ms | ~98 MB/s
FixedSizeChunkingStrategy | 44 ms | ~98 MB/s
LineChunkingStrategy | 46 ms | ~93 MB/s
ParagraphChunkingStrategy | 293 ms | ~15 MB/s
WordCountChunkingStrategy | 377 ms | ~11 MB/s

Post-processor overhead (50 KB excerpt): all 8 processors run in < 3 ms. Chain freely.

The library is streaming-first — most strategies hold only ~2 MB in memory regardless of input file size.

Datasets

The datasets/ directory contains large text corpora used for benchmarking chunking strategies. All texts are public domain sourced from Project Gutenberg.

File | Source | Size | Notes
bible_kjv.txt | King James Bible (PG #10) | ~4.5 MB | Great for sentence and paragraph benchmarks
les_miserables.txt | Les Misérables by Victor Hugo (PG #17489–17496) | ~2.6 MB | All 5 tomes in French, ideal for paragraph chunking
britannica/ | Encyclopaedia Britannica, 11th Edition | ~118 MB | 92 volumes of dense encyclopaedic text

Headers and Project Gutenberg license preambles can be stripped before benchmarking so that only the clean text is measured.
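
One way to do that before chunking, using only PHP built-ins (a rough sketch; the exact marker wording varies between Project Gutenberg releases):

use Ecourty\TextChunker\TextChunker;

$raw = file_get_contents('datasets/bible_kjv.txt');

// Keep only the text between the "*** START OF ... ***" and "*** END OF ..."
// markers that wrap Project Gutenberg releases.
if (preg_match('/\*{3} START OF.*?\*{3}(.*)\*{3} END OF/s', $raw, $matches)) {
    $raw = trim($matches[1]);
}

$chunker = (new TextChunker())->setText($raw);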

Development

# Install dependencies
composer install

# Run tests
composer test

# Run PHPStan (level max)
composer phpstan

# Run CS fixer
composer cs-fix

# Run all checks
composer qa

Extending the library

Implement ChunkingStrategyInterface to create a custom strategy, or ChunkPostProcessorInterface for a custom post-processor. See AGENTS.md for detailed guidelines.