labrodev / document-sampler
Extracts a structured representative sample from long documents for downstream AI processing.
Requires
- php: ^8.5
Requires (Dev)
- pestphp/pest: ^3.0
- phpstan/phpstan: ^2.0
README
Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.
Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.
Requirements
- PHP
^8.5
Installation
composer require labrodev/document-sampler
Basic usage
use Labrodev\DocumentSampler\DocumentSampler; $result = (new DocumentSampler())->sample($rawText); $result->intro // opening chars — title and introduction $result->outline // extracted section headings from anywhere in the document $result->middle // fixed window centred on the document midpoint $result->tail // closing chars — conclusion and sign-off $result->text // all samples joined with separators $result->charCount // character count of the combined sample $result->originalCharCount // character count of the original document
Custom window sizes
By default each zone uses the window defined on the DocumentPart enum. Pass any subset to the constructor to override:
// Override specific zones — unset zones use the enum defaults $sampler = new DocumentSampler( intro: 2000, middle: 300, ); $result = $sampler->sample($rawText);
How it works
The sampler partitions every document into four fixed-size windows regardless of document length:
| Zone | Default window | What it captures |
|---|---|---|
intro |
1000 chars | Title, abstract, opening paragraphs |
outline |
500 chars | Section headings (# Markdown, 1.1 Numbered, ALL-CAPS lines) |
middle |
500 chars | Window centred on the document midpoint |
tail |
500 chars | Closing paragraphs, conclusion, signature |
Windows are fixed — a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.
Exporting results
JSON
$result->toJson();
{
"meta": {
"originalCharCount": 50000,
"sampledCharCount": 2300
},
"samples": {
"intro": "...",
"outline": "...",
"middle": "...",
"tail": "..."
}
}
Markdown
$result->toMd();
## Document Sample **Original size:** 50,000 chars **Sampled size:** 2,300 chars ### Intro ... ### Outline ... ### Middle ... ### Tail ...
Empty zones are omitted from both outputs.
Default window sizes
Window sizes are defined on the DocumentPart enum and can be read at runtime:
use Labrodev\DocumentSampler\Enums\DocumentPart; DocumentPart::Intro->chars(); // 1000 DocumentPart::Outline->chars(); // 500 DocumentPart::Middle->chars(); // 500 DocumentPart::Tail->chars(); // 500
When to use this
- Before calling an AI API — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
- Relevance checking — feed
$result->textto a classifier to decide whether a document is relevant before processing it in full. - Prompt injection detection — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
- Depersonalisation — run PII detection over a representative sample before deciding whether to redact the full document.
- Document classification — use the outline and intro zones to classify document type without reading the entire file.
Testing
composer test
Static analysis
composer analyse
Author
Petro Lashyn — contact@labrodev.com
License
MIT