README

Parsel provides an expressive PHP API for parsing PDFs, Office documents, and images. Your documents are processed locally, allowing you to extract plain text, structured page data, coordinates, and screenshots without sending files to an external service.

Parsel may return plain text, structured document data, page screenshots, or one page at a time for larger documents. It is designed to feel natural in PHP applications while still giving you access to advanced parser options when you need them.

use Shipfastlabs\Parsel;

$text = Parsel::file('invoice.pdf')->text();

$document = Parsel::file('invoice.pdf')
    ->pageRange(1, 5)
    ->withOcr(language: 'eng')
    ->withDpi(300)
    ->parse();

foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo "{$item->text} @ ({$item->x}, {$item->y})\n";
    }
}

Parsel requires PHP 8.4 or greater and the lit binary.

Installation

You may install Parsel via Composer:

composer require shipfastlabs/parsel

Install the required lit binary:

vendor/bin/parsel-install-lit

For Office documents, spreadsheets, presentations, and images, you may also install the system dependencies:

vendor/bin/parsel-install-lit --with-system-dependencies

You may choose the installer used for the lit binary:

vendor/bin/parsel-install-lit --manager=npm
vendor/bin/parsel-install-lit --manager=pnpm
vendor/bin/parsel-install-lit --manager=bun
vendor/bin/parsel-install-lit --manager=pip
vendor/bin/parsel-install-lit --manager=cargo

You may also install lit yourself using one of the following commands:

cargo install liteparse
pip install liteparse
npm i -g @llamaindex/liteparse
pnpm add -g @llamaindex/liteparse
bun add -g @llamaindex/liteparse

If you prefer to install the system dependencies yourself, use the matching commands for your platform:

# macOS
brew install --cask libreoffice # For Office Document
brew install imagemagick # For Images

# Ubuntu / Debian
apt-get install libreoffice # For Office Document
apt-get install imagemagick # For Images

# Windows
choco install libreoffice-fresh # For Office Document
choco install imagemagick.app # For Images

LibreOffice is used for Office document conversion, ImageMagick is used for image conversion, and OCR support uses Tesseract through liteparse.

Parsing Files

The file method creates a parser instance for a document that already exists on disk. Once a source has been selected, you may choose how the parsed output should be returned.

use Shipfastlabs\Parsel;

$text = Parsel::file('/path/to/report.pdf')->text();

You may also parse raw bytes. This is useful when working with uploaded files, database blobs, or documents that have not been persisted to disk. Because byte sources do not include a filename, you should provide the file extension.

$document = Parsel::bytes($uploadedBytes, 'pdf')->parse();

Parsel may be used with PDFs, Word documents, spreadsheets, presentations, and images. The same fluent API is used for each supported file type:

$text = Parsel::file('contract.docx')->text();

$rows = Parsel::file('report.xlsx')->text();

$slides = Parsel::file('deck.pptx')->text();

$scan = Parsel::file('receipt.png')
    ->withOcr()
    ->text();

$photo = Parsel::file('invoice.jpg')
    ->withOcr(language: 'eng')
    ->parse();

Plain Text

The text method returns the parsed document text as a string. Parsel removes page header markers from text output before returning it.

$text = Parsel::file('document.pdf')
    ->withoutOcr()
    ->text();

Structured Documents

The parse method returns a Document object containing the document text, metadata, pages, and positioned text items. This is useful when you need coordinates, font information, or OCR confidence values.

$document = Parsel::file('document.pdf')->parse();

echo $document->text;
echo $document->pageCount();

foreach ($document->pages as $page) {
    foreach ($page->items as $item) {
        echo $item->text;
    }
}

The document may also be returned as an array:

$array = Parsel::file('document.pdf')->toArray();

Page Selection

The page, pages, and pageRange methods may be used to limit parsing to specific pages. These methods are additive, so you may combine multiple calls before parsing the document.

Parsel::file('document.pdf')->page(7);

Parsel::file('document.pdf')->pages(1, 3, 5);

Parsel::file('document.pdf')->pages('1-5', 10);

Parsel::file('document.pdf')->pageRange(1, 5);

Parsel::file('document.pdf')->pageRange(1, 5)->page(10);

If you only need to limit how many pages are parsed, you may use the maxPages method:

Parsel::file('document.pdf')->maxPages(50)->text();

OCR

OCR is disabled by default so that parsing remains fast and predictable. You may enable OCR using the withOcr method:

$text = Parsel::file('scan.pdf')->withOcr()->text();

The withOcr method accepts named arguments for the most common OCR options:

$text = Parsel::file('scan.pdf')
    ->withOcr(
        language: 'fra',
        tessdataPath: '/usr/share/tessdata',
        serverUrl: 'http://localhost:8828/ocr',
        workers: 8,
    )
    ->text();

If you would like to be explicit that OCR should not be used, you may call withoutOcr:

$text = Parsel::file('document.pdf')->withoutOcr()->text();

Rendering Options

Parsel includes convenience methods for common parser options such as rendering DPI, small text preservation, encrypted documents, and per-call process settings.

Parsel::file('document.pdf')->withDpi(300);

Parsel::file('document.pdf')->preserveSmallText();

Parsel::file('secret.pdf')->withPassword('hunter2');

Parsel::file('document.pdf')->withBinary('/usr/local/bin/lit');

Parsel::file('document.pdf')->withTimeout(120);

Additional Options

If you need to pass a flag that Parsel does not yet provide as a dedicated method, you may pass the option directly using the option method. Boolean options may be passed without a value, while options that require a value may receive one as the second argument.

Parsel::file('document.pdf')->option('some-new-flag');

Parsel::file('document.pdf')->option('some-new-flag', 42);

Saving Output

The save method writes parsed output to disk and returns the path that was written. When the target path ends in .json, Parsel will write JSON output. For all other extensions, Parsel will write plain text.

Parsel::file('document.pdf')->save('document.txt');

Parsel::file('document.pdf')->save('document.json');

Screenshots

You may use the screenshots method to render page screenshots into a directory. The method returns the image files found in the output directory after parsing has finished.

$screenshots = Parsel::file('document.pdf')
    ->pageRange(1, 5)
    ->screenshots('/tmp/parsel-pages');

For predictable results, you should pass a dedicated output directory that does not contain unrelated files.

Streaming Large Documents

The parse and toArray methods load the parsed document into memory. When working with large documents, you may use lazyPages to process one page at a time.

foreach (Parsel::file('large-document.pdf')->lazyPages() as $page) {
    foreach ($page->items as $item) {
        // Process one page at a time...
    }
}

This allows Parsel to read pages incrementally instead of keeping the full document in memory.

Document Data

A parsed Document contains the full text, metadata, and a list of pages. Each page contains its dimensions, page text, and text items with position data.

$document->text;
$document->metadata;
$document->pages;
$document->pageCount();

$page->number;
$page->width;
$page->height;
$page->text;
$page->items;

$item->text;
$item->x;
$item->y;
$item->width;
$item->height;
$item->fontName;
$item->fontSize;
$item->confidence;

Binary Resolution

When Parsel needs to parse a document, it resolves the lit binary in the following order:

The per-call binary configured with withBinary.
The global binary configured with Parsel::usingBinary.
The PARSEL_LIT_BINARY environment variable.
The lit binary available on the system PATH.

If no binary can be resolved, Parsel will throw a BinaryNotFoundException.

Parsel::usingBinary('/usr/local/bin/lit');

Parsel::defaultTimeout(120);

Testing

Parsel includes a fake runner that allows your tests to exercise parsing code without spawning the real binary. Response keys are matched against the command line as substrings. When multiple responses match, the longest matching key is used.

use Shipfastlabs\Parsel;

$fake = Parsel::fake([
    '--format json' => file_get_contents(__DIR__.'/fixtures/lit-output.json'),
]);

$document = Parsel::file('invoice.pdf')->parse();

expect($fake->recordedCommands()[0])->toContain('--format', 'json');

String responses are returned as successful stdout. If you need control over the exit code or stderr, you may provide a ProcessResult instance instead.

Development

Parsel uses Pint, Rector, PHPStan, and Pest to keep the codebase formatted and well tested.

composer lint
composer test:types
composer test:unit
composer test

The integration tests run against a real parser installation and are skipped when the lit binary is not available:

./vendor/bin/pest --group=integration

Credits

Parsel is maintained by Shipfastlabs and released under the MIT license.

shipfastlabs / parsel

Maintainers

Package info

Fund package maintenance!

Statistics

Security