darlanschmeller / doc-ocr-php
Document OCR and ingestion pipeline for PHP applications, powered by Mistral AI.
Requires
- php: ^8.1
- phpoffice/phpspreadsheet: ^5.4
Requires (Dev)
- phpunit/phpunit: ^11
README
DocOcr is a lightweight, pipeline-based PHP library that turns documents such as PDF, CSV, and XLSX files into structured data using Mistral's OCR API.
Designed for:
- Ingestion pipelines
- AI preprocessing
- Finance / accounting docs
- Backend automation
Features
- Supports PDF, CSV, XLSX
- Normalization layer for OCR-friendly input
- OCR powered by Mistral AI
- Extracts structured content (pages, text, tables)
- Fluent pipeline API (normalize → ocr → toArray)
- PHPUnit automated testing
- Custom OCR client injection
Why DocOcr?
Most OCR libraries return raw text blobs. DocOcr focuses on pipeline-friendly, structured extraction designed for backend systems, AI preprocessing, and financial workflows.
It handles:
- File normalization (CSV/XLSX → OCR-friendly layout)
- OCR execution
- Predictable output for downstream processing
Installation
Via Composer (recommended)
composer require darlanschmeller/doc-ocr-php
Include in your project:
require __DIR__ . '/vendor/autoload.php';

use DocOcr\Document;
From source (for development only)
git clone https://github.com/DarlanSchmeller/doc-ocr-php.git
Include in your project:
require __DIR__ . '/src/Document.php';

use DocOcr\Document;
Configuration
Set your Mistral API key in your .env file:
MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # (OPTIONAL) default included
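How the library loads these values is not shown here. As a minimal sketch, assuming the variables are already present in the process environment (for example, exported by a dotenv loader), a hypothetical helper such as `readMistralConfig()` could validate them before the pipeline runs. The helper name and the fallback endpoint URL are assumptions for illustration, not the library's own code or necessarily its built-in default.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): reads the two variables above
 * from the process environment and fails early if the key is missing.
 *
 * @return array{api_key: string, endpoint: string}
 */
function readMistralConfig(): array
{
    $apiKey = getenv('MISTRAL_API_KEY');
    if ($apiKey === false || trim($apiKey) === '') {
        throw new RuntimeException('MISTRAL_API_KEY is not set.');
    }

    // Assumed fallback for this sketch; the library ships its own default.
    $endpoint = getenv('MISTRAL_OCR_ENDPOINT');
    if ($endpoint === false || trim($endpoint) === '') {
        $endpoint = 'https://api.mistral.ai/v1/ocr';
    }

    return ['api_key' => trim($apiKey), 'endpoint' => trim($endpoint)];
}
```

Failing early on a missing key keeps configuration errors separate from OCR request errors.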
Usage
Basic Usage
$ocr = Document::from(__DIR__ . '<your_file_path>')
    ->normalize()
    ->ocr()
    ->toArray();

$ocrResult = $ocr->getResult();
Injecting your own client instance
To use a different API key or a custom OCR client, inject the client like this:
$client = new MistralOcrClient(new OcrClient('<your_mistral_api_key>'));

return Document::fromWithClient(__DIR__ . $fixture, $client)
    ->normalize()
    ->ocr()
    ->toArray();
Pipeline Stages
- normalize()
  - Converts CSV and XLSX files into OCR-friendly layouts
  - Reads PDFs and images as-is
- ocr()
  - Sends the document to Mistral OCR
  - Stores the raw OCR response
- toArray()
  - Decodes the OCR JSON response into a PHP array
All pipeline stages are idempotent and safe to call multiple times.
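One way to get idempotent stages is to cache each stage's result and return early on repeat calls. The sketch below is a toy illustration of that idea, not DocOcr's actual source: the class `SketchPipeline`, its `ocrCalls` counter, and the closure standing in for an injected OCR client are all hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Toy pipeline (hypothetical, not DocOcr's implementation): every stage
 * caches its result, so calling it again does no extra work.
 */
final class SketchPipeline
{
    private ?string $normalized = null;
    private ?string $rawJson = null;
    private ?array $result = null;
    public int $ocrCalls = 0;

    // $ocrClient is a stand-in for an injected OCR client: text in, JSON out.
    public function __construct(private string $input, private Closure $ocrClient)
    {
    }

    public function normalize(): self
    {
        // Normalize only once; later calls reuse the cached value.
        $this->normalized ??= trim($this->input);
        return $this;
    }

    public function ocr(): self
    {
        $this->normalize(); // sketch choice: make sure the earlier stage ran
        if ($this->rawJson === null) {
            $this->ocrCalls++;
            $this->rawJson = ($this->ocrClient)($this->normalized);
        }
        return $this;
    }

    public function toArray(): self
    {
        $this->ocr();
        $this->result ??= json_decode($this->rawJson, true, 512, JSON_THROW_ON_ERROR);
        return $this;
    }

    public function getResult(): ?array
    {
        return $this->result;
    }
}

// Fake client so the sketch runs without network access.
$fakeClient = fn (string $text): string =>
    json_encode(['pages' => [['index' => 0, 'markdown' => $text]]]);

$pipeline = (new SketchPipeline('  Total: 39.60  ', $fakeClient))
    ->normalize()->normalize()
    ->ocr()->ocr()
    ->toArray()->toArray();
```

In this sketch the fake client is invoked only once even though `ocr()` appears twice in the chain, which is the property that makes repeated calls safe when each OCR request costs money.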
Output Example
[
    'pages' => [
        [
            'index' => 0,
            'markdown' => '
                Invoice Number: #20130304
                ATTENTION TO: Denny Gunawan
                221 Queen St, Melbourne 3000
                Total: $39.60
            ',
            'images' => [],
            'tables' => [
                [
                    'id' => 'tbl-0.html',
                    'format' => 'html',
                    'content' => '
                        Organic Items | Price/kg | Quantity | Subtotal
                        Apple | $5.00 | 1 | $5.00
                        Orange | $1.99 | 2 | $3.98
                    '
                ]
            ]
        ]
    ]
]
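Downstream code usually needs to walk this structure page by page. As a rough sketch, assuming the result always follows the shape shown above, a hypothetical helper `summarizePages()` (not part of the library) could reduce it to one summary row per page. The sample array here is abbreviated from the example output.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: flattens the array shape shown above into one
 * summary row per page. Missing keys are treated as empty.
 *
 * @return list<array{page: int, text: string, table_ids: list<string>}>
 */
function summarizePages(array $ocrResult): array
{
    $summary = [];
    foreach ($ocrResult['pages'] ?? [] as $page) {
        $summary[] = [
            'page' => (int) ($page['index'] ?? 0),
            'text' => trim((string) ($page['markdown'] ?? '')),
            // Collect only the table identifiers, in document order.
            'table_ids' => array_values(array_column($page['tables'] ?? [], 'id')),
        ];
    }
    return $summary;
}

// Abbreviated sample in the same shape as the example output.
$sample = [
    'pages' => [[
        'index' => 0,
        'markdown' => ' Invoice Number: #20130304 Total: $39.60 ',
        'images' => [],
        'tables' => [['id' => 'tbl-0.html', 'format' => 'html', 'content' => '...']],
    ]],
];

$rows = summarizePages($sample);
```

Defaulting missing keys to empty values keeps the loop from failing on pages that contain no tables.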
📂 Supported Formats
| Format | Normalized | OCR |
|---|---|---|
| PDF | ✅ | ✅ |
| CSV | ✅ | ✅ |
| XLSX | ✅ | ✅ |
| Images (png, jpg, webp) | ⏭ skipped | ✅ |
Run automated tests
./vendor/bin/phpunit tests
OCR tests are skipped automatically if MISTRAL_API_KEY is not set.