README

DocOcr is a lightweight, pipeline-based PHP library that turns PDF, CSV, and XLSX documents into structured data using Mistral's OCR API.

Designed for:

Ingestion pipelines
AI preprocessing
Finance / accounting docs
Backend automation

Features

Supports PDF, CSV, XLSX
Normalization layer for OCR-friendly input
OCR powered by Mistral AI
Extracts structured content (pages, text, tables)
Fluent pipeline API (normalize → ocr → toArray)
PHPUnit automated testing
Custom OCR client injection

Why DocOcr?

Most OCR libraries return raw text blobs. DocOcr focuses on pipeline-friendly, structured extraction designed for backend systems, AI preprocessing, and financial workflows.

It handles:

File normalization (CSV/XLSX → OCR-friendly layout)
OCR execution
Predictable output for downstream processing

Installation

Via Composer (recommended)

composer require darlanschmeller/doc-ocr-php

Include in your project:

require __DIR__ . '/vendor/autoload.php';

use DocOcr\Document;

From source (for development only)

git clone https://github.com/DarlanSchmeller/doc-ocr-php.git

Include in your project:

require __DIR__ . '/src/Document.php';

use DocOcr\Document;

Configuration

Set your Mistral API key in your .env file:

MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # (OPTIONAL) default included
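
If you want to fail fast on a missing key before the pipeline runs, a small guard like the one below may help. This is only a sketch: `readOcrConfig` is a hypothetical helper, not part of DocOcr, and it assumes the variables are already available as an array (for example from `getenv()` or `$_ENV` after your dotenv loader has run).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): validates the OCR settings
 * found in an environment array such as getenv() or $_ENV.
 *
 * @param array<string, string> $env
 * @return array{apiKey: string, endpoint: ?string}
 */
function readOcrConfig(array $env): array
{
    $apiKey = trim($env['MISTRAL_API_KEY'] ?? '');
    if ($apiKey === '') {
        throw new RuntimeException('MISTRAL_API_KEY is not set.');
    }

    // The endpoint is optional; null here means "leave the library default in place".
    $endpoint = trim($env['MISTRAL_OCR_ENDPOINT'] ?? '');

    return [
        'apiKey'   => $apiKey,
        'endpoint' => $endpoint === '' ? null : $endpoint,
    ];
}

// Example with an in-memory array instead of the real environment.
$config = readOcrConfig(['MISTRAL_API_KEY' => 'your_api_key_here']);
```

Passing the environment in as an array keeps the check easy to test without touching real environment variables.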

Usage

Basic Usage

$ocr = Document::from(__DIR__ . '<your_file_path>')
    ->normalize()
    ->ocr()
    ->toArray();

$ocrResult = $ocr->getResult();

Injecting your own client instance

To use a different API key or a custom OCR client, inject the client like this:

// MistralOcrClient and OcrClient must be imported with the matching use statements.
$client = new MistralOcrClient(new OcrClient('<your_mistral_api_key>'));

$ocr = Document::fromWithClient(__DIR__ . '<your_file_path>', $client)
    ->normalize()
    ->ocr()
    ->toArray();

Pipeline Stages

normalize()
- Converts CSV and XLSX files into OCR-friendly layouts
- Reads PDFs and images as-is
ocr()
- Sends the document to Mistral OCR
- Stores the raw OCR response
toArray()
- Decodes the OCR JSON response into a PHP array

All pipeline stages are idempotent and safe to call multiple times.
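
To illustrate what idempotent stages in a fluent pipeline can look like, here is a minimal sketch. It is not DocOcr's implementation: `SketchDocument`, its `ocrCalls` counter, and the fake OCR closure are all hypothetical stand-ins. Having each stage run the previous one on demand is a design choice of this sketch, not something the README states about the library.

```php
<?php
declare(strict_types=1);

// Illustrative stand-in, NOT DocOcr's real Document class.
final class SketchDocument
{
    private ?string $normalized = null;
    private ?string $rawJson = null;
    private ?array $result = null;
    public int $ocrCalls = 0; // counts how often the OCR client is really invoked

    public function __construct(private string $input, private \Closure $ocrClient)
    {
    }

    public function normalize(): self
    {
        // Runs once; later calls reuse the cached value.
        $this->normalized ??= trim($this->input);
        return $this;
    }

    public function ocr(): self
    {
        $this->normalize();
        if ($this->rawJson === null) {
            $this->ocrCalls++;
            $this->rawJson = ($this->ocrClient)($this->normalized);
        }
        return $this;
    }

    public function toArray(): self
    {
        $this->ocr();
        $this->result ??= json_decode($this->rawJson, true, 512, JSON_THROW_ON_ERROR);
        return $this;
    }

    public function getResult(): ?array
    {
        return $this->result;
    }
}

// Fake client so the sketch runs without network access.
$fakeOcr = fn (string $text): string => json_encode(
    ['pages' => [['index' => 0, 'markdown' => $text]]]
);

$doc = (new SketchDocument('  Total: $39.60  ', $fakeOcr))
    ->normalize()
    ->ocr()
    ->ocr()      // repeated call reuses the stored response
    ->toArray();

$result = $doc->getResult();
```

Because every stage caches its output, calling a stage twice does not repeat the work. For an OCR call that costs a network round trip, that is the property that matters most.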

Output Example

[
  'pages' => [
    [
      'index' => 0,
      'markdown' => '
        Invoice Number: #20130304
        ATTENTION TO: Denny Gunawan
        221 Queen St, Melbourne 3000
        Total: $39.60
      ',
      'images' => [],
      'tables' => [
        [
          'id' => 'tbl-0.html',
          'format' => 'html',
          'content' => '
            Organic Items | Price/kg | Quantity | Subtotal
            Apple         | $5.00    | 1        | $5.00
            Orange        | $1.99    | 2        | $3.98
          '
        ]
      ]
    ]
  ]
]
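
Downstream code usually needs to walk this structure. The sketch below assumes the result has the shape shown in the example above (`pages`, each with `index`, `markdown`, and `tables`). `summarizeOcrResult` is a hypothetical helper, not part of DocOcr.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: flattens an OCR result shaped like the example above
 * into one text string plus a flat list of tables.
 *
 * @return array{text: string, tables: list<array>}
 */
function summarizeOcrResult(array $result): array
{
    $text = [];
    $tables = [];

    foreach ($result['pages'] ?? [] as $page) {
        $text[] = trim((string) ($page['markdown'] ?? ''));

        foreach ($page['tables'] ?? [] as $table) {
            // Keep the page index so each table can be traced back to its page.
            $tables[] = ['page' => $page['index'] ?? null] + $table;
        }
    }

    return ['text' => implode("\n\n", $text), 'tables' => $tables];
}

// Reduced sample in the same shape as the output example.
$sample = [
    'pages' => [[
        'index'    => 0,
        'markdown' => 'Invoice Number: #20130304',
        'images'   => [],
        'tables'   => [[
            'id'      => 'tbl-0.html',
            'format'  => 'html',
            'content' => 'Apple | $5.00 | 1 | $5.00',
        ]],
    ]],
];

$summary = summarizeOcrResult($sample);
```

The `??` defaults keep the helper tolerant of pages that have no tables or no markdown.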

Supported Formats

Format	Normalized	OCR
PDF	✅	✅
CSV	✅	✅
XLSX	✅	✅
Images (png, jpg, webp)	⏭ skipped	✅

Run automated tests

./vendor/bin/phpunit tests

OCR tests are skipped automatically if MISTRAL_API_KEY is not set.
