darlanschmeller/doc-ocr-php

Document OCR and ingestion pipeline for PHP applications, powered by Mistral AI.

github.com/DarlanSchmeller/doc-ocr-php

pkg:composer/darlanschmeller/doc-ocr-php

v1.0.4 2026-01-23 20:19 UTC

DocOcr is a lightweight, pipeline-based PHP library that turns documents such as PDF, CSV, and XLSX files into structured data using Mistral's OCR API.

Designed for:

  • Ingestion pipelines
  • AI preprocessing
  • Finance / accounting docs
  • Backend automation

Features

  • Supports PDF, CSV, XLSX
  • Normalization layer for OCR-friendly input
  • OCR powered by Mistral AI
  • Extracts structured content (pages, text, tables)
  • Fluent pipeline API (normalize → ocr → toArray)
  • PHPUnit automated testing
  • Custom OCR client injection

Why DocOcr?

Most OCR libraries return raw text blobs. DocOcr focuses on pipeline-friendly, structured extraction designed for backend systems, AI preprocessing, and financial workflows.

It handles:

  • File normalization (CSV/XLSX → OCR-friendly layout)
  • OCR execution
  • Predictable output for downstream processing

Installation

Via Composer (recommended)

composer require darlanschmeller/doc-ocr-php

Include in your project:

require __DIR__ . '/vendor/autoload.php';

use DocOcr\Document;

From source (for development only)

git clone https://github.com/DarlanSchmeller/doc-ocr-php.git

Include in your project:

require __DIR__ . '/src/Document.php';

use DocOcr\Document;

Configuration

Set your Mistral API key in your .env file:

MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # optional; a default endpoint is included
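
The README does not say how the .env file gets loaded, so it can be useful to fail fast when the key is missing. The following is a minimal sketch, assuming the variables end up in the process environment; `requireEnv` and `optionalEnv` are hypothetical helpers for illustration and are not part of DocOcr.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): return a required environment
 * variable, or throw if it is unset or blank.
 */
function requireEnv(string $name): string
{
    $value = getenv($name);
    if ($value === false || trim($value) === '') {
        throw new RuntimeException("Environment variable {$name} is not set.");
    }
    return $value;
}

/**
 * Hypothetical helper (not part of DocOcr): return an optional environment
 * variable, or null so that a built-in default can apply.
 */
function optionalEnv(string $name): ?string
{
    $value = getenv($name);
    return ($value === false || trim($value) === '') ? null : $value;
}
```

Calling `requireEnv('MISTRAL_API_KEY')` before building the pipeline surfaces a missing key as a clear exception instead of a failed OCR request.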

Usage

Basic Usage

$ocr = Document::from(__DIR__ . '<your_file_path>')
    ->normalize()
    ->ocr()
    ->toArray();

$ocrResult = $ocr->getResult();

Injecting your own client instance

To use a different API key or a custom OCR client, inject it like this:

$client = new MistralOcrClient(new OcrClient('<your_mistral_api_key>'));

$ocr = Document::fromWithClient(__DIR__ . '<your_file_path>', $client)
    ->normalize()
    ->ocr()
    ->toArray();

Pipeline Stages

  1. normalize()

    • Converts CSV and XLSX files into OCR-friendly layouts
    • Reads PDFs and images as-is
  2. ocr()

    • Sends the document to Mistral OCR
    • Stores the raw OCR response
  3. toArray()

    • Decodes the OCR JSON response into a PHP array

All pipeline stages are idempotent and safe to call multiple times.
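
To illustrate what idempotent stages mean in a fluent pipeline, here is a toy sketch (PHP 8.0+). `ToyPipeline` and everything inside it are hypothetical and do not reflect DocOcr's actual implementation; the fake OCR step only stands in for a remote call. The point is that each stage caches its work, so repeating a call changes nothing.

```php
<?php
declare(strict_types=1);

// Hypothetical toy class for illustration only; not DocOcr's implementation.
final class ToyPipeline
{
    private ?string $normalized = null;
    private ?string $rawJson = null;
    private ?array $result = null;

    /** Counts how often the (fake) OCR backend was actually invoked. */
    public int $ocrCalls = 0;

    public function __construct(private string $input)
    {
    }

    public function normalize(): static
    {
        // Only computed on the first call; later calls reuse the cached value.
        $this->normalized ??= trim($this->input);
        return $this;
    }

    public function ocr(): static
    {
        if ($this->rawJson === null) {
            $this->normalize();
            $this->ocrCalls++;
            // Stand-in for a remote OCR request: wrap the text in a JSON payload.
            $this->rawJson = json_encode(
                ['pages' => [['index' => 0, 'markdown' => $this->normalized]]],
                JSON_THROW_ON_ERROR
            );
        }
        return $this;
    }

    public function toArray(): static
    {
        $this->ocr();
        // Decode once and keep the decoded array.
        $this->result ??= json_decode($this->rawJson, true, 512, JSON_THROW_ON_ERROR);
        return $this;
    }

    public function getResult(): array
    {
        return $this->toArray()->result;
    }
}

// Repeating stages does not trigger a second OCR call.
$pipeline = (new ToyPipeline('  hello  '))
    ->normalize()->normalize()
    ->ocr()->ocr()
    ->toArray()->toArray();
```

Caching the raw response is what keeps a repeated `ocr()` call from issuing a second, billable request in this sketch.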

Output Example

[
  'pages' => [
    [
      'index' => 0,
      'markdown' => '
        Invoice Number: #20130304
        ATTENTION TO: Denny Gunawan
        221 Queen St, Melbourne 3000
        Total: $39.60
      ',
      'images' => [],
      'tables' => [
        [
          'id' => 'tbl-0.html',
          'format' => 'html',
          'content' => '
            Organic Items | Price/kg | Quantity | Subtotal
            Apple         | $5.00    | 1        | $5.00
            Orange        | $1.99    | 2        | $3.98
          '
        ]
      ]
    ]
  ]
]
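
Given the shape above, downstream code might reduce each page to its text and a table count. This is a minimal sketch: `$result` is a trimmed copy of the example, and `summarizePages` is a hypothetical helper, not a DocOcr function. Keys are read defensively because the README does not state which keys are always present.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of DocOcr): reduce an OCR result array,
 * shaped like the output example, to one summary per page.
 *
 * @return array<int, array{index: int, text: string, tableCount: int}>
 */
function summarizePages(array $result): array
{
    $summaries = [];
    foreach ($result['pages'] ?? [] as $page) {
        $summaries[] = [
            'index'      => (int) ($page['index'] ?? 0),
            'text'       => trim((string) ($page['markdown'] ?? '')),
            'tableCount' => count($page['tables'] ?? []),
        ];
    }
    return $summaries;
}

// Trimmed copy of the output example, used as sample input.
$result = [
    'pages' => [
        [
            'index'    => 0,
            'markdown' => implode("\n", [
                'Invoice Number: #20130304',
                'Total: $39.60',
            ]),
            'images'   => [],
            'tables'   => [
                ['id' => 'tbl-0.html', 'format' => 'html', 'content' => '...'],
            ],
        ],
    ],
];

$summaries = summarizePages($result);
```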

📂 Supported Formats

Format                  | Normalized | OCR
PDF                     | as-is      | yes
CSV                     | yes        | yes
XLSX                    | yes        | yes
Images (png, jpg, webp) | ⏭ skipped  | yes

Run automated tests

./vendor/bin/phpunit tests

OCR tests are skipped automatically if MISTRAL_API_KEY is not set.