content-extract/content-processor

Robust PHP library for batch document processing. Extracts content from PDFs/text and generates structured JSON according to user-defined schemas. Now with semantic structuring, OCR support for scanned PDFs, text normalization, and alias-driven field matching. Production-ready, secure, zero unnecess

Package info

github.com/saul9809/content_extract-library

pkg:composer/content-extract/content-processor

1.5.0 2026-04-20 06:29 UTC

This package is auto-updated.

Last update: 2026-05-20 12:29:03 UTC


README

Production-ready PHP library for batch document processing with intelligent content extraction and structuring.

Framework-agnostic, scalable, and optimized for real-world document pipelines from day one.

🎯 Purpose

Process multiple documents (PDFs, text files, images, etc.), extract their content, and convert it into configurable JSON structures ready for bulk loading into databases or services.

Quick Example

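use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\PdfTextExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

// $schema is an ArraySchema instance (see "Define Your Schema" under Quick Start)
$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal();  // Returns a FinalResult object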

📦 Installation

composer require content-extract/content-processor:^1.4.0

Or add to your composer.json:

{
  "require": {
    "content-extract/content-processor": "^1.4.0"
  }
}

🏗️ Project Structure

src/
├── Contracts/              # Interfaces defining the contract
│   ├── ExtractorInterface.php
│   ├── StructurerInterface.php
│   └── SchemaInterface.php
├── Core/                   # Main classes
│   └── ContentProcessor.php
├── Extractors/             # Extractor implementations
│   ├── PdfTextExtractor.php
│   ├── TextFileExtractor.php
│   ├── PdfOcrExtractor.php (v1.5.0+)
│   └── CompositePdfExtractor.php (v1.5.0+)
├── Schemas/                # Schema implementations
│   └── ArraySchema.php
├── Structurers/            # Structurer implementations
│   ├── SimpleLineStructurer.php
│   ├── RuleBasedStructurer.php
│   └── SchemaAwareStructurer.php
├── Utils/                  # Utilities
│   ├── TextNormalizer.php
│   └── TextSegmenter.php
└── Models/                 # Domain models
    ├── Warning.php
    ├── Error.php
    └── FinalResult.php

examples/
├── example_basic.php
├── example_semantic_structuring.php
└── sample_cv_*.txt

⚡ Quick Start

1. Define Your Schema

use ContentProcessor\Schemas\ArraySchema;

$schema = new ArraySchema([
    'name' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['name', 'full name', 'applicant name'],
    ],
    'email' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['email', 'email address'],
    ],
    'experience_years' => [
        'type' => 'integer',
        'required' => false,
        'aliases' => ['years of experience', 'experience'],
    ],
]);

2. Configure the Processor

use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\PdfTextExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/path/to/documents', '*.pdf')
    ->processFinal();

3. Consume Results

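// Report failed documents. isSuccessful() only says that at least one
// document succeeded, so use hasErrors() to detect partial failures.
if ($result->hasErrors()) {
    echo "Some documents failed:\n";
    foreach ($result->errors() as $error) {
        echo "  - " . $error->getMessage() . "\n";
    }
}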

// Process successful data
foreach ($result->data() as $item) {
    echo "Processed: " . $item['document'] . "\n";
    // $item['data'] contains the structured data
    var_dump($item['data']);
}

// Inspect quality warnings
if ($result->hasWarnings()) {
    foreach ($result->warnings() as $warning) {
        echo "⚠️ Field '{$warning->getField()}': {$warning->getMessage()}\n";
    }
}

// Export to JSON
echo $result->toJSONPretty();
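
For bulk loading, one option is to flatten each item into a single JSON line (NDJSON), a format many import tools accept. The sketch below relies only on the item shape shown above (`document` plus `data`). The `toNdjson()` helper and the sample rows are hypothetical and not part of the library.

```php
<?php
// Hypothetical helper: converts items shaped like $result->data() into NDJSON.
function toNdjson(iterable $items): string
{
    $lines = [];
    foreach ($items as $item) {
        // Put the document name first, then the structured fields.
        $row = ['document' => $item['document']] + $item['data'];
        $lines[] = json_encode(
            $row,
            JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
        );
    }
    return $lines === [] ? '' : implode("\n", $lines) . "\n";
}

// Sample rows standing in for $result->data().
$items = [
    ['document' => 'cv_ada.pdf', 'data' => ['name' => 'Ada', 'experience_years' => 7]],
    ['document' => 'cv_alan.pdf', 'data' => ['name' => 'Alan', 'experience_years' => 12]],
];
echo toNdjson($items);
```

If a structured field is itself named `document`, the array union keeps the file name, so rename one of the two before loading.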

🧪 Testing

Run Examples

cd examples
php example_basic.php
php example_semantic_structuring.php

Full Test Suite

composer test

Code Quality

composer lint

🔌 Available Interfaces

ExtractorInterface

interface ExtractorInterface {
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}
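
A custom extractor only needs to implement these three methods. The sketch below shows a hypothetical `HtmlFileExtractor`. The interface is redeclared locally so the snippet runs on its own, and the returned array shape (`source`, `text`) is an assumption, since the shape the library expects is not documented here.

```php
<?php
// Local copy of the interface as listed above, so the sketch runs standalone.
interface ExtractorInterface {
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}

// Hypothetical extractor for .html/.htm files; not shipped with the library.
final class HtmlFileExtractor implements ExtractorInterface
{
    public function canHandle(string $source): bool
    {
        $ext = strtolower(pathinfo($source, PATHINFO_EXTENSION));
        return in_array($ext, ['html', 'htm'], true) && is_file($source);
    }

    public function extract(string $source): array
    {
        if (!$this->canHandle($source)) {
            throw new InvalidArgumentException("Cannot handle: {$source}");
        }
        $html = (string) file_get_contents($source);
        // Turn common block boundaries into newlines before stripping tags.
        $html = (string) preg_replace('~<br\s*/?>|</(p|div|li|h[1-6])>~i', "\n", $html);
        $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

        // Assumed return shape, chosen for illustration only.
        return ['source' => $source, 'text' => trim($text)];
    }

    public function getName(): string
    {
        return 'html-file';
    }
}
```

In a real project you would implement `ContentProcessor\Contracts\ExtractorInterface` instead of the local copy and pass the instance to `withExtractor()`.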

StructurerInterface

interface StructurerInterface {
    public function structure(array $content, SchemaInterface $schema): array;
    public function getName(): string;
}
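
As an illustration of the structurer contract, here is a minimal sketch of a hypothetical `KeyValueStructurer` that reads `Label: value` lines and maps labels to schema fields through their aliases. It assumes that `$content` carries the raw text under a `text` key and that `getDefinition()` returns the same array passed to `ArraySchema`. Neither assumption is confirmed by this README, and the bundled `SchemaAwareStructurer` may work quite differently.

```php
<?php
// Local copies of the interfaces listed in this section, so the sketch runs standalone.
interface SchemaInterface {
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}
interface StructurerInterface {
    public function structure(array $content, SchemaInterface $schema): array;
    public function getName(): string;
}

// Hypothetical structurer: matches "Label: value" lines against field names and aliases.
final class KeyValueStructurer implements StructurerInterface
{
    public function structure(array $content, SchemaInterface $schema): array
    {
        $definition = $schema->getDefinition();
        $result = array_fill_keys(array_keys($definition), null);

        // Assumption: the extracted text lives under $content['text'].
        foreach (preg_split('/\R/', (string) ($content['text'] ?? '')) as $line) {
            if (!str_contains($line, ':')) {
                continue; // not a "Label: value" line
            }
            [$label, $value] = array_map('trim', explode(':', $line, 2));
            foreach ($definition as $field => $rules) {
                $names = array_map('strtolower', array_merge([$field], $rules['aliases'] ?? []));
                if (in_array(strtolower($label), $names, true)) {
                    // Cast to the declared type; only 'integer' is handled in this sketch.
                    $result[$field] = ($rules['type'] ?? 'string') === 'integer'
                        ? (int) $value
                        : $value;
                }
            }
        }
        return $result;
    }

    public function getName(): string
    {
        return 'key-value';
    }
}
```

Fields with no matching line stay `null`, which leaves it to the schema's `validate()` to decide whether a missing value is an error.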

SchemaInterface

interface SchemaInterface {
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}
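
A custom schema can be sketched the same way. The hypothetical `SimpleArraySchema` below checks `required` and `type` rules. It assumes that `validate()` returns a list of error messages (empty when the data is valid), which this README does not specify.

```php
<?php
// Local copy of the interface listed above, so the sketch runs standalone.
interface SchemaInterface {
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}

// Hypothetical schema; the bundled ArraySchema may validate differently.
final class SimpleArraySchema implements SchemaInterface
{
    public function __construct(
        private array $definition,
        private string $name = 'simple'
    ) {
    }

    public function getDefinition(): array
    {
        return $this->definition;
    }

    // Assumed contract: returns a list of error messages, empty when valid.
    public function validate(array $data): array
    {
        $errors = [];
        foreach ($this->definition as $field => $rules) {
            $value = $data[$field] ?? null;
            if ($value === null || $value === '') {
                if ($rules['required'] ?? false) {
                    $errors[] = "Missing required field '{$field}'";
                }
                continue;
            }
            $type = $rules['type'] ?? 'string';
            $ok = match ($type) {
                'string' => is_string($value),
                'integer' => is_int($value),
                default => true, // unknown types are not checked in this sketch
            };
            if (!$ok) {
                $errors[] = "Field '{$field}' must be of type {$type}";
            }
        }
        return $errors;
    }

    public function getName(): string
    {
        return $this->name;
    }
}
```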

📋 Processor Options

$processor->withOptions([
    'skip_invalid' => true,    // Skip documents that fail validation
    'preserve_empty' => false, // Preserve empty fields in result
]);
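
To make the effect of `preserve_empty` concrete, the sketch below shows one plausible reading of the option: with `false`, fields whose value is `null`, an empty string, or an empty array are dropped from each record. The `filterEmptyFields()` helper is hypothetical, and the library's exact definition of "empty" may differ.

```php
<?php
// Hypothetical illustration of 'preserve_empty'; not the library's implementation.
function filterEmptyFields(array $data, bool $preserveEmpty): array
{
    if ($preserveEmpty) {
        return $data;
    }
    // Keep 0 and false: they are values, not missing data.
    return array_filter(
        $data,
        static fn ($value): bool => $value !== null && $value !== '' && $value !== []
    );
}

$record = ['name' => 'Ada', 'email' => '', 'experience_years' => 0, 'phone' => null];
var_dump(filterEmptyFields($record, false));
```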

✅ Implemented Features (Blocks 1-6)

Block 1: Core ✅

  • Framework-agnostic design with clean interfaces
  • Extractor/Structurer pattern
  • JSON schema validation
  • Batch processing

Block 2: PDF Support ✅

  • PdfTextExtractor with smalot/pdfparser
  • Batch processing with multiple PDFs
  • Robust error handling

Block 3: Semantic Structuring ✅

  • SchemaAwareStructurer for intelligent extraction
  • Field aliases for semantic guidance
  • Text normalization and segmentation
  • Advanced warning system
  • Type conversion and validation
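
The normalization and segmentation step can be pictured with a small sketch. The two functions below are hypothetical stand-ins, not the library's `TextNormalizer` or `TextSegmenter`: they unify line endings, collapse runs of spaces and tabs, and split the text into non-empty lines.

```php
<?php
// Hypothetical normalizer: unify newlines, collapse horizontal whitespace,
// and limit consecutive blank lines to one.
function normalizeText(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    $text = preg_replace('/[ \t\x{00A0}]+/u', ' ', $text) ?? $text; // includes non-breaking spaces
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;         // trim around line breaks
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;
    return trim($text);
}

// Hypothetical segmenter: one segment per non-empty line.
function segmentLines(string $text): array
{
    return array_values(array_filter(
        explode("\n", $text),
        static fn (string $line): bool => $line !== ''
    ));
}

$raw = "Name:\t Ada \r\n\r\n\r\n\r\nEmail:  ada@example.com  ";
print_r(segmentLines(normalizeText($raw)));
```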

Block 4: Final Result API ✅

  • Unified FinalResult object
  • Error and warning normalization
  • Summary with statistics
  • JSON export and serialization

Block 5: Security & Hardening ✅

  • File size limits (10 MB default)
  • Batch document limits (50 documents default)
  • Path traversal protection
  • Configurable security validation
  • Production-ready defaults
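
The kinds of checks listed above can be sketched in plain PHP. The helpers below are hypothetical and only illustrate the idea: resolve each path with `realpath()` and reject anything outside the base directory, then enforce the batch and file-size limits. The default values mirror the ones listed above, but the library's own validation code may be structured differently.

```php
<?php
// Hypothetical helpers; not the library's implementation.

/** Resolves $path and ensures it stays inside $baseDir (path traversal guard). */
function resolveInsideBase(string $path, string $baseDir): string
{
    $base = realpath($baseDir);
    $real = realpath($path);
    if ($base === false || $real === false) {
        throw new RuntimeException("Path does not exist: {$path}");
    }
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if (!str_starts_with($real, $prefix)) {
        throw new RuntimeException("Path escapes base directory: {$path}");
    }
    return $real;
}

/** Enforces batch-size and per-file size limits. */
function checkLimits(array $files, int $maxDocs = 50, int $maxBytes = 10 * 1024 * 1024): void
{
    if (count($files) > $maxDocs) {
        throw new LengthException('Too many documents in batch');
    }
    foreach ($files as $file) {
        if (filesize($file) > $maxBytes) {
            throw new LengthException("File too large: {$file}");
        }
    }
}
```

Comparing against the base path plus a trailing separator prevents a sibling directory such as `/documents-old` from passing a check intended for `/documents`.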

Block 6: OCR Support (v1.5.0+) 🚀

  • PdfOcrExtractor for scanned PDFs using Tesseract
  • Automatic fallback when digital extraction fails
  • Transparent OCR processing without code changes
  • Preserves semantic structuring pipeline

🔍 OCR Support (Optional)

This library supports OCR for scanned PDFs using Tesseract OCR.

Requirements

  • Tesseract OCR installed on the system
  • Language data files (e.g., eng for English)
  • Installation is handled by the operating system, not Composer
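
Before enabling OCR, it can help to confirm that Tesseract and the language data you need are actually installed. The sketch below calls the real `tesseract --list-langs` command through `exec()`. The `tesseractLanguages()` helper is hypothetical, and the snippet assumes `exec()` is not disabled in your PHP configuration.

```php
<?php
// Hypothetical helper: returns the installed Tesseract language codes,
// or null when the binary cannot be run.
function tesseractLanguages(string $binary = 'tesseract'): ?array
{
    $output = [];
    $exitCode = 1;
    exec(escapeshellarg($binary) . ' --list-langs 2>&1', $output, $exitCode);
    if ($exitCode !== 0) {
        return null;
    }
    // Skip the "List of available languages ..." header line.
    $langs = array_filter(
        array_map('trim', $output),
        static fn (string $line): bool => $line !== '' && !str_starts_with($line, 'List of')
    );
    return array_values($langs);
}

$langs = tesseractLanguages();
if ($langs === null) {
    echo "Tesseract not found; OCR will not be available.\n";
} elseif (!in_array('eng', $langs, true)) {
    echo "Tesseract found, but the 'eng' language data is missing.\n";
} else {
    echo "Tesseract is ready (" . count($langs) . " languages).\n";
}
```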

Automatic Fallback

OCR is used automatically when digital text extraction returns insufficient text, that is, when the extracted text:

  • Is empty or shorter than the threshold (default: 50 characters)
  • Contains no alphabetic characters
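
The fallback conditions above can be expressed as a single predicate. This is a minimal sketch, assuming the threshold is applied to the trimmed text. The `needsOcr()` function is hypothetical and is not the check `CompositePdfExtractor` performs internally.

```php
<?php
// Hypothetical predicate mirroring the fallback conditions listed above.
function needsOcr(string $text, int $minChars = 50): bool
{
    $trimmed = trim($text);

    // Count characters rather than bytes when mbstring is available.
    $length = function_exists('mb_strlen')
        ? mb_strlen($trimmed, 'UTF-8')
        : strlen($trimmed);

    if ($length < $minChars) {
        return true; // empty or below the threshold
    }

    // No Unicode letter at all (also true for invalid UTF-8 input).
    return preg_match('/\p{L}/u', $trimmed) !== 1;
}

var_dump(needsOcr(''));                               // empty text
var_dump(needsOcr(str_repeat('1234567890 ', 10)));    // long, but no letters
var_dump(needsOcr(str_repeat('Lorem ipsum ', 10)));   // long enough, has letters
```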

Example with OCR

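use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\CompositePdfExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;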

// Automatically tries digital extraction first, then OCR if needed
$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new CompositePdfExtractor())  // Tries PDF text first, then OCR
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal();

Important Notes

  • OCR is optional; the library processes digital PDFs without it
  • Tesseract is not installed by Composer and must be installed separately
  • OCR support does not change schema behavior
  • Aliases are still defined by your application
  • If Tesseract is not available, the library reports a clear error message

📚 Documentation

🔌 API Reference

FinalResult

$result = ContentProcessor::make()->...->processFinal();

// Access data
$result->data();           // Array of successful documents
$result->errors();         // Array of normalized errors
$result->warnings();       // Array of semantic warnings
$result->summary();        // Summary with statistics

// Status checks
$result->isSuccessful();   // bool - At least 1 successful?
$result->isPerfect();      // bool - No errors or warnings?
$result->hasErrors();      // bool
$result->hasWarnings();    // bool

// Filtering
$result->errorsByType('validation');
$result->warningsByField('email');
$result->warningsByCategory('missing_value');

// Serialization
$result->toArray();        // array
$result->toJSON();         // string (compact)
$result->toJSONPretty();   // string (formatted)
$result->fullResults();    // array (complete audit trail)

🚀 Production Ready

The library is tested and ready for production deployment. See SECURITY.md for deployment recommendations.

📋 Requirements

  • PHP >= 8.1
  • Composer
  • (Optional) Tesseract OCR for scanned PDF support

📄 License

MIT