content-extract / content-processor
Robust PHP library for batch document processing. Extracts content from PDFs/text and generates structured JSON according to user-defined schemas. Now with semantic structuring, OCR support for scanned PDFs, text normalization, and alias-driven field matching. Production-ready, secure, zero unnecess
Package info
github.com/saul9809/content_extract-library
pkg:composer/content-extract/content-processor
Requires
- php: >=8.1
- smalot/pdfparser: ^2.0
Requires (Dev)
- phpunit/phpunit: ^11.0
- squizlabs/php_codesniffer: ^3.7
README
Production-ready PHP library for batch document processing with intelligent content extraction and structuring.
Framework-agnostic, scalable, and optimized for real-world document pipelines from day one.
Purpose
Process multiple documents (PDFs, text files, images, etc.), extract their content, and convert it into configurable JSON structures ready for bulk loading into databases or services.
Quick Example
$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal(); // Returns a FinalResult object
Installation
composer require content-extract/content-processor:^1.4.0
Or add to your composer.json:
{
"require": {
"content-extract/content-processor": "^1.4.0"
}
}
Project Structure
src/
├── Contracts/          # Interfaces defining the contract
│   ├── ExtractorInterface.php
│   ├── StructurerInterface.php
│   └── SchemaInterface.php
├── Core/               # Main classes
│   └── ContentProcessor.php
├── Extractors/         # Extractor implementations
│   ├── PdfTextExtractor.php
│   ├── TextFileExtractor.php
│   ├── PdfOcrExtractor.php (v1.5.0+)
│   └── CompositePdfExtractor.php (v1.5.0+)
├── Schemas/            # Schema implementations
│   └── ArraySchema.php
├── Structurers/        # Structurer implementations
│   ├── SimpleLineStructurer.php
│   ├── RuleBasedStructurer.php
│   └── SchemaAwareStructurer.php
├── Utils/              # Utilities
│   ├── TextNormalizer.php
│   └── TextSegmenter.php
└── Models/             # Domain models
    ├── Warning.php
    ├── Error.php
    └── FinalResult.php
examples/
├── example_basic.php
├── example_semantic_structuring.php
└── sample_cv_*.txt
Quick Start
1. Define Your Schema
use ContentProcessor\Schemas\ArraySchema;

$schema = new ArraySchema([
    'name' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['name', 'full name', 'applicant name'],
    ],
    'email' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['email', 'email address'],
    ],
    'experience_years' => [
        'type' => 'integer',
        'required' => false,
        'aliases' => ['years of experience', 'experience'],
    ],
]);
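To make the role of aliases concrete, here is a minimal standalone sketch that matches "Label: value" lines against the aliases in a schema definition. The function name `matchByAliases` and the "Label: value" line format are assumptions made for illustration. The README does not show how SchemaAwareStructurer matches fields internally, so the real logic may differ.

```php
<?php

/**
 * Hypothetical helper (not part of the library): maps "Label: value" lines
 * to schema fields by comparing the label against each field's aliases.
 */
function matchByAliases(string $text, array $definition): array
{
    $result = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Only consider lines shaped like "Label: value".
        if (!preg_match('/^\s*([^:]+):\s*(.+?)\s*$/', $line, $m)) {
            continue;
        }
        $label = strtolower(trim($m[1]));
        foreach ($definition as $field => $rules) {
            $aliases = array_map('strtolower', $rules['aliases'] ?? []);
            // First match wins; later lines do not overwrite a field.
            if (!isset($result[$field]) && in_array($label, $aliases, true)) {
                // Cast to the declared type; only 'integer' is handled here.
                $result[$field] = ($rules['type'] ?? 'string') === 'integer'
                    ? (int) $m[2]
                    : $m[2];
            }
        }
    }
    return $result;
}

$definition = [
    'name'  => ['type' => 'string', 'aliases' => ['name', 'full name']],
    'email' => ['type' => 'string', 'aliases' => ['email', 'email address']],
    'experience_years' => [
        'type' => 'integer',
        'aliases' => ['years of experience', 'experience'],
    ],
];

$text = "Full Name: Jane Doe\nEmail Address: jane@example.com\nExperience: 7 years";
$fields = matchByAliases($text, $definition);
```

The comparison is case-insensitive and exact, so "Full Name" matches the alias "full name" but "Candidate name" would not. A production matcher would likely need fuzzier comparison and richer type handling.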
2. Configure the Processor
use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\PdfTextExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/path/to/documents', '*.pdf')
    ->processFinal();
3. Consume Results
// Check status: hasErrors() is true when at least one document failed
if ($result->hasErrors()) {
    echo "Some documents failed:\n";
    foreach ($result->errors() as $error) {
        echo " - " . $error->getMessage() . "\n";
    }
}

// Process successful data
foreach ($result->data() as $item) {
    echo "Processed: " . $item['document'] . "\n";
    // $item['data'] contains the structured data
    var_dump($item['data']);
}

// Inspect quality warnings
if ($result->hasWarnings()) {
    foreach ($result->warnings() as $warning) {
        echo "Warning: field '{$warning->getField()}': {$warning->getMessage()}\n";
    }
}

// Export to JSON
echo $result->toJSONPretty();
Testing
Run Examples
cd examples
php example_basic.php
php example_semantic_structuring.php
Full Test Suite
composer test
Code Quality
composer lint
Available Interfaces
ExtractorInterface
interface ExtractorInterface
{
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}
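As an illustration of how this contract could be implemented, the sketch below defines a hypothetical `MarkdownFileExtractor`. It redeclares the interface locally so the snippet runs on its own. In a real project you would import the library's interface instead (presumably `ContentProcessor\Contracts\ExtractorInterface`, a namespace inferred from the project structure rather than confirmed). The keys of the returned array are also an assumption, since the README does not document the expected shape.

```php
<?php

// Local copy of the interface shown above, so this sketch is self-contained.
interface ExtractorInterface
{
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}

/** Hypothetical extractor for Markdown files; not shipped with the library. */
final class MarkdownFileExtractor implements ExtractorInterface
{
    public function extract(string $source): array
    {
        $content = @file_get_contents($source);
        if ($content === false) {
            throw new RuntimeException("Cannot read file: {$source}");
        }
        // Assumed return shape: the array keys are not documented in the README.
        return ['source' => $source, 'text' => $content];
    }

    public function canHandle(string $source): bool
    {
        // Decide by file extension, case-insensitively.
        return strtolower(pathinfo($source, PATHINFO_EXTENSION)) === 'md';
    }

    public function getName(): string
    {
        return 'markdown-file';
    }
}

$extractor = new MarkdownFileExtractor();
```

An instance like this would then be passed to `withExtractor()` in place of `PdfTextExtractor`, provided the return shape matches what the chosen structurer expects.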
StructurerInterface
interface StructurerInterface
{
    public function structure(array $content, SchemaInterface $schema): array;
    public function getName(): string;
}
SchemaInterface
interface SchemaInterface
{
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}
Processor Options
$processor->withOptions([
    'skip_invalid' => true,    // Skip documents that fail validation
    'preserve_empty' => false, // Preserve empty fields in result
]);
Implemented Features (Blocks 1-6)
Block 1: Core
- Framework-agnostic design with clean interfaces
- Extractor/Structurer pattern
- JSON schema validation
- Batch processing
Block 2: PDF Support
- PdfTextExtractor with smalot/pdfparser
- Batch processing with multiple PDFs
- Robust error handling
Block 3: Semantic Structuring
- SchemaAwareStructurer for intelligent extraction
- Field aliases for semantic guidance
- Text normalization and segmentation
- Advanced warning system
- Type conversion and validation
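Text normalization and segmentation can be pictured with a small standalone sketch. The functions `normalizeText` and `segmentText` are hypothetical and written only for illustration. The behavior of the library's TextNormalizer and TextSegmenter classes is not described in this README and may differ.

```php
<?php

/** Hypothetical normalizer: unify line endings, collapse spaces, trim lines. */
function normalizeText(string $text): string
{
    $text = str_replace(["\r\n", "\r"], "\n", $text); // unify line endings
    $text = preg_replace('/[ \t]+/', ' ', $text);      // collapse spaces and tabs
    $lines = array_map('trim', explode("\n", $text));  // trim each line
    // Drop empty lines so segmentation only sees content.
    $lines = array_filter($lines, fn (string $line): bool => $line !== '');
    return implode("\n", $lines);
}

/** Hypothetical segmenter: one segment per non-empty normalized line. */
function segmentText(string $text): array
{
    $normalized = normalizeText($text);
    return $normalized === '' ? [] : explode("\n", $normalized);
}

$segments = segmentText("Name:\t Jane   Doe \r\n\r\n  Email: jane@example.com  ");
```

Normalizing before matching means aliases only have to be compared against clean, single-spaced lines rather than raw extractor output.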
Block 4: Final Result API
- Unified FinalResult object
- Error and warning normalization
- Summary with statistics
- JSON export and serialization
Block 5: Security & Hardening
- File size limits (10 MB default)
- Batch document limits (50 documents default)
- Path traversal protection
- Configurable security validation
- Production-ready defaults
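The limits listed above can be sketched as a standalone pre-flight check. `checkBatch` is a hypothetical function, not the library's implementation. Its defaults mirror the documented ones (50 documents, 10 MB), and the containment test relies on PHP's `realpath()`, which resolves `..` segments and symlinks.

```php
<?php

/**
 * Hypothetical pre-flight check (not the library's code): returns a list of
 * human-readable problems found in a batch of file paths.
 */
function checkBatch(
    array $paths,
    string $baseDir,
    int $maxFiles = 50,
    int $maxBytes = 10 * 1024 * 1024
): array {
    $base = realpath($baseDir);
    if ($base === false) {
        return ["Base directory does not exist: {$baseDir}"];
    }
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;

    $problems = [];
    if (count($paths) > $maxFiles) {
        $problems[] = sprintf('Batch has %d documents; limit is %d', count($paths), $maxFiles);
    }
    foreach ($paths as $path) {
        $real = realpath($path); // false when the path does not exist
        // Path traversal guard: the resolved path must stay under the base.
        if ($real === false || !str_starts_with($real, $prefix)) {
            $problems[] = "Outside base directory or missing: {$path}";
            continue;
        }
        if (filesize($real) > $maxBytes) {
            $problems[] = "File exceeds {$maxBytes} bytes: {$path}";
        }
    }
    return $problems;
}
```

Calling `checkBatch($paths, '/path/to/documents')` before processing returns an empty array when every file is inside the base directory and within both limits.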
Block 6: OCR Support (v1.5.0+)
- PdfOcrExtractor for scanned PDFs using Tesseract
- Automatic fallback when digital extraction fails
- Transparent OCR processing without code changes
- Preserves semantic structuring pipeline
OCR Support (Optional)
This library supports OCR for scanned PDFs using Tesseract OCR.
Requirements
- Tesseract OCR installed on the system
- Language data files (e.g., eng for English)
- Installation is handled by the operating system, not Composer
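Because Tesseract is installed outside Composer, it can be useful to verify that it is present before processing scanned PDFs. The sketch below is a hypothetical helper, not part of the library. It assumes the binary is on the PATH under the name `tesseract` and that `tesseract --version` exits with status 0 when the installation works.

```php
<?php

/**
 * Hypothetical helper: reports whether a Tesseract binary can be executed.
 * Assumes `<binary> --version` exits with status 0 on a working install.
 */
function tesseractAvailable(string $binary = 'tesseract'): bool
{
    $output = [];
    $exitCode = 1;
    // Redirect stderr so "command not found" messages do not leak to the console.
    exec(escapeshellarg($binary) . ' --version 2>&1', $output, $exitCode);
    return $exitCode === 0;
}

$ocrReady = tesseractAvailable();
```

To see which language data files are installed, run `tesseract --list-langs` from a shell.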
Automatic Fallback
OCR is automatically used when:
- Digital text extraction returns insufficient text
- Extracted text is empty or below threshold (default: 50 characters)
- Extracted text contains no alphabetic characters
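The fallback conditions above can be expressed as a small predicate. This is a hypothetical sketch, and the actual check inside CompositePdfExtractor may differ. The function name `needsOcrFallback` is an assumption, as is measuring the threshold in bytes with `strlen()`.

```php
<?php

/**
 * Hypothetical heuristic mirroring the listed conditions: returns true when
 * digitally extracted text looks too poor to use and OCR should be tried.
 */
function needsOcrFallback(string $extractedText, int $minChars = 50): bool
{
    $text = trim($extractedText);
    // Empty or below the threshold. strlen() counts bytes; mb_strlen() would
    // count characters if the mbstring extension is available.
    if (strlen($text) < $minChars) {
        return true;
    }
    // No alphabetic characters at all (Unicode-aware letter class).
    return preg_match('/\p{L}/u', $text) !== 1;
}
```

For example, a page that yields only digits and punctuation triggers the fallback even when it exceeds 50 characters, because the last condition fails.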
Example with OCR
use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\CompositePdfExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

// Tries digital text extraction first, then falls back to OCR if needed
$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new CompositePdfExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal();
Important Notes
- OCR is optional - the library works fine with digital PDFs
- OCR is NOT installed by Composer
- OCR support does not change schema behavior
- Aliases are still defined by your application
- If Tesseract is not available, clear error messages are provided
Documentation
- ARCHITECTURE.md - Complete architectural design
- SECURITY.md - Security policy and configurable limits
- SEMANTIC_STRUCTURING_GUIDE.md - Schema aliases and matching
- QUICK_START_V1.4.0.md - Quick reference for v1.4.0+
API Reference
FinalResult
$result = ContentProcessor::make()->...->processFinal();

// Access data
$result->data();      // Array of successful documents
$result->errors();    // Array of normalized errors
$result->warnings();  // Array of semantic warnings
$result->summary();   // Summary with statistics

// Status checks
$result->isSuccessful();  // bool - At least 1 successful?
$result->isPerfect();     // bool - No errors or warnings?
$result->hasErrors();     // bool
$result->hasWarnings();   // bool

// Filtering
$result->errorsByType('validation');
$result->warningsByField('email');
$result->warningsByCategory('missing_value');

// Serialization
$result->toArray();       // array
$result->toJSON();        // string (compact)
$result->toJSONPretty();  // string (formatted)
$result->fullResults();   // array (complete audit trail)
Production Ready
The library is tested and ready for production deployment. See SECURITY.md for deployment recommendations.
Requirements
- PHP >= 8.1
- Composer
- (Optional) Tesseract OCR for scanned PDF support
License
MIT