daniel-jorg-schuppelius / php-pdf-toolkit
PHP 8.2+ library for PDF text extraction with automatic reader selection. Supports embedded text and scanned documents via OCR.
Requires
- php: ^8.2 || ^8.3 || ^8.4
- ext-gd: *
- dompdf/dompdf: ^3.1
- dschuppelius/php-common-toolkit: ^1.1
- setasign/fpdi: ^2.3
- tecnickcom/tcpdf: ^6.6
Requires (Dev)
- phpunit/phpunit: ^11.0
README
A PHP 8.2+ library for extracting text from PDF documents and creating PDFs with intelligent reader/writer selection.
Features
PDF Text Extraction (Readers)
- Multiple PDF Readers with automatic fallback:
  - pdftotext (poppler-utils) - Fast extraction for text-based PDFs
  - PDFBox (Apache, Java) - Better handling of complex layouts
  - Tesseract - OCR for scanned documents
  - OCRmyPDF - High-quality OCR with preprocessing
- Automatic Reader Selection - Tries text extraction first, falls back to OCR if needed
- Caching - Extracted text is cached to avoid redundant processing
- Language Support - Configurable OCR languages (German + English by default)
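The fallback and caching behaviour listed above can be pictured with a minimal sketch. This is not the library's implementation: the function `extractWithFallback`, the `$minChars` threshold, the path-keyed cache, and the stub readers are all hypothetical and only illustrate "try text extraction first, fall back to OCR, remember the result". The priorities 10 and 50 are borrowed from the Architecture section.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch: try readers in ascending priority and cache the result.
 * Each entry in $readers is [name, callable(string $path): string].
 */
function extractWithFallback(string $path, array $readers, array &$cache, int $minChars = 1): array
{
    // Assumption: the cache is keyed on the path; a real cache might key on a content hash.
    if (isset($cache[$path])) {
        return $cache[$path];
    }

    ksort($readers); // lower priority number is tried first

    foreach ($readers as [$name, $read]) {
        $text = trim($read($path));
        if (strlen($text) >= $minChars) {
            // First reader that yields enough text wins.
            return $cache[$path] = ['text' => $text, 'reader' => $name];
        }
    }

    // No reader produced usable text.
    return $cache[$path] = ['text' => '', 'reader' => null];
}

// Stub readers standing in for pdftotext and Tesseract.
$readers = [
    50 => ['tesseract', fn(string $p): string => 'text recovered by OCR'],
    10 => ['pdftotext', fn(string $p): string => ''], // scanned PDF: no embedded text
];

$cache = [];
$result = extractWithFallback('/path/to/scan.pdf', $readers, $cache);
echo $result['reader'], PHP_EOL;
```

With these stubs the text reader returns nothing, so the OCR stub supplies the result, and a second call for the same path is served from `$cache` without invoking any reader.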
PDF Creation (Writers)
- Multiple PDF Writers with automatic fallback:
  - Dompdf - HTML to PDF conversion (pure PHP, LGPL)
  - TCPDF - Programmatic PDF creation (pure PHP, LGPL)
  - wkhtmltopdf - High-quality HTML rendering via WebKit (external tool)
- Automatic Writer Selection - Uses the first available writer by priority
- Multiple Input Formats - HTML, plain text, or HTML files
- Metadata Support - Title, author, subject for generated PDFs
Requirements
- PHP 8.2+
For Text Extraction (at least one)
- pdftotext (apt install poppler-utils)
- tesseract-ocr (apt install tesseract-ocr tesseract-ocr-deu)
- ocrmypdf (apt install ocrmypdf)
- Java + PDFBox JAR (optional)
For PDF Creation (at least one)
- dompdf/dompdf (composer require dompdf/dompdf)
- tecnickcom/tcpdf (composer require tecnickcom/tcpdf)
- wkhtmltopdf (apt install wkhtmltopdf)
Installation
Via Composer
composer require daniel-jorg-schuppelius/php-pdf-toolkit
Clone with Submodules
git clone --recurse-submodules https://github.com/Daniel-Jorg-Schuppelius/php-pdf-toolkit.git
Or if already cloned:
git submodule update --init
Install System Dependencies
Use the included install script for system dependencies:
# Install PDF extraction tools (poppler-utils, tesseract, ocrmypdf)
sudo ./installscript/install-dependencies.sh
Install PHP Libraries for PDF Creation
# Dompdf (recommended, pure PHP)
composer require dompdf/dompdf

# Or TCPDF (alternative, pure PHP)
composer require tecnickcom/tcpdf

# Or wkhtmltopdf (external tool, best quality)
sudo apt install wkhtmltopdf
Usage
Text Extraction
use PDFToolkit\Registries\PDFReaderRegistry;

$registry = new PDFReaderRegistry();
$document = $registry->extractText('/path/to/file.pdf', [
    'language' => 'deu+eng'
]);

if ($document->hasText()) {
    echo $document->text;
    echo "Reader: " . $document->reader;
    echo "Scanned: " . ($document->isScanned ? 'Yes' : 'No');
}
PDF Creation
use PDFToolkit\Registries\PDFWriterRegistry;
use PDFToolkit\Entities\PDFContent;

$registry = new PDFWriterRegistry();

// Simple: HTML to PDF
$registry->htmlToPdf('<h1>Hello World</h1><p>Content</p>', '/path/to/output.pdf');

// Simple: Text to PDF
$registry->textToPdf('Plain text content', '/path/to/output.pdf');

// Advanced: With metadata and options
$html = '<h1>Hello World</h1><p>Content</p>';
$content = PDFContent::fromHtml($html, [
    'title' => 'My Document',
    'author' => 'John Doe',
    'subject' => 'Example PDF'
]);

$registry->createPdf($content, '/path/to/output.pdf', [
    'paper_size' => 'A4',
    'orientation' => 'portrait',
    'margins' => ['top' => 15, 'bottom' => 15, 'left' => 15, 'right' => 15]
]);

// Use specific writer
$registry->createPdf($content, '/path/to/output.pdf', [], 'dompdf');

// Get PDF as string (for download/streaming)
$pdfString = $registry->createPdfString($content);
header('Content-Type: application/pdf');
echo $pdfString;
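When the PDF string is sent to a browser, it is common to add `Content-Disposition` and `Content-Length` headers next to `Content-Type`. The helper below is a hedged sketch and not part of the toolkit: `pdfResponseHeaders` is a hypothetical name, and the stand-in string replaces the output of `createPdfString()` so the example runs on its own.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build the header lines for a PDF response.
 * $inline = true asks the browser to display the PDF instead of downloading it.
 */
function pdfResponseHeaders(string $pdf, string $filename, bool $inline = false): array
{
    $disposition = $inline ? 'inline' : 'attachment';

    // Drop directory parts and characters that would break the quoted filename.
    $safeName = str_replace(['"', "\r", "\n"], '', basename($filename));

    return [
        'Content-Type: application/pdf',
        sprintf('Content-Disposition: %s; filename="%s"', $disposition, $safeName),
        'Content-Length: ' . strlen($pdf), // strlen counts bytes, which is what the header needs
    ];
}

// Stand-in for $registry->createPdfString($content).
$pdfString = '%PDF-1.4 stub';

$headers = pdfResponseHeaders($pdfString, 'report.pdf');
echo implode(PHP_EOL, $headers), PHP_EOL;
```

In a web request, each line would be passed to `header()` before echoing `$pdfString`.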
Check Available Tools
use PDFToolkit\Registries\PDFReaderRegistry;
use PDFToolkit\Registries\PDFWriterRegistry;

// Readers
$readerRegistry = new PDFReaderRegistry();
foreach ($readerRegistry->getReaderInfo() as $info) {
    echo "{$info['name']}: " . ($info['available'] ? '✓' : '✗') . "\n";
}

// Writers
$writerRegistry = new PDFWriterRegistry();
foreach ($writerRegistry->getWriterInfo() as $info) {
    echo "{$info['name']}: " . ($info['available'] ? '✓' : '✗') . "\n";
}
Configuration
Tool paths can be configured in config/executables.json:
{
    "shellExecutables": {
        "pdftotext": {
            "path": "/usr/bin/pdftotext",
            "required": true
        },
        "wkhtmltopdf": {
            "path": "/usr/bin/wkhtmltopdf",
            "required": false
        }
    }
}
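To sanity-check such a file before running the toolkit, a small script can report required tools whose configured path is not executable. This is a sketch under assumptions: `missingRequiredExecutables` is a hypothetical helper, and reading `required` as "must exist and be executable" is a guess at the intent, not documented behaviour. The `/nonexistent/...` paths are deliberately invalid so the example has something to report.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical check: return the names of required executables
 * whose configured path is missing or not executable.
 */
function missingRequiredExecutables(string $json): array
{
    // Throws JsonException on malformed input instead of returning null.
    $config = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    $missing = [];
    foreach ($config['shellExecutables'] ?? [] as $name => $entry) {
        $required = $entry['required'] ?? false;
        $path = $entry['path'] ?? '';
        if ($required && !is_executable($path)) {
            $missing[] = $name;
        }
    }
    return $missing;
}

// Same shape as config/executables.json, with paths that do not exist.
$json = <<<'JSON'
{
    "shellExecutables": {
        "pdftotext":   { "path": "/nonexistent/pdftotext",   "required": true },
        "wkhtmltopdf": { "path": "/nonexistent/wkhtmltopdf", "required": false }
    }
}
JSON;

$missing = missingRequiredExecutables($json);
print_r($missing);
```

Only `pdftotext` is reported here, because `wkhtmltopdf` is marked as not required.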
Architecture
PDFReaderRegistry → [Readers by Priority] → PDFDocument
        ↓
    PdftotextReader (10)     # Fast, for text PDFs
    PdfboxReader (30)        # Complex layouts
    TesseractReader (50)     # OCR for scans
    OcrmypdfReader (60)      # Best OCR quality

PDFWriterRegistry → [Writers by Priority] → PDF File
        ↓
    DompdfWriter (10)        # HTML→PDF, pure PHP
    TcpdfWriter (20)         # Programmatic, pure PHP
    WkhtmltopdfWriter (30)   # Best HTML rendering
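The "first available writer by priority" rule in the diagram can be sketched as follows. The `Writer` interface, `StubWriter` class, and `pickWriter` function are hypothetical and do not mirror the toolkit's real classes; only the writer names and priority numbers come from the diagram.

```php
<?php
declare(strict_types=1);

// Hypothetical interface; the toolkit's actual writer contract may differ.
interface Writer
{
    public function name(): string;
    public function priority(): int;
    public function isAvailable(): bool;
}

final class StubWriter implements Writer
{
    public function __construct(
        private string $name,
        private int $priority,
        private bool $available,
    ) {}

    public function name(): string { return $this->name; }
    public function priority(): int { return $this->priority; }
    public function isAvailable(): bool { return $this->available; }
}

/**
 * Return the available writer with the lowest priority number,
 * or the named writer if $preferred is given and available.
 */
function pickWriter(array $writers, ?string $preferred = null): ?Writer
{
    usort($writers, fn(Writer $a, Writer $b): int => $a->priority() <=> $b->priority());

    foreach ($writers as $writer) {
        if (!$writer->isAvailable()) {
            continue; // skip writers whose library or binary is missing
        }
        if ($preferred === null || $writer->name() === $preferred) {
            return $writer;
        }
    }
    return null;
}

$writers = [
    new StubWriter('wkhtmltopdf', 30, true),
    new StubWriter('tcpdf', 20, true),
    new StubWriter('dompdf', 10, false), // pretend Dompdf is not installed
];

$chosen = pickWriter($writers);
echo $chosen?->name(), PHP_EOL;
```

Because the Dompdf stub is marked unavailable, selection falls through to the next priority, which mirrors the automatic fallback described for writers.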
License
AGPL-3.0-or-later - see LICENSE file.