jbo / pdf-extractor
This library helps extract content from a PDF file.
Requires
- php: >=8.1.0
- smalot/pdfparser: ^v2.12.0
This package is auto-updated.
Last update: 2025-05-09 09:10:33 UTC
README
A PHP library for extracting text content from PDF files with multiple extraction methods.
Overview
This library provides a flexible way to extract text from PDF files using different extraction methods. It currently supports:
- SmalotPdfParser - A PHP-based PDF parser
- Pdftotext - Command-line utility from Poppler tools
Requirements
- PHP 8.1 or higher
- Composer
- For Pdftotext extractor: Poppler tools installed on your system
Installation
Install via Composer:
composer require jbo/pdf-extractor
Usage
Basic Usage
<?php
require 'vendor/autoload.php';
use Jbo\PdfExtractor\PdfTextExtractor;
use Jbo\PdfExtractor\Extractor\SmalotPdfParserExtractor;
// 1. Choose an extractor
$extractor = new SmalotPdfParserExtractor();
// 2. Initialize the service
$service = new PdfTextExtractor($extractor);
// 3. Extract text from a PDF file
try {
    $text = $service->extract('/path/to/document.pdf');
    echo $text;
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . PHP_EOL;
}
Using Pdftotext Extractor (Windows)
<?php
require 'vendor/autoload.php';
use Jbo\PdfExtractor\PdfTextExtractor;
use Jbo\PdfExtractor\Extractor\PdftotextExtractor;
// Specify the path to pdftotext.exe from Poppler for Windows
$extractor = new PdftotextExtractor('C:\\path\\to\\poppler\\bin\\pdftotext.exe');
$service = new PdfTextExtractor($extractor);
// Extract text
$text = $service->extract('/path/to/document.pdf');
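The example above hardcodes the location of the binary. Where pdftotext is already on the system PATH, the path could be looked up instead. The following is a minimal sketch: `findBinary` is a hypothetical helper, not part of this library, and it only assumes that `PdftotextExtractor` accepts the binary path as shown above.

```php
<?php

/**
 * Hypothetical helper (not part of jbo/pdf-extractor): searches the
 * directories listed in PATH for a binary and returns its full path,
 * or null when it is not found.
 */
function findBinary(string $name, ?string $pathEnv = null): ?string
{
    // getenv() returns false when PATH is unset; the cast turns that into ''.
    $pathEnv ??= (string) getenv('PATH');

    // Windows executables usually carry an .exe suffix.
    $candidates = PHP_OS_FAMILY === 'Windows' ? [$name . '.exe', $name] : [$name];

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($candidates as $file) {
            $full = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $file;
            if (is_file($full) && is_executable($full)) {
                return $full;
            }
        }
    }

    return null;
}

// Usage (requires the library, so it is left commented out here):
// $binary = findBinary('pdftotext');
// if ($binary !== null) {
//     $extractor = new PdftotextExtractor($binary);
// }
```

When the lookup returns null, falling back to an explicitly configured path, as in the Windows example, is the safer choice.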
Extractors
SmalotPdfParserExtractor
Uses the smalot/pdfparser library to extract text from PDF files. This is a pure PHP solution that doesn't require external dependencies.
PdftotextExtractor
Uses the pdftotext command-line utility from Poppler tools to extract text. This method may provide better results for certain PDF files but requires the Poppler tools to be installed on your system.
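Since the two extractors can give different results on the same file, one option is to try them in order and use the first one that succeeds. The sketch below is an illustration only: `extractWithFallback` is a hypothetical helper, not part of this library, and it takes plain callables so that it runs without the package installed. It assumes a failed extraction surfaces as a `RuntimeException`.

```php
<?php

/**
 * Hypothetical helper (not part of jbo/pdf-extractor): tries each
 * extraction callable in order and returns the first result. A
 * RuntimeException from one callable moves on to the next one.
 *
 * @param array<string, callable(string): string> $extractors name => callable
 */
function extractWithFallback(array $extractors, string $path): string
{
    $last = null;

    foreach ($extractors as $name => $extract) {
        try {
            return $extract($path);
        } catch (RuntimeException $e) {
            // Remember the failure and keep the original exception as "previous".
            $last = new RuntimeException("{$name} failed: " . $e->getMessage(), 0, $e);
        }
    }

    // Every extractor failed, or the list was empty.
    throw $last ?? new RuntimeException('No extractors were provided.');
}

// Usage (requires the library, so it is left commented out here):
// $smalot    = new PdfTextExtractor(new SmalotPdfParserExtractor());
// $pdftotext = new PdfTextExtractor(new PdftotextExtractor('/path/to/pdftotext'));
// $text = extractWithFallback([
//     'smalot'    => fn (string $p): string => $smalot->extract($p),
//     'pdftotext' => fn (string $p): string => $pdftotext->extract($p),
// ], '/path/to/document.pdf');
```

Only `RuntimeException` triggers the fallback. Other exceptions, such as one for a missing file, are left to propagate, because a different extractor would not fix them.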
Error Handling
The library throws exceptions in the following cases:
- InvalidArgumentException: When the PDF file doesn't exist or isn't readable
- RuntimeException: When text extraction fails
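The two cases can be caught separately so that a bad path and a failed extraction are reported differently. The following is a minimal sketch, assuming the exceptions are PHP's built-in `InvalidArgumentException` and `RuntimeException`. `safeExtract` is a hypothetical helper, not part of this library, and it takes a callable so that it can be run without the package installed.

```php
<?php

/**
 * Hypothetical helper (not part of jbo/pdf-extractor): runs an extraction
 * callable and converts the two documented exception types into a result array.
 *
 * @param callable(string): string $extract e.g. fn ($p) => $service->extract($p)
 * @return array{ok: bool, text: ?string, error: ?string}
 */
function safeExtract(callable $extract, string $path): array
{
    try {
        return ['ok' => true, 'text' => $extract($path), 'error' => null];
    } catch (InvalidArgumentException $e) {
        // Documented case: the file does not exist or is not readable.
        return ['ok' => false, 'text' => null, 'error' => 'Invalid file: ' . $e->getMessage()];
    } catch (RuntimeException $e) {
        // Documented case: the text extraction itself failed.
        return ['ok' => false, 'text' => null, 'error' => 'Extraction failed: ' . $e->getMessage()];
    }
}

// Usage (requires the library, so it is left commented out here):
// $result = safeExtract(
//     fn (string $p): string => $service->extract($p),
//     '/path/to/document.pdf'
// );
// echo $result['ok'] ? $result['text'] : $result['error'];
```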
License
This library is licensed under the MIT License. See the LICENSE file for details.
Author
Jens Bourry