1tomany / pdf-to-image
A simple PHP library that makes extracting data from PDFs for large language models easy
pkg:composer/1tomany/pdf-to-image
Requires
- php: >=8.2
- psr/container: ^2.0
- symfony/process: ^7.2|^8.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.93
- phpstan/phpstan: ^2.1
- phpunit/phpunit: ^12.5
README
pdf-ai is a simple PHP library that makes it easy to extract data from PDFs for large language models. It relies on the Symfony Process Component to interface with the Poppler command line tools; Poppler is a PDF rendering library derived from Xpdf.
Installation
Install the library using Composer:
composer require 1tomany/pdf-ai
Installing Poppler
Before beginning, ensure the pdfinfo, pdftoppm, and pdftotext binaries are installed and located in a directory listed in the $PATH environment variable.
macOS
brew install poppler
Debian and Ubuntu
apt-get install poppler-utils
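To confirm the three binaries are reachable before using the library, a small standalone check can help. This is a minimal sketch in plain PHP, not part of the library: the findBinary() helper is a hypothetical name, and it simply walks the directories in $PATH looking for an executable file.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of this library): returns the first
 * executable file named $binary found in a PATH-style string, or null.
 */
function findBinary(string $binary, ?string $path = null): ?string
{
    // Fall back to the current process's PATH when none is given.
    $path ??= (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $directory) {
        if ('' === $directory) {
            continue;
        }

        $candidate = rtrim($directory, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $binary;

        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Report where each Poppler tool was found, if at all.
foreach (['pdfinfo', 'pdftoppm', 'pdftotext'] as $binary) {
    printf("%-10s %s\n", $binary, findBinary($binary) ?? 'NOT FOUND');
}
```

If any line reports NOT FOUND, install Poppler with one of the commands above or adjust $PATH.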
Usage
This library has three main features:
- Read PDF metadata such as the number of pages
- Rasterize one or more pages to JPEG or PNG images
- Extract text from one or more pages
Extracted data is stored in memory and can be written to the filesystem or converted to a data: URI. Because each page's data is held in memory, the extraction and rasterization methods return a \Generator that yields one result per extracted or rasterized page.
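The per-page generator and the data: URI conversion described above can be illustrated without the library at all. The sketch below is an assumption-laden stand-in: extractPages() and toDataUri() are hypothetical functions, and the page payloads are fake strings rather than real Poppler output.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a per-page extractor: yields one payload
 * per page instead of building an array of every page up front.
 *
 * @return \Generator<int, string> page number => raw bytes
 */
function extractPages(int $firstPage, int $lastPage): \Generator
{
    for ($page = $firstPage; $page <= $lastPage; ++$page) {
        // A real implementation would invoke pdftoppm or pdftotext here.
        yield $page => sprintf('fake bytes for page %d', $page);
    }
}

/** Builds a base64-encoded data: URI from raw bytes and a media type. */
function toDataUri(string $data, string $mediaType): string
{
    return sprintf('data:%s;base64,%s', $mediaType, base64_encode($data));
}

foreach (extractPages(1, 3) as $page => $data) {
    // Each payload is produced only when the loop asks for it.
    printf("Page %d: %s\n", $page, toDataUri($data, 'text/plain'));
}
```

Because the generator produces pages lazily, a caller can write each page to disk with file_put_contents() and let it go out of scope before the next one is produced.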
Using the library is easy, and you have two ways to interact with it:
- Direct: Instantiate the OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient class and call the methods directly. This method is easier to use, but comes with the cost that your application will be less flexible and testable.
- Actions: Create a container of OneToMany\PDFAI\Contract\Client\ExtractorClientInterface objects, and use the OneToMany\PDFAI\Factory\ExtractorClientFactory class to instantiate them.
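The flexibility and testability trade-off between the two approaches can be sketched in isolation. Everything in the block below is hypothetical and deliberately does not use the library's real classes: PageCounter, FixedPageCounter, and PageCounterFactory are illustrative names, and the library's actual interface and factory may look quite different.

```php
<?php

declare(strict_types=1);

// Hypothetical contract standing in for a client interface.
interface PageCounter
{
    public function countPages(string $filePath): int;
}

// Test double that can replace a real client in unit tests.
final class FixedPageCounter implements PageCounter
{
    public function __construct(private int $pages)
    {
    }

    public function countPages(string $filePath): int
    {
        return $this->pages;
    }
}

// Minimal factory: maps a client name to a lazily built instance.
final class PageCounterFactory
{
    /** @param array<string, callable(): PageCounter> $builders */
    public function __construct(private array $builders)
    {
    }

    public function create(string $name): PageCounter
    {
        if (!isset($this->builders[$name])) {
            throw new InvalidArgumentException(sprintf('No client registered for "%s".', $name));
        }

        return ($this->builders[$name])();
    }
}

$factory = new PageCounterFactory([
    'mock' => static fn (): PageCounter => new FixedPageCounter(12),
]);

printf("Pages: %d\n", $factory->create('mock')->countPages('/path/to/file.pdf'));
```

Code that depends only on the interface can be handed a real client in production and a fixed-value double in tests, which is the benefit the container-and-factory approach is aiming for.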
Note: A Symfony bundle is available if you wish to integrate this library into your Symfony applications with autowiring and configuration support.
Direct usage
<?php

require_once __DIR__ . '/vendor/autoload.php';

use OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient;
use OneToMany\PDFAI\Contract\Enum\OutputType;
use OneToMany\PDFAI\Request\ExtractDataRequest;
use OneToMany\PDFAI\Request\ExtractTextRequest;
use OneToMany\PDFAI\Request\ReadMetadataRequest;

$filePath = '/path/to/file.pdf';

// Construct the Poppler wrapper
$client = new PopplerExtractorClient();

// Construct and execute a request to read the PDF metadata
$metadata = $client->readMetadata(new ReadMetadataRequest($filePath));

vprintf("The PDF '%s' has %d page(s).\n", [
    $filePath,
    $metadata->getPages(),
]);

// Construct a request to rasterize all pages as 150 DPI JPEGs
$request = new ExtractDataRequest($filePath, 1, null, OutputType::Jpg, 150);

foreach ($client->extractData($request) as $image) {
    // $image->getData() or $image->toDataUri()
    printf("MD5: %s\n", md5($image->getData()));
}

// Extract text from pages 3 and 4
$request = new ExtractTextRequest($filePath, 3, 4);

foreach ($client->extractData($request) as $text) {
    // $text->getData()
    printf("Length: %d\n", strlen($text->getData()));
}
Test suite
Run the test suite with PHPUnit:
./vendor/bin/phpunit
Static analysis
Run static analysis with PHPStan:
./vendor/bin/phpstan
Credits
License
The MIT License