README

pdf-ai is a simple PHP library that makes it easy to extract data from PDFs for use with large language models. It has a single dependency, the Symfony Process component, which it uses to run the Poppler command-line tools (Poppler is a PDF rendering library forked from Xpdf).

Installation

Install the library using Composer:

composer require 1tomany/pdf-ai

Installing Poppler

Before you begin, ensure the pdfinfo, pdftoppm, and pdftotext binaries are installed and available in a directory listed in your $PATH environment variable.

macOS

brew install poppler

Debian and Ubuntu

apt-get install poppler-utils
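
After installing Poppler, a quick check can confirm that the three binaries are reachable. The following is a minimal sketch in plain PHP, not part of pdf-ai: the function name findMissingBinaries is hypothetical, and it assumes a Unix-like system where the binaries have no file extension.

```php
<?php

declare(strict_types=1);

/**
 * Returns the names from $binaries that cannot be found as executable
 * files in any directory listed in the PATH environment variable.
 *
 * Hypothetical helper for illustration; not part of pdf-ai.
 *
 * @param list<string> $binaries
 *
 * @return list<string>
 */
function findMissingBinaries(array $binaries): array
{
    $path = getenv('PATH');

    // Split PATH into its directories, or use an empty list if unset
    $directories = is_string($path) && $path !== ''
        ? explode(PATH_SEPARATOR, $path)
        : [];

    $missing = [];

    foreach ($binaries as $binary) {
        $found = false;

        foreach ($directories as $directory) {
            $candidate = $directory . DIRECTORY_SEPARATOR . $binary;

            // Require a regular file so directories are not matched
            if (is_file($candidate) && is_executable($candidate)) {
                $found = true;
                break;
            }
        }

        if (!$found) {
            $missing[] = $binary;
        }
    }

    return $missing;
}

$missing = findMissingBinaries(['pdfinfo', 'pdftoppm', 'pdftotext']);

if ($missing !== []) {
    echo 'Missing Poppler binaries: ' . implode(', ', $missing) . PHP_EOL;
}
```

If the script prints nothing, all three binaries were found on the $PATH.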

Usage

This library has three main features:

Read PDF metadata such as the number of pages
Rasterize one or more pages to JPEG or PNG images
Extract text from one or more pages

Extracted data is held in memory and can be written to the filesystem or converted to a data: URI. Because each page's data is held in memory, the library returns a \Generator that yields one result per extracted or rasterized page rather than loading every page at once.
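
To illustrate the idea, here is a minimal sketch in plain PHP of a generator that yields one page at a time, with each page either written to disk or encoded as a data: URI. The functions yieldPages and toDataUri are hypothetical stand-ins written for this example. They are not the library's implementation, and the page contents are placeholder strings.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for per-page extraction: yields one page of
 * data at a time instead of returning every page in a single array.
 *
 * @param list<string> $pages
 *
 * @return \Generator<int, string>
 */
function yieldPages(array $pages): \Generator
{
    foreach ($pages as $index => $data) {
        // Keys are 1-based page numbers in this sketch
        yield $index + 1 => $data;
    }
}

/**
 * Encodes raw bytes as a base64 data: URI.
 */
function toDataUri(string $data, string $mimeType): string
{
    return sprintf('data:%s;base64,%s', $mimeType, base64_encode($data));
}

foreach (yieldPages(['first page', 'second page']) as $number => $data) {
    // Either persist the page to the filesystem...
    $path = sprintf('%s/page-%d.txt', sys_get_temp_dir(), $number);
    file_put_contents($path, $data);

    // ...or embed it inline as a data: URI
    echo toDataUri($data, 'text/plain'), PHP_EOL;
}
```

Because the loop handles each page as it is yielded, a caller that writes pages to disk never needs to keep more than one page in memory.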

There are two ways to use the library:

Direct: Instantiate the OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient class and call its methods directly. This approach is simpler, but it makes your application less flexible and harder to test.
Actions: Create a container of OneToMany\PDFAI\Contract\Client\ExtractorClientInterface objects, and use the OneToMany\PDFAI\Factory\ExtractorClientFactory class to instantiate them.
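
The testability trade-off can be sketched with a generic factory over an interface. Everything in this example is hypothetical: SketchClientInterface, FakeClient, and SketchClientFactory are names invented for illustration, and the real ExtractorClientInterface and ExtractorClientFactory may be shaped differently. The sketch only shows why depending on an interface makes it possible to swap in a fake client during tests.

```php
<?php

declare(strict_types=1);

// Hypothetical contract; the library's real interface may differ
interface SketchClientInterface
{
    public function getName(): string;

    public function countPages(string $filePath): int;
}

// Test double that never touches the filesystem or Poppler
final class FakeClient implements SketchClientInterface
{
    public function __construct(private int $pages)
    {
    }

    public function getName(): string
    {
        return 'fake';
    }

    public function countPages(string $filePath): int
    {
        return $this->pages;
    }
}

// Hypothetical factory that looks up a registered client by name
final class SketchClientFactory
{
    /** @var array<string, SketchClientInterface> */
    private array $clients = [];

    /** @param iterable<SketchClientInterface> $clients */
    public function __construct(iterable $clients)
    {
        foreach ($clients as $client) {
            $this->clients[$client->getName()] = $client;
        }
    }

    public function create(string $name): SketchClientInterface
    {
        if (!isset($this->clients[$name])) {
            throw new InvalidArgumentException(sprintf('No client named "%s".', $name));
        }

        return $this->clients[$name];
    }
}

// Application code depends only on the interface, not a concrete client
$factory = new SketchClientFactory([new FakeClient(12)]);
$client = $factory->create('fake');

printf("Pages: %d\n", $client->countPages('/path/to/file.pdf'));
```

In a test, the fake client stands in for the real one, so no PDF file or Poppler installation is required.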

Note: A Symfony bundle is available if you wish to integrate this library into your Symfony applications with autowiring and configuration support.

Direct usage

<?php

require_once __DIR__ . '/vendor/autoload.php';

use OneToMany\PDFAI\Client\Poppler\PopplerExtractorClient;
use OneToMany\PDFAI\Contract\Enum\OutputType;
use OneToMany\PDFAI\Request\ExtractDataRequest;
use OneToMany\PDFAI\Request\ExtractTextRequest;
use OneToMany\PDFAI\Request\ReadMetadataRequest;

$filePath = '/path/to/file.pdf';

// Construct the Poppler wrapper
$client = new PopplerExtractorClient();

// Construct and execute a request to read the PDF metadata
$metadata = $client->readMetadata(new ReadMetadataRequest($filePath));

vprintf("The PDF '%s' has %d page(s).\n", [
    $filePath, $metadata->getPages(),
]);

// Construct a request to rasterize all pages as 150 DPI JPEGs
$request = new ExtractDataRequest($filePath, 1, null, OutputType::Jpg, 150);

foreach ($client->extractData($request) as $image) {
    // $image->getData() or $image->toDataUri()
    printf("MD5: %s\n", md5($image->getData()));
}

// Extract text from pages 3 and 4
$request = new ExtractTextRequest($filePath, 3, 4);

foreach ($client->extractData($request) as $text) {
    // $text->getData()
    printf("Length: %d\n", strlen($text->getData()));
}

Test suite

Run the test suite with PHPUnit:

./vendor/bin/phpunit

Static analysis

Run static analysis with PHPStan:

./vendor/bin/phpstan

Credits

Vic Cherubini, 1:N Labs, LLC

License

The MIT License
