jbo/pdf-extractor

This library helps extract content from a PDF file.

1.0.0 2025-05-09 09:00 UTC

Last update: 2025-05-09 09:10:33 UTC


A PHP library for extracting text content from PDF files with multiple extraction methods.

Overview

This library provides a flexible way to extract text from PDF files using different extraction methods. It currently supports:

  1. SmalotPdfParser - A PHP-based PDF parser
  2. Pdftotext - Command-line utility from Poppler tools

Requirements

  • PHP 8.1 or higher
  • Composer
  • For the Pdftotext extractor: Poppler tools (which provide the pdftotext binary) installed on your system
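The requirements above can be checked from PHP before installing. The following is a minimal sketch, not part of the library: the function name checkRequirements is hypothetical, and it assumes that exec() is enabled and that pdftotext is expected on the PATH under its bare command name.

```php
<?php
// Hypothetical helper, NOT part of jbo/pdf-extractor: reports whether the
// documented requirements appear to be met on this machine.
function checkRequirements(string $pdftotextBinary = 'pdftotext'): array
{
    // The README asks for PHP 8.1 or higher.
    $phpOk = version_compare(PHP_VERSION, '8.1.0', '>=');

    // Look the binary up on the PATH: `where` on Windows, `command -v` elsewhere.
    $isWindows = PHP_OS_FAMILY === 'Windows';
    $lookup = $isWindows ? 'where' : 'command -v';
    $devNull = $isWindows ? 'NUL' : '/dev/null';

    $output = [];
    $exitCode = 1;
    exec(
        sprintf('%s %s > %s 2>&1', $lookup, escapeshellarg($pdftotextBinary), $devNull),
        $output,
        $exitCode
    );

    // An exit code of 0 means the lookup command found the binary.
    return ['php' => $phpOk, 'pdftotext' => $exitCode === 0];
}

$status = checkRequirements();
echo 'PHP >= 8.1: ' . ($status['php'] ? 'yes' : 'no') . PHP_EOL;
echo 'pdftotext on PATH: ' . ($status['pdftotext'] ? 'yes' : 'no') . PHP_EOL;
```

A missing pdftotext only matters if you plan to use the Pdftotext extractor; the SmalotPdfParser extractor does not need it.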

Installation

Install via Composer:

composer require jbo/pdf-extractor

Usage

Basic Usage

<?php
require 'vendor/autoload.php';

use Jbo\PdfExtractor\PdfTextExtractor;
use Jbo\PdfExtractor\Extractor\SmalotPdfParserExtractor;

// 1. Choose an extractor
$extractor = new SmalotPdfParserExtractor();

// 2. Initialize the service
$service = new PdfTextExtractor($extractor);

// 3. Extract text from a PDF file
try {
    $text = $service->extract('/path/to/document.pdf');
    echo $text;
} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . PHP_EOL;
}

Using Pdftotext Extractor (Windows)

<?php
require 'vendor/autoload.php';

use Jbo\PdfExtractor\PdfTextExtractor;
use Jbo\PdfExtractor\Extractor\PdftotextExtractor;

// Specify the path to pdftotext.exe from Poppler for Windows
$extractor = new PdftotextExtractor('C:\\path\\to\\poppler\\bin\\pdftotext.exe');
$service = new PdfTextExtractor($extractor);

// Extract text
$text = $service->extract('/path/to/document.pdf');

Extractors

SmalotPdfParserExtractor

Uses the smalot/pdfparser library to extract text from PDF files. This is a pure PHP solution that requires no external binaries or system tools.

PdftotextExtractor

Uses the pdftotext command-line utility from Poppler tools to extract text. This method may provide better results for certain PDF files but requires the Poppler tools to be installed on your system.
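To illustrate what shelling out to pdftotext involves, here is a rough sketch of how a wrapper could call the tool. It is not the library's actual implementation: buildPdftotextCommand and runPdftotext are hypothetical names, and the sketch assumes exec() is enabled. It relies on two documented pdftotext behaviours: `-enc UTF-8` selects the output encoding, and `-` as the output file writes the text to standard output.

```php
<?php
// Hypothetical sketch, NOT the code used by PdftotextExtractor.

// Build the shell command; both the binary and the PDF path are escaped
// so that spaces or special characters cannot break the command line.
function buildPdftotextCommand(string $binary, string $pdfPath): string
{
    // The trailing "-" tells pdftotext to write the text to stdout.
    return sprintf(
        '%s -enc UTF-8 %s -',
        escapeshellarg($binary),
        escapeshellarg($pdfPath)
    );
}

// Run pdftotext and return the extracted text, or throw if the tool fails.
function runPdftotext(string $binary, string $pdfPath): string
{
    $lines = [];
    $exitCode = 0;
    exec(buildPdftotextCommand($binary, $pdfPath), $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException("pdftotext exited with code $exitCode");
    }

    // exec() returns the output line by line; join it back into one string.
    return implode("\n", $lines);
}

// Usage (assumes pdftotext is installed and the file exists):
// echo runPdftotext('pdftotext', '/path/to/document.pdf');
```

Escaping every argument matters here because file paths often come from user input.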

Error Handling

The library throws exceptions in the following cases:

  • InvalidArgumentException: When the PDF file doesn't exist or isn't readable
  • RuntimeException: When text extraction fails
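Because the two exception types signal different problems, it can be useful to catch them separately. The sketch below assumes nothing about the library beyond the two exception classes listed above: describeExtraction is a hypothetical wrapper, and the callable stands in for a call such as `fn (string $p) => $service->extract($p)`.

```php
<?php
// Hypothetical wrapper: $extract stands in for the library call, e.g.
// fn (string $p) => $service->extract($p) from the usage examples.
function describeExtraction(callable $extract, string $path): string
{
    try {
        return 'OK: ' . $extract($path);
    } catch (InvalidArgumentException $e) {
        // Missing or unreadable file: the caller should fix the path.
        return 'Bad input: ' . $e->getMessage();
    } catch (RuntimeException $e) {
        // The file was readable but extraction itself failed.
        return 'Extraction failed: ' . $e->getMessage();
    }
}

// Stand-in callables that simulate the two documented failure modes.
$missingFile = function (string $path): string {
    throw new InvalidArgumentException("File not found: $path");
};
$brokenPdf = function (string $path): string {
    throw new RuntimeException("Could not parse: $path");
};

echo describeExtraction($missingFile, 'missing.pdf') . PHP_EOL;
echo describeExtraction($brokenPdf, 'broken.pdf') . PHP_EOL;
```

Separating the cases lets an application ask the user for a different file in the first case and, for example, retry with the other extractor in the second.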

License

This library is licensed under the MIT License. See the LICENSE file for details.

Author

Jens Bourry