README

A Laravel package for extracting content from PDF files using the Poppler utilities. It provides a type-safe interface for text, HTML, image, and metadata extraction, with input validation and dedicated exceptions for error handling.

📋 Table of Contents

Overview
Features
System Requirements
Dependencies
Installation
Configuration
Usage Guide
Exception Handling
Testing
Architecture
Troubleshooting
Output Files
Best Practices
License

Overview

The PDF Reader package wraps the Poppler command-line utilities in a Laravel-friendly API. It handles PDF text extraction, HTML conversion, image extraction, and metadata retrieval, with built-in input validation and error handling.

Why This Package?

Secure: Uses Laravel's Process facade instead of unsafe shell_exec
Validated: Checks file existence, readability, and PDF format before processing
Type-Safe: Full PHP 8.2+ type hints for better IDE support
Cross-Platform: Works on Windows, macOS, and Linux
Well-Tested: Comprehensive Pest test suite included
Production-Ready: Proper exception handling and logging support

Features

Core Functionality

📄 Text Extraction - Extract plain text from PDFs with optional page ranges
🌐 HTML Conversion - Convert PDFs to HTML while preserving layout
🖼️ Image Extraction - Extract all embedded images from PDFs
ℹ️ Metadata Retrieval - Get PDF properties (author, title, page count, etc.)

Advanced Features

📑 Page Range Support - Extract specific pages (e.g., "1-5", "3-10")
✅ Input Validation - Automatic file existence and PDF format validation
🔒 Secure Execution - Uses Laravel Process facade for safe command execution
🎯 Custom Exceptions - Specific exceptions for different error scenarios
💾 File Management - Option to keep or auto-delete temporary files
🌍 Cross-Platform - Proper path handling for all operating systems

System Requirements

Required Software

PHP: 8.2 or higher
Laravel: 10.0 or higher
Poppler Utilities: All binaries must be installed and accessible

Poppler Binaries

The package requires the following Poppler command-line tools:

pdftotext - Text extraction
pdftohtml - HTML conversion
pdfinfo - Metadata retrieval
pdfimages - Image extraction

Dependencies

Installing Poppler Utilities

Ubuntu/Debian

sudo apt-get update
sudo apt-get install poppler-utils

Verify installation:

pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v

macOS

Using Homebrew:

brew install poppler

Verify installation:

which pdftotext
which pdftohtml
which pdfinfo
which pdfimages

Windows

1. Download Poppler for Windows from GitHub Releases
2. Extract the archive to a permanent location (e.g., C:\Program Files\poppler)
3. Add the bin directory to your system PATH:
   - Right-click "This PC" → Properties → Advanced system settings
   - Environment Variables → System variables → Path → Edit
   - Add: C:\Program Files\poppler\Library\bin
4. Restart your terminal/IDE

Verify installation:

pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v

Laravel Dependencies

This package uses the following Laravel features:

Illuminate\Support\Facades\Process - For secure command execution
Illuminate\Support\ServiceProvider - For package registration
Illuminate\Support\Facades\Facade - For the PdfReader facade

All dependencies are included in Laravel 10+.

Installation

Step 1: Package Location

This package is located at:

packages/shibashish/pdf-reader

It's already configured in your main composer.json under autoload-dev.

Step 2: Publish Configuration

Publish the package configuration file to your Laravel application:

php artisan vendor:publish --tag=pdf-reader-config

This creates config/pdf-reader.php with default settings.

Step 3: Configure Binary Paths (Optional)

If Poppler binaries are not in your system PATH, specify full paths in .env:

PDFTOTEXT_BINARY=/usr/bin/pdftotext
PDFTOHTML_BINARY=/usr/bin/pdftohtml
PDFINFO_BINARY=/usr/bin/pdfinfo
PDFIMAGES_BINARY=/usr/bin/pdfimages

Windows Example:

PDFTOTEXT_BINARY="C:\Program Files\poppler\Library\bin\pdftotext.exe"
PDFTOHTML_BINARY="C:\Program Files\poppler\Library\bin\pdftohtml.exe"
PDFINFO_BINARY="C:\Program Files\poppler\Library\bin\pdfinfo.exe"
PDFIMAGES_BINARY="C:\Program Files\poppler\Library\bin\pdfimages.exe"

Step 4: Create Storage Directories

The package auto-creates these directories when needed, but you can create them manually:

mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}
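
If you would rather create the directories from PHP (for example in a deployment script), a minimal sketch could look like the following. The function name `ensurePdfReaderDirs` is hypothetical and not part of the package; the base path is passed in so the example does not depend on Laravel's `storage_path()` helper.

```php
<?php

/**
 * Create the texts/htmls/images directories under $base if they are missing.
 * Hypothetical helper, not part of the package API.
 *
 * @return string[] paths of the three directories
 */
function ensurePdfReaderDirs(string $base): array
{
    $dirs = [];
    foreach (['texts', 'htmls', 'images'] as $name) {
        $dir = rtrim($base, '/\\') . DIRECTORY_SEPARATOR . $name;
        // The second is_dir() check covers a race where another process created it first.
        if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
            throw new RuntimeException("Could not create directory: {$dir}");
        }
        $dirs[] = $dir;
    }
    return $dirs;
}
```

In a Laravel application this would typically be called as `ensurePdfReaderDirs(storage_path('app/public/pdf-reader'))`.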

Configuration

Configuration File

The published config/pdf-reader.php file contains:

<?php

return [
    // Path to pdftotext binary
    'pdftotext_binary' => env('PDFTOTEXT_BINARY', 'pdftotext'),
    
    // Path to pdftohtml binary
    'pdftohtml_binary' => env('PDFTOHTML_BINARY', 'pdftohtml'),
    
    // Path to pdfinfo binary
    'pdfinfo_binary' => env('PDFINFO_BINARY', 'pdfinfo'),
    
    // Path to pdfimages binary
    'pdfimages_binary' => env('PDFIMAGES_BINARY', 'pdfimages'),
];

Configuration Options

| Key | Default | Description |
| --- | --- | --- |
| `pdftotext_binary` | `pdftotext` | Path to pdftotext executable |
| `pdftohtml_binary` | `pdftohtml` | Path to pdftohtml executable |
| `pdfinfo_binary` | `pdfinfo` | Path to pdfinfo executable |
| `pdfimages_binary` | `pdfimages` | Path to pdfimages executable |

Note: If binaries are in your system PATH, you can use just the binary name. Otherwise, provide the full absolute path.
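
To check which of the four binaries PHP can actually see on the PATH, a small diagnostic script may help. This is a sketch only: `findBinaryOnPath` is a hypothetical helper, and the package's own lookup logic is not shown in this README and may differ.

```php
<?php

/**
 * Look for an executable called $name in the directories listed in $pathEnv.
 * Hypothetical helper for diagnostics; defaults to the PATH environment variable.
 */
function findBinaryOnPath(string $name, ?string $pathEnv = null): ?string
{
    $pathEnv ??= (string) getenv('PATH');
    // Windows executables normally carry an .exe suffix; try both forms there.
    $candidates = PHP_OS_FAMILY === 'Windows' ? [$name, $name . '.exe'] : [$name];

    foreach (array_filter(explode(PATH_SEPARATOR, $pathEnv)) as $dir) {
        foreach ($candidates as $candidate) {
            $full = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $candidate;
            if (is_file($full) && is_executable($full)) {
                return $full;
            }
        }
    }
    return null; // not found: set the full path in .env instead
}

foreach (['pdftotext', 'pdftohtml', 'pdfinfo', 'pdfimages'] as $binary) {
    echo $binary, ': ', findBinaryOnPath($binary) ?? 'NOT FOUND', PHP_EOL;
}
```

Run it as the same user as your web server or queue worker, since that user's PATH can differ from your login shell's.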

Usage Guide

Import the Facade

use Shibashish\PdfReader\Facades\PdfReader;

Text Extraction

Basic Text Extraction

Extract all text from a PDF:

$text = PdfReader::extractText('/path/to/document.pdf');
echo $text; // Plain text content

Extract Specific Pages

Extract text from pages 1 to 5:

$text = PdfReader::extractText('/path/to/document.pdf', pages: '1-5');

Extract text from a single page:

$text = PdfReader::extractText('/path/to/document.pdf', pages: '3');
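
Poppler's `pdftotext` selects pages with its `-f` (first page) and `-l` (last page) options, so a range string such as "1-5" has to be split into two numbers at some point. The sketch below shows one way that mapping could work; `pageRangeToOptions` is a hypothetical name, and the package's actual parsing and validation rules are not documented here.

```php
<?php

/**
 * Turn "3" or "1-5" into pdftotext-style options: ['-f', '1', '-l', '5'].
 * Hypothetical helper; returns [] when no range is given.
 */
function pageRangeToOptions(?string $pages): array
{
    if ($pages === null || trim($pages) === '') {
        return [];
    }
    if (!preg_match('/^\s*(\d+)\s*(?:-\s*(\d+)\s*)?$/', $pages, $m)) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }
    $first = (int) $m[1];
    // A single page number means first and last page are the same.
    $last = isset($m[2]) ? (int) $m[2] : $first;
    if ($first < 1 || $last < $first) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }
    return ['-f', (string) $first, '-l', (string) $last];
}

var_dump(pageRangeToOptions('1-5') === ['-f', '1', '-l', '5']); // bool(true)
```

Rejecting anything that is not a number or a simple "first-last" pair keeps unexpected input out of the command line.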

Keep Output File

By default, temporary files are deleted. To keep them:

$text = PdfReader::extractText(
    '/path/to/document.pdf',
    keepFile: true
);
// File saved to: storage/app/public/pdf-reader/texts/pdf-text-{timestamp}.txt

Method Signature

public function extractText(
    string $pdfPath,      // Path to PDF file
    bool $keepFile = false, // Keep temporary file?
    ?string $pages = null   // Page range (e.g., "1-5")
): ?string

HTML Conversion

Basic HTML Conversion

Convert entire PDF to HTML:

$html = PdfReader::extractHtml('/path/to/document.pdf');

Convert Specific Pages

$html = PdfReader::extractHtml('/path/to/document.pdf', pages: '1-3');

Keep Output File

$html = PdfReader::extractHtml(
    '/path/to/document.pdf',
    keepFile: true
);
// File saved to: storage/app/public/pdf-reader/htmls/pdf-html-{timestamp}.html

Method Signature

public function extractHtml(
    string $pdfPath,
    bool $keepFile = false,
    ?string $pages = null
): ?string

Image Extraction

Extract All Images

$images = PdfReader::extractImages('/path/to/document.pdf');

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/temp/file.jpg',
//         'data' => <binary image data>
//     ],
//     [
//         'name' => 'pdf-img-123456789-001.png',
//         'path' => '/full/path/to/temp/file.png',
//         'data' => <binary image data>
//     ]
// ]

Keep Image Files

$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/storage/app/public/pdf-reader/images/pdf-img-123456789-000.jpg'
//     ]
// ]

Extract from Specific Pages

$images = PdfReader::extractImages('/path/to/document.pdf', pages: '1-5');

Save Images to Custom Location

$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

foreach ($images as $image) {
    // Copy to custom location
    copy($image['path'], public_path('images/' . $image['name']));
}
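
When `keepFiles` is left at its default, the result carries the binary `data` for each image, and the temporary file behind `path` has presumably been cleaned up already. In that case the data itself is what you would persist. A minimal sketch, assuming the array shape shown under "Extract All Images" (`saveImageData` is a hypothetical helper):

```php
<?php

/**
 * Write each image's binary 'data' into $targetDir and return the written paths.
 * Hypothetical helper; expects entries shaped like ['name' => ..., 'data' => ...].
 */
function saveImageData(array $images, string $targetDir): array
{
    if (!is_dir($targetDir) && !mkdir($targetDir, 0775, true) && !is_dir($targetDir)) {
        throw new RuntimeException("Could not create directory: {$targetDir}");
    }
    $written = [];
    foreach ($images as $image) {
        if (!isset($image['name'], $image['data'])) {
            continue; // e.g. results from keepFiles: true, which carry no 'data'
        }
        // basename() guards against a name that contains directory parts.
        $target = rtrim($targetDir, '/\\') . DIRECTORY_SEPARATOR . basename($image['name']);
        if (file_put_contents($target, $image['data']) === false) {
            throw new RuntimeException("Could not write: {$target}");
        }
        $written[] = $target;
    }
    return $written;
}
```

Usage would be along the lines of `saveImageData(PdfReader::extractImages($pdf), public_path('images'))`.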

Method Signature

public function extractImages(
    string $pdfPath,
    bool $keepFiles = false,
    ?string $pages = null
): array

Metadata Retrieval

Get PDF Information

$info = PdfReader::getInfo('/path/to/document.pdf');

print_r($info);
// Array
// (
//     [Title] => Sample Document
//     [Author] => John Doe
//     [Creator] => Microsoft Word
//     [Producer] => Adobe PDF Library
//     [CreationDate] => Mon Dec  9 10:30:45 2024 IST
//     [ModDate] => Mon Dec  9 11:00:00 2024 IST
//     [Tagged] => no
//     [UserProperties] => no
//     [Suspects] => no
//     [Form] => none
//     [JavaScript] => no
//     [Pages] => 25
//     [Encrypted] => no
//     [Page size] => 612 x 792 pts (letter)
//     [Page rot] => 0
//     [File size] => 1234567 bytes
//     [Optimized] => no
//     [PDF version] => 1.7
// )
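
The keys above mirror the "Key: value" lines that `pdfinfo` prints. As an illustration of how such output can be turned into an array, here is a small parser sketch. `parsePdfInfoOutput` is a hypothetical function; the package's own parser is not shown in this README.

```php
<?php

/**
 * Parse "Key: value" lines, as printed by pdfinfo, into an associative array.
 * Hypothetical helper for illustration only.
 */
function parsePdfInfoOutput(string $output): array
{
    $info = [];
    foreach (preg_split('/\R/', $output) as $line) {
        // Limit of 2 so that colons inside values (dates, times) are preserved.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2 || trim($parts[0]) === '') {
            continue; // skip blank or malformed lines
        }
        $info[trim($parts[0])] = trim($parts[1]);
    }
    return $info;
}

$sample = "Title:          Sample Document\n"
    . "CreationDate:   Mon Dec  9 10:30:45 2024 IST\n"
    . "Pages:          25\n";
$info = parsePdfInfoOutput($sample);
echo $info['CreationDate'], PHP_EOL; // Mon Dec  9 10:30:45 2024 IST
echo (int) $info['Pages'], PHP_EOL;  // 25
```

Splitting on the first colon only matters for fields such as CreationDate, whose values contain colons themselves.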

Access Specific Metadata

$info = PdfReader::getInfo('/path/to/document.pdf');

$pageCount = (int) ($info['Pages'] ?? 0); // cast in case the value is returned as a string
$author = $info['Author'] ?? 'Unknown';
$title = $info['Title'] ?? 'Untitled';

Method Signature

public function getInfo(string $pdfPath): array

Exception Handling

The package throws specific exceptions for different error scenarios.

Exception Hierarchy

Exception
└── PdfReaderException (base)
    ├── InvalidPdfException
    └── BinaryNotFoundException
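
Because both specific exceptions extend PdfReaderException, a catch block for the base class also catches them. Specific handlers therefore need to come first. The sketch below uses stand-in classes with the same names to show the effect; the status codes are illustrative, and `statusFor` is a hypothetical helper.

```php
<?php

// Stand-in classes mirroring the hierarchy above (the real ones live in
// Shibashish\PdfReader\Exceptions).
class PdfReaderException extends Exception {}
class InvalidPdfException extends PdfReaderException {}
class BinaryNotFoundException extends PdfReaderException {}

/** Run $operation and map any package exception to an HTTP status code. */
function statusFor(callable $operation): int
{
    try {
        $operation();
        return 200;
    } catch (InvalidPdfException $e) {
        return 400; // bad input from the caller
    } catch (BinaryNotFoundException $e) {
        return 500; // server misconfiguration
    } catch (PdfReaderException $e) {
        return 500; // any other extraction failure; must come last
    }
}

echo statusFor(fn () => throw new InvalidPdfException('not a PDF')), PHP_EOL; // 400
```

If the PdfReaderException block were listed first, the two specific blocks would never be reached.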

InvalidPdfException

Thrown when:

File doesn't exist
File is not readable
File is not a valid PDF

use Shibashish\PdfReader\Exceptions\InvalidPdfException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (InvalidPdfException $e) {
    echo $e->getMessage();
    // "The file '/path/to/file.pdf' does not exist."
    // "The file '/path/to/file.pdf' is not readable."
    // "The file '/path/to/file.pdf' is not a valid PDF."
}

BinaryNotFoundException

Thrown when a required Poppler binary is not found:

use Shibashish\PdfReader\Exceptions\BinaryNotFoundException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (BinaryNotFoundException $e) {
    echo $e->getMessage();
    // "The required binary 'pdftotext' was not found or is not executable."
}

PdfReaderException

Thrown for general extraction errors:

use Shibashish\PdfReader\Exceptions\PdfReaderException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (PdfReaderException $e) {
    echo $e->getMessage();
    // "Failed to extract text: [error details]"
}

Complete Exception Handling

use Illuminate\Support\Facades\Log;
use Shibashish\PdfReader\Facades\PdfReader;
use Shibashish\PdfReader\Exceptions\{
    InvalidPdfException,
    BinaryNotFoundException,
    PdfReaderException
};

try {
    $text = PdfReader::extractText($pdfPath);
    
} catch (InvalidPdfException $e) {
    // Handle invalid file
    Log::error('Invalid PDF file', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Invalid PDF file'], 400);
    
} catch (BinaryNotFoundException $e) {
    // Handle missing binary
    Log::critical('PDF binary not found', ['error' => $e->getMessage()]);
    return response()->json(['error' => 'Server configuration error'], 500);
    
} catch (PdfReaderException $e) {
    // Handle extraction error
    Log::error('PDF extraction failed', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Failed to process PDF'], 500);
}

Testing

The package includes comprehensive Pest tests.

Run Package Tests

From your Laravel project root:

# Run only PDF Reader tests
php artisan test --filter=PdfReader

# Run all tests
php artisan test

Test Coverage

The test suite covers:

✅ Text extraction with validation
✅ HTML conversion with page ranges
✅ Metadata retrieval and parsing
✅ Image extraction
✅ Exception handling (invalid files, missing binaries)
✅ Directory creation
✅ Cross-platform path handling

Example Test Output

PASS  Tests\Feature\PdfReaderTest
✓ extract text runs correct command
✓ extract text with page range
✓ get info returns parsed data
✓ throws exception if file not found
✓ throws exception if not a pdf
✓ creates output directory
✓ extract images returns array

Tests:  7 passed (13 assertions)
Duration: 1.14s

Architecture

Package Structure

packages/shibashish/pdf-reader/
├── config/
│   └── pdf-reader.php          # Configuration file
├── src/
│   ├── Exceptions/
│   │   ├── PdfReaderException.php
│   │   ├── InvalidPdfException.php
│   │   └── BinaryNotFoundException.php
│   ├── Facades/
│   │   └── PdfReader.php        # Laravel facade
│   ├── PdfReaderService.php     # Main service class
│   └── PdfReaderServiceProvider.php
├── tests/
│   └── PdfReaderTest.php        # Pest tests
├── composer.json
└── README.md

Service Provider

The PdfReaderServiceProvider registers the service as a singleton:

$this->app->singleton('pdf-reader', function () {
    return new PdfReaderService;
});

Facade

The PdfReader facade provides static access:

PdfReader::extractText($path);
// Resolves to: app('pdf-reader')->extractText($path);
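
To make the forwarding idea concrete, here is a toy container and facade in plain PHP. It is only a sketch of the mechanism: all class names (`TinyContainer`, `TinyFacade`, `FakeReaderService`, `FakeReader`) are hypothetical, and Laravel's real Container and Facade classes do considerably more.

```php
<?php

final class TinyContainer
{
    private array $factories = [];
    private array $instances = [];

    public function singleton(string $key, callable $factory): void
    {
        $this->factories[$key] = $factory;
    }

    public function make(string $key): object
    {
        // Build on first use, then hand back the same instance every time.
        return $this->instances[$key] ??= ($this->factories[$key])();
    }
}

abstract class TinyFacade
{
    public static TinyContainer $app;

    abstract protected static function accessor(): string;

    public static function __callStatic(string $method, array $args): mixed
    {
        // Named arguments arrive as string keys and are passed through as-is.
        return static::$app->make(static::accessor())->$method(...$args);
    }
}

final class FakeReaderService
{
    public function extractText(string $pdfPath, bool $keepFile = false, ?string $pages = null): string
    {
        return "text of {$pdfPath}" . ($pages !== null ? " (pages {$pages})" : '');
    }
}

final class FakeReader extends TinyFacade
{
    protected static function accessor(): string
    {
        return 'pdf-reader';
    }
}

TinyFacade::$app = new TinyContainer();
TinyFacade::$app->singleton('pdf-reader', fn () => new FakeReaderService());

echo FakeReader::extractText('/a.pdf', pages: '1-5'), PHP_EOL; // text of /a.pdf (pages 1-5)
```

The same forwarding is what lets calls such as `PdfReader::extractText($path, pages: '1-5')` use named arguments through the facade.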

Service Class

PdfReaderService handles all PDF operations:

Input validation
Command building
Process execution
Error handling
Output parsing
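
The command-building step is where most of the safety comes from: passing the command as a list of arguments, rather than one concatenated shell string, means file names with spaces or quotes need no escaping. The following sketch shows the idea for `pdftotext`. `buildPdftotextCommand` is a hypothetical helper, and the package's real implementation may be structured differently.

```php
<?php

/**
 * Build a pdftotext invocation as an argument list rather than a shell string.
 * Hypothetical helper; -f and -l are pdftotext's first/last page options.
 */
function buildPdftotextCommand(
    string $binary,
    string $pdfPath,
    string $outputPath,
    ?int $firstPage = null,
    ?int $lastPage = null
): array {
    $command = [$binary];
    if ($firstPage !== null) {
        array_push($command, '-f', (string) $firstPage);
    }
    if ($lastPage !== null) {
        array_push($command, '-l', (string) $lastPage);
    }
    // Paths are separate array elements, so spaces or quotes in them need no escaping.
    array_push($command, $pdfPath, $outputPath);
    return $command;
}

print_r(buildPdftotextCommand('pdftotext', '/tmp/my file.pdf', '/tmp/out.txt', 1, 5));
```

An array like this can be handed to a process runner that accepts argument lists, such as Symfony's Process component, which Laravel's Process facade builds on.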

Troubleshooting

Binary Not Found

Error: BinaryNotFoundException: The required binary 'pdftotext' was not found

Solutions:

Verify Poppler is installed: which pdftotext (Linux/Mac) or where pdftotext (Windows)
Add binary paths to .env:
```
PDFTOTEXT_BINARY=/usr/bin/pdftotext
```
Ensure binaries are in system PATH

Permission Denied

Error: InvalidPdfException: The file is not readable

Solutions:

Check file permissions: ls -la /path/to/file.pdf
Ensure web server user has read access:
```
chmod 644 /path/to/file.pdf
```

Invalid PDF

Error: InvalidPdfException: The file is not a valid PDF

Solutions:

Verify file is actually a PDF: file /path/to/file.pdf
Check file isn't corrupted
Ensure file has proper PDF header (%PDF-)
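
The header check can also be done from PHP. The sketch below reads the first five bytes and compares them with the `%PDF-` signature. `hasPdfHeader` is a hypothetical helper and only a quick sanity check: it does not prove the rest of the file is intact, and it rejects files that have stray bytes before the signature even though some PDF readers tolerate those.

```php
<?php

/**
 * Return true when the file starts with the "%PDF-" signature.
 * Hypothetical helper; this is a quick sanity check, not full validation.
 */
function hasPdfHeader(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $header = fread($handle, 5); // "%PDF-" is five bytes long
    fclose($handle);
    return $header === '%PDF-';
}
```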

Output Directory Not Created

Error: Permission issues with storage/app/public/pdf-reader

Solutions:

Ensure storage directory is writable:

chmod -R 775 storage
chown -R www-data:www-data storage  # adjust user:group to match your web server

Create directories manually:

mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}

Windows Path Issues

Error: Mixed path separators causing issues

Solution: The package uses DIRECTORY_SEPARATOR for cross-platform compatibility. Ensure you're using the latest version.

Output Files

Storage Locations

When keepFile: true or keepFiles: true is passed, extracted files are saved to:

| Type | Location |
| --- | --- |
| Text | `storage/app/public/pdf-reader/texts/` |
| HTML | `storage/app/public/pdf-reader/htmls/` |
| Images | `storage/app/public/pdf-reader/images/` |

File Naming Convention

Text: pdf-text-{timestamp}.txt
HTML: pdf-html-{timestamp}.html
Images: pdf-img-{timestamp}-{number}.{ext}
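
If you need to sort or group saved files, the names can be taken apart again. The sketch below assumes `{timestamp}` and `{number}` are plain digits, as in the sample names elsewhere in this README; `parseOutputName` is a hypothetical helper.

```php
<?php

/**
 * Split a saved file name into its type, timestamp and (for images) number.
 * Hypothetical helper; returns null for names that do not follow the patterns.
 */
function parseOutputName(string $fileName): ?array
{
    if (preg_match('/^pdf-text-(\d+)\.txt$/', $fileName, $m)) {
        return ['type' => 'text', 'timestamp' => (int) $m[1], 'number' => null];
    }
    if (preg_match('/^pdf-html-(\d+)\.html$/', $fileName, $m)) {
        return ['type' => 'html', 'timestamp' => (int) $m[1], 'number' => null];
    }
    // Images carry an extra counter and an extension that depends on the image format.
    if (preg_match('/^pdf-img-(\d+)-(\d+)\.[A-Za-z0-9]+$/', $fileName, $m)) {
        return ['type' => 'image', 'timestamp' => (int) $m[1], 'number' => (int) $m[2]];
    }
    return null;
}

print_r(parseOutputName('pdf-img-123456789-001.png'));
```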

Accessing Saved Files

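// extractText() returns the extracted text, not the path of the saved file
$text = PdfReader::extractText($path, keepFile: true);

// The {timestamp} is chosen by the package, so a later time() call may not match it.
// One workaround (an assumption, not a package API): pick the newest saved file.
$files = glob(storage_path('app/public/pdf-reader/texts/pdf-text-*.txt')) ?: [];
usort($files, fn ($a, $b) => filemtime($b) <=> filemtime($a));
$filePath = $files[0] ?? null;

// Public URL (requires the storage symlink: php artisan storage:link)
$url = $filePath ? asset('storage/pdf-reader/texts/' . basename($filePath)) : null;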

Best Practices

1. Always Handle Exceptions

try {
    $result = PdfReader::extractText($path);
} catch (PdfReaderException $e) {
    // Log and handle appropriately
}

2. Validate Input Before Processing

if (!file_exists($path)) {
    throw new \InvalidArgumentException('File not found');
}

$text = PdfReader::extractText($path);

3. Clean Up Temporary Files

// Default behavior - auto-cleanup
$text = PdfReader::extractText($path); // Temp file deleted

// Or manually manage
$text = PdfReader::extractText($path, keepFile: true);
// Process the file...
// Then delete manually if needed
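
For kept files, manual deletion can be as simple as removing everything older than a cutoff. A minimal sketch, assuming the files sit directly in one directory (`pruneOldFiles` is a hypothetical helper, not part of the package):

```php
<?php

/**
 * Delete regular files in $dir whose modification time is older than $maxAgeSeconds.
 * Hypothetical helper; returns the number of files removed.
 */
function pruneOldFiles(string $dir, int $maxAgeSeconds, ?int $now = null): int
{
    $now ??= time();
    $removed = 0;
    foreach (glob(rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . '*') ?: [] as $file) {
        // Subdirectories are skipped; only plain files are considered.
        if (is_file($file) && ($now - filemtime($file)) > $maxAgeSeconds && unlink($file)) {
            $removed++;
        }
    }
    return $removed;
}
```

Calling `pruneOldFiles(storage_path('app/public/pdf-reader/texts'), 3600)` from a scheduled task would remove text files older than one hour.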

4. Use Page Ranges for Large PDFs

// Extract first 10 pages only
$text = PdfReader::extractText($largePdf, pages: '1-10');

5. Configure Binaries in Environment

# Development
PDFTOTEXT_BINARY=pdftotext

# Production (absolute paths)
PDFTOTEXT_BINARY=/usr/bin/pdftotext

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
