shibashish/pdf-reader

A comprehensive Laravel package for extracting text, HTML, images, and metadata from PDF files using Poppler utilities.

v1.0.2 2025-12-09 09:25 UTC

Last update: 2025-12-09 09:48:04 UTC


A comprehensive, production-ready Laravel package for extracting content from PDF files using Poppler utilities. This package provides a secure, type-safe interface for PDF manipulation with extensive error handling and validation.

Overview

The PDF Reader Package wraps the powerful Poppler command-line utilities in a clean, Laravel-friendly API. It handles PDF text extraction, HTML conversion, image extraction, and metadata retrieval with built-in validation, security, and error handling.

Why This Package?

  • Secure: Uses Laravel's Process facade instead of unsafe shell_exec
  • Validated: Checks file existence, readability, and PDF format before processing
  • Type-Safe: Full PHP 8.2+ type hints for better IDE support
  • Cross-Platform: Works on Windows, macOS, and Linux
  • Well-Tested: Comprehensive Pest test suite included
  • Production-Ready: Proper exception handling and logging support

Features

Core Functionality

  • 📄 Text Extraction - Extract plain text from PDFs with optional page ranges
  • 🌐 HTML Conversion - Convert PDFs to HTML while preserving layout
  • 🖼️ Image Extraction - Extract all embedded images from PDFs
  • ℹ️ Metadata Retrieval - Get PDF properties (author, title, page count, etc.)

Advanced Features

  • 📑 Page Range Support - Extract specific pages (e.g., "1-5", "3-10")
  • ✅ Input Validation - Automatic file existence and PDF format validation
  • 🔒 Secure Execution - Uses Laravel Process facade for safe command execution
  • 🎯 Custom Exceptions - Specific exceptions for different error scenarios
  • 💾 File Management - Option to keep or auto-delete temporary files
  • 🌍 Cross-Platform - Proper path handling for all operating systems

System Requirements

Required Software

  • PHP: 8.2 or higher
  • Laravel: 10.0 or higher
  • Poppler Utilities: All binaries must be installed and accessible

Poppler Binaries

The package requires the following Poppler command-line tools:

  • pdftotext - Text extraction
  • pdftohtml - HTML conversion
  • pdfinfo - Metadata retrieval
  • pdfimages - Image extraction

Dependencies

Installing Poppler Utilities

Ubuntu/Debian

sudo apt-get update
sudo apt-get install poppler-utils

Verify installation:

pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v

macOS

Using Homebrew:

brew install poppler

Verify installation:

which pdftotext
which pdftohtml
which pdfinfo
which pdfimages

Windows

  1. Download Poppler for Windows from GitHub Releases
  2. Extract the archive to a permanent location (e.g., C:\Program Files\poppler)
  3. Add the bin directory to your system PATH:
    • Right-click "This PC" → Properties → Advanced system settings
    • Environment Variables → System variables → Path → Edit
    • Add: C:\Program Files\poppler\Library\bin
  4. Restart your terminal/IDE

Verify installation:

pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v

Laravel Dependencies

This package uses the following Laravel features:

  • Illuminate\Support\Facades\Process - For secure command execution
  • Illuminate\Support\ServiceProvider - For package registration
  • Illuminate\Support\Facades\Facade - For the PdfReader facade

All dependencies are included in Laravel 10+.
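The security benefit of the Process facade comes from passing the command as an array of arguments rather than a single shell string. The sketch below illustrates that idea only; `buildPdfToTextCommand` and its parameters are hypothetical names, not part of this package. The `-f` and `-l` options are pdftotext's first-page and last-page flags.

```php
<?php

/**
 * Hypothetical helper: builds the argument list for a pdftotext call.
 * -f and -l are pdftotext's first-page and last-page options.
 */
function buildPdfToTextCommand(
    string $binary,
    string $pdfPath,
    string $outputPath,
    ?int $firstPage = null,
    ?int $lastPage = null
): array {
    $command = [$binary];

    if ($firstPage !== null) {
        array_push($command, '-f', (string) $firstPage);
    }
    if ($lastPage !== null) {
        array_push($command, '-l', (string) $lastPage);
    }

    // Each path is one array element, so spaces or shell metacharacters
    // in a file name are never interpreted by a shell.
    $command[] = $pdfPath;
    $command[] = $outputPath;

    return $command;
}

// Inside a Laravel application the array could be passed to the facade:
// $result = \Illuminate\Support\Facades\Process::run($command);
// if ($result->failed()) { /* inspect $result->errorOutput() */ }
```

A file name such as `my file; rm -rf.pdf` stays a single argument in this form, which is the main reason to avoid string concatenation with shell_exec.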

Installation

Step 1: Package Location

This package is located at:

packages/shibashish/pdf-reader

It's already configured in your main composer.json under autoload-dev.

Step 2: Publish Configuration

Publish the package configuration file to your Laravel application:

php artisan vendor:publish --tag=pdf-reader-config

This creates config/pdf-reader.php with default settings.

Step 3: Configure Binary Paths (Optional)

If Poppler binaries are not in your system PATH, specify full paths in .env:

PDFTOTEXT_BINARY=/usr/bin/pdftotext
PDFTOHTML_BINARY=/usr/bin/pdftohtml
PDFINFO_BINARY=/usr/bin/pdfinfo
PDFIMAGES_BINARY=/usr/bin/pdfimages

Windows Example:

PDFTOTEXT_BINARY="C:\Program Files\poppler\Library\bin\pdftotext.exe"
PDFTOHTML_BINARY="C:\Program Files\poppler\Library\bin\pdftohtml.exe"
PDFINFO_BINARY="C:\Program Files\poppler\Library\bin\pdfinfo.exe"
PDFIMAGES_BINARY="C:\Program Files\poppler\Library\bin\pdfimages.exe"

Step 4: Create Storage Directories

The package auto-creates these directories when needed, but you can create them manually:

mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}

Configuration

Configuration File

The published config/pdf-reader.php file contains:

<?php

return [
    // Path to pdftotext binary
    'pdftotext_binary' => env('PDFTOTEXT_BINARY', 'pdftotext'),
    
    // Path to pdftohtml binary
    'pdftohtml_binary' => env('PDFTOHTML_BINARY', 'pdftohtml'),
    
    // Path to pdfinfo binary
    'pdfinfo_binary' => env('PDFINFO_BINARY', 'pdfinfo'),
    
    // Path to pdfimages binary
    'pdfimages_binary' => env('PDFIMAGES_BINARY', 'pdfimages'),
];

Configuration Options

| Key | Default | Description |
|---|---|---|
| pdftotext_binary | pdftotext | Path to pdftotext executable |
| pdftohtml_binary | pdftohtml | Path to pdftohtml executable |
| pdfinfo_binary | pdfinfo | Path to pdfinfo executable |
| pdfimages_binary | pdfimages | Path to pdfimages executable |

Note: If binaries are in your system PATH, you can use just the binary name. Otherwise, provide the full absolute path.

Usage Guide

Import the Facade

use Shibashish\PdfReader\Facades\PdfReader;

Text Extraction

Basic Text Extraction

Extract all text from a PDF:

$text = PdfReader::extractText('/path/to/document.pdf');
echo $text; // Plain text content

Extract Specific Pages

Extract text from pages 1 to 5:

$text = PdfReader::extractText('/path/to/document.pdf', pages: '1-5');

Extract text from a single page:

$text = PdfReader::extractText('/path/to/document.pdf', pages: '3');
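The page strings used above ("1-5", "3") have to be turned into a first and last page before they reach pdftotext. How the package does this internally is not shown here, so the following is only a sketch under that assumption; `parsePageRange` is a hypothetical helper name.

```php
<?php

/**
 * Hypothetical helper: turns "3" or "1-5" into [first, last] page numbers.
 * Throws InvalidArgumentException for anything else.
 */
function parsePageRange(string $pages): array
{
    if (!preg_match('/^(\d+)(?:-(\d+))?$/', trim($pages), $m)) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }

    $first = (int) $m[1];
    // A single page such as "3" is treated as the range 3..3.
    $last = isset($m[2]) ? (int) $m[2] : $first;

    if ($first < 1 || $last < $first) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }

    return [$first, $last];
}
```

Validating the string up front means a malformed range fails with a clear message instead of surfacing as an opaque binary error.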

Keep Output File

By default, temporary files are deleted. To keep them:

$text = PdfReader::extractText(
    '/path/to/document.pdf',
    keepFile: true
);
// File saved to: storage/app/public/pdf-reader/texts/pdf-text-{timestamp}.txt

Method Signature

public function extractText(
    string $pdfPath,      // Path to PDF file
    bool $keepFile = false, // Keep temporary file?
    ?string $pages = null   // Page range (e.g., "1-5")
): ?string

HTML Conversion

Basic HTML Conversion

Convert entire PDF to HTML:

$html = PdfReader::extractHtml('/path/to/document.pdf');

Convert Specific Pages

$html = PdfReader::extractHtml('/path/to/document.pdf', pages: '1-3');

Keep Output File

$html = PdfReader::extractHtml(
    '/path/to/document.pdf',
    keepFile: true
);
// File saved to: storage/app/public/pdf-reader/htmls/pdf-html-{timestamp}.html

Method Signature

public function extractHtml(
    string $pdfPath,
    bool $keepFile = false,
    ?string $pages = null
): ?string

Image Extraction

Extract All Images

$images = PdfReader::extractImages('/path/to/document.pdf');

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/temp/file.jpg',
//         'data' => <binary image data>
//     ],
//     [
//         'name' => 'pdf-img-123456789-001.png',
//         'path' => '/full/path/to/temp/file.png',
//         'data' => <binary image data>
//     ]
// ]
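When files are not kept, the binary `data` entry is the only lasting copy of each image. A minimal sketch of persisting it, assuming the array shape shown above; `saveImageData` and the target directory are hypothetical and not part of the package:

```php
<?php

/**
 * Writes each image's binary 'data' into $targetDir under its 'name'.
 * Returns the list of paths written. Entries without 'data' are skipped.
 */
function saveImageData(array $images, string $targetDir): array
{
    if (!is_dir($targetDir) && !mkdir($targetDir, 0755, true) && !is_dir($targetDir)) {
        throw new RuntimeException("Cannot create directory: {$targetDir}");
    }

    $written = [];
    foreach ($images as $image) {
        if (!isset($image['name'], $image['data'])) {
            continue;
        }
        // basename() guards against a name that contains path segments.
        $target = rtrim($targetDir, '/\\') . DIRECTORY_SEPARATOR . basename($image['name']);
        if (file_put_contents($target, $image['data']) === false) {
            throw new RuntimeException("Cannot write file: {$target}");
        }
        $written[] = $target;
    }

    return $written;
}
```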

Keep Image Files

$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/storage/app/public/pdf-reader/images/pdf-img-123456789-000.jpg'
//     ]
// ]

Extract from Specific Pages

$images = PdfReader::extractImages('/path/to/document.pdf', pages: '1-5');

Save Images to Custom Location

$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

foreach ($images as $image) {
    // Copy to custom location
    copy($image['path'], public_path('images/' . $image['name']));
}

Method Signature

public function extractImages(
    string $pdfPath,
    bool $keepFiles = false,
    ?string $pages = null
): array

Metadata Retrieval

Get PDF Information

$info = PdfReader::getInfo('/path/to/document.pdf');

print_r($info);
// Array
// (
//     [Title] => Sample Document
//     [Author] => John Doe
//     [Creator] => Microsoft Word
//     [Producer] => Adobe PDF Library
//     [CreationDate] => Mon Dec  9 10:30:45 2024 IST
//     [ModDate] => Mon Dec  9 11:00:00 2024 IST
//     [Tagged] => no
//     [UserProperties] => no
//     [Suspects] => no
//     [Form] => none
//     [JavaScript] => no
//     [Pages] => 25
//     [Encrypted] => no
//     [Page size] => 612 x 792 pts (letter)
//     [Page rot] => 0
//     [File size] => 1234567 bytes
//     [Optimized] => no
//     [PDF version] => 1.7
// )
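pdfinfo prints one "Key: value" pair per line, and the array above is the parsed form of that output. The package's own parser is not shown in this README; the sketch below is one plausible way to do it, with `parsePdfInfo` as a hypothetical name.

```php
<?php

/**
 * Parses "Key: value" lines, as printed by pdfinfo, into an array.
 * Only the first colon splits the line, so values such as dates keep theirs.
 */
function parsePdfInfo(string $output): array
{
    $info = [];
    foreach (preg_split('/\R/', $output) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // not a key/value line
        }
        $key = trim($parts[0]);
        if ($key !== '') {
            $info[$key] = trim($parts[1]);
        }
    }

    return $info;
}
```

Limiting the split to the first colon matters for fields like CreationDate, whose values contain colons of their own.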

Access Specific Metadata

$info = PdfReader::getInfo('/path/to/document.pdf');

$pageCount = (int) ($info['Pages'] ?? 0); // cast in case the value is a string
$author = $info['Author'] ?? 'Unknown';
$title = $info['Title'] ?? 'Untitled';

Method Signature

public function getInfo(string $pdfPath): array

Exception Handling

The package throws specific exceptions for different error scenarios.

Exception Hierarchy

Exception
└── PdfReaderException (base)
    ├── InvalidPdfException
    └── BinaryNotFoundException
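Because both specific exceptions extend the base class, the order of catch blocks decides which handler runs. The sketch below uses stand-in classes declared locally so that it runs on its own; they only mirror the hierarchy above and are not the package's classes, and `classify` is a hypothetical helper.

```php
<?php

// Stand-in classes that mirror the documented hierarchy. They are declared
// here only so the sketch runs without the package installed.
class PdfReaderException extends Exception {}
class InvalidPdfException extends PdfReaderException {}
class BinaryNotFoundException extends PdfReaderException {}

// Maps a thrown exception to a label, most specific type first.
function classify(callable $operation): string
{
    try {
        $operation();
        return 'ok';
    } catch (InvalidPdfException $e) {
        return 'invalid-pdf';
    } catch (BinaryNotFoundException $e) {
        return 'missing-binary';
    } catch (PdfReaderException $e) {
        // Catches the base type and any subclass not handled above.
        return 'extraction-failed';
    }
}
```

If the base-class catch came first, it would also swallow the two subclasses and the specific handlers would never run.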

InvalidPdfException

Thrown when:

  • File doesn't exist
  • File is not readable
  • File is not a valid PDF

use Shibashish\PdfReader\Exceptions\InvalidPdfException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (InvalidPdfException $e) {
    echo $e->getMessage();
    // "The file '/path/to/file.pdf' does not exist."
    // "The file '/path/to/file.pdf' is not readable."
    // "The file '/path/to/file.pdf' is not a valid PDF."
}

BinaryNotFoundException

Thrown when a required Poppler binary is not found:

use Shibashish\PdfReader\Exceptions\BinaryNotFoundException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (BinaryNotFoundException $e) {
    echo $e->getMessage();
    // "The required binary 'pdftotext' was not found or is not executable."
}

PdfReaderException

Thrown for general extraction errors:

use Shibashish\PdfReader\Exceptions\PdfReaderException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (PdfReaderException $e) {
    echo $e->getMessage();
    // "Failed to extract text: [error details]"
}

Complete Exception Handling

use Illuminate\Support\Facades\Log;
use Shibashish\PdfReader\Facades\PdfReader;
use Shibashish\PdfReader\Exceptions\{
    InvalidPdfException,
    BinaryNotFoundException,
    PdfReaderException
};

try {
    $text = PdfReader::extractText($pdfPath);
    
} catch (InvalidPdfException $e) {
    // Handle invalid file
    Log::error('Invalid PDF file', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Invalid PDF file'], 400);
    
} catch (BinaryNotFoundException $e) {
    // Handle missing binary
    Log::critical('PDF binary not found', ['error' => $e->getMessage()]);
    return response()->json(['error' => 'Server configuration error'], 500);
    
} catch (PdfReaderException $e) {
    // Handle extraction error
    Log::error('PDF extraction failed', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Failed to process PDF'], 500);
}

Testing

The package includes comprehensive Pest tests.

Run Package Tests

From your Laravel project root:

# Run only PDF Reader tests
php artisan test --filter=PdfReader

# Run all tests
php artisan test

Test Coverage

The test suite covers:

  • ✅ Text extraction with validation
  • ✅ HTML conversion with page ranges
  • ✅ Metadata retrieval and parsing
  • ✅ Image extraction
  • ✅ Exception handling (invalid files, missing binaries)
  • ✅ Directory creation
  • ✅ Cross-platform path handling

Example Test Output

PASS  Tests\Feature\PdfReaderTest
✓ extract text runs correct command
✓ extract text with page range
✓ get info returns parsed data
✓ throws exception if file not found
✓ throws exception if not a pdf
✓ creates output directory
✓ extract images returns array

Tests:  7 passed (13 assertions)
Duration: 1.14s

Architecture

Package Structure

packages/shibashish/pdf-reader/
├── config/
│   └── pdf-reader.php          # Configuration file
├── src/
│   ├── Exceptions/
│   │   ├── PdfReaderException.php
│   │   ├── InvalidPdfException.php
│   │   └── BinaryNotFoundException.php
│   ├── Facades/
│   │   └── PdfReader.php        # Laravel facade
│   ├── PdfReaderService.php     # Main service class
│   └── PdfReaderServiceProvider.php
├── tests/
│   └── PdfReaderTest.php        # Pest tests
├── composer.json
└── README.md

Service Provider

The PdfReaderServiceProvider registers the service as a singleton:

$this->app->singleton('pdf-reader', function () {
    return new PdfReaderService;
});

Facade

The PdfReader facade provides static access:

PdfReader::extractText($path);
// Resolves to: app('pdf-reader')->extractText($path);

Service Class

PdfReaderService handles all PDF operations:

  • Input validation
  • Command building
  • Process execution
  • Error handling
  • Output parsing

Troubleshooting

Binary Not Found

Error: BinaryNotFoundException: The required binary 'pdftotext' was not found

Solutions:

  1. Verify Poppler is installed: which pdftotext (Linux/Mac) or where pdftotext (Windows)
  2. Add binary paths to .env:
    PDFTOTEXT_BINARY=/usr/bin/pdftotext
  3. Ensure binaries are in system PATH
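The checks above can also be scripted. The following sketch looks a binary up the way a shell would; `locateBinary` is a hypothetical helper, not a function the package provides, and it makes no claim about how the package itself detects missing binaries.

```php
<?php

/**
 * Hypothetical helper: returns the full path of $binary, or null.
 * A value containing a directory separator is checked as given; a bare
 * name is searched in each directory of $pathEnv (default: PATH).
 */
function locateBinary(string $binary, ?string $pathEnv = null): ?string
{
    if (strpbrk($binary, '/\\') !== false) {
        return is_file($binary) && is_executable($binary) ? $binary : null;
    }

    $pathEnv = $pathEnv ?? (getenv('PATH') ?: '');
    // Windows executables usually carry an .exe suffix.
    $suffixes = DIRECTORY_SEPARATOR === '\\' ? ['.exe', ''] : [''];

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($suffixes as $suffix) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $binary . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }

    return null;
}
```

Running it under the web server's user is useful, since that user's PATH often differs from the one in an interactive terminal.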

Permission Denied

Error: InvalidPdfException: The file is not readable

Solutions:

  1. Check file permissions: ls -la /path/to/file.pdf
  2. Ensure web server user has read access:
    chmod 644 /path/to/file.pdf

Invalid PDF

Error: InvalidPdfException: The file is not a valid PDF

Solutions:

  1. Verify file is actually a PDF: file /path/to/file.pdf
  2. Check file isn't corrupted
  3. Ensure file has proper PDF header (%PDF-)
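The header check in step 3 can be done from PHP as well. This is a sketch of one way to test for the `%PDF-` marker; `hasPdfHeader` is a hypothetical name, and the 1024-byte window is an assumption chosen because some PDF readers accept a header that is not at the very first byte.

```php
<?php

/**
 * Returns true when "%PDF-" appears within the first 1024 bytes of the file.
 * The window (rather than an exact offset-0 match) is a lenient assumption.
 */
function hasPdfHeader(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    $head = fread($handle, 1024);
    fclose($handle);

    return $head !== false && strpos($head, '%PDF-') !== false;
}
```

A file that fails this check but has a .pdf extension is usually a renamed or truncated upload.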

Output Directory Not Created

Error: Permission issues with storage/app/public/pdf-reader

Solutions:

  1. Ensure storage directory is writable:
    chmod -R 775 storage
    chown -R www-data:www-data storage
  2. Create directories manually:
    mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}

Windows Path Issues

Error: Mixed path separators causing issues

Solution: The package uses DIRECTORY_SEPARATOR for cross-platform compatibility. Ensure you're using the latest version.

Output Files

Storage Locations

When keepFile: true or keepFiles: true is passed, extracted files are saved to:

| Type | Location |
|---|---|
| Text | storage/app/public/pdf-reader/texts/ |
| HTML | storage/app/public/pdf-reader/htmls/ |
| Images | storage/app/public/pdf-reader/images/ |

File Naming Convention

  • Text: pdf-text-{timestamp}.txt
  • HTML: pdf-html-{timestamp}.html
  • Images: pdf-img-{timestamp}-{number}.{ext}

Accessing Saved Files

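// Text file
$text = PdfReader::extractText($path, keepFile: true);

// The saved name embeds the timestamp taken during extraction, so calling
// time() afterwards may not reproduce it. Look up the newest file instead.
$files = glob(storage_path('app/public/pdf-reader/texts/pdf-text-*.txt')) ?: [];
usort($files, fn ($a, $b) => filemtime($b) <=> filemtime($a));
$filePath = $files[0] ?? null;

// Make publicly accessible (requires: php artisan storage:link)
$url = $filePath ? asset('storage/pdf-reader/texts/' . basename($filePath)) : null;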

Best Practices

1. Always Handle Exceptions

try {
    $result = PdfReader::extractText($path);
} catch (PdfReaderException $e) {
    // Log and handle appropriately
}

2. Validate Input Before Processing

if (!file_exists($path)) {
    throw new \InvalidArgumentException('File not found');
}

$text = PdfReader::extractText($path);

3. Clean Up Temporary Files

// Default behavior - auto-cleanup
$text = PdfReader::extractText($path); // Temp file deleted

// Or manually manage
$text = PdfReader::extractText($path, keepFile: true);
// Process the file...
// Then delete manually if needed

4. Use Page Ranges for Large PDFs

// Extract first 10 pages only
$text = PdfReader::extractText($largePdf, pages: '1-10');
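For a very large document, the same idea extends to processing the whole file in fixed-size batches. The sketch below only generates the range strings; `pageChunks` is a hypothetical helper, and the commented usage assumes the package accepts a range whose first and last page are equal.

```php
<?php

/**
 * Splits pages 1..$totalPages into "first-last" range strings of at most
 * $chunkSize pages each.
 */
function pageChunks(int $totalPages, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }

    $ranges = [];
    for ($first = 1; $first <= $totalPages; $first += $chunkSize) {
        $last = min($first + $chunkSize - 1, $totalPages);
        $ranges[] = "{$first}-{$last}";
    }

    return $ranges;
}

// Possible use with the facade (inside a Laravel application):
// $total = (int) (PdfReader::getInfo($largePdf)['Pages'] ?? 0);
// foreach (pageChunks($total, 10) as $range) {
//     $text = PdfReader::extractText($largePdf, pages: $range);
// }
```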

5. Configure Binaries in Environment

# Development
PDFTOTEXT_BINARY=pdftotext

# Production (absolute paths)
PDFTOTEXT_BINARY=/usr/bin/pdftotext

License

MIT License

Copyright (c) 2024 Shibashish

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.