shibashish / pdf-reader
A comprehensive Laravel package for extracting text, HTML, images, and metadata from PDF files using Poppler utilities.
pkg:composer/shibashish/pdf-reader
Requires
- php: ^8.2
- illuminate/process: ^10.0|^11.0|^12.0
- illuminate/support: ^10.0|^11.0|^12.0
Requires (Dev)
- orchestra/testbench: ^8.0|^9.0
- pestphp/pest: ^2.0|^3.0
- pestphp/pest-plugin-laravel: ^2.0|^3.0
A comprehensive, production-ready Laravel package for extracting content from PDF files using Poppler utilities. This package provides a secure, type-safe interface for PDF manipulation with extensive error handling and validation.
Table of Contents
- Overview
- Features
- System Requirements
- Dependencies
- Installation
- Configuration
- Usage Guide
- Exception Handling
- Testing
- Architecture
- Troubleshooting
- License
Overview
The PDF Reader Package wraps the powerful Poppler command-line utilities in a clean, Laravel-friendly API. It handles PDF text extraction, HTML conversion, image extraction, and metadata retrieval with built-in validation, security, and error handling.
Why This Package?
- Secure: Uses Laravel's Process facade instead of unsafe shell_exec
- Validated: Checks file existence, readability, and PDF format before processing
- Type-Safe: Full PHP 8.2+ type hints for better IDE support
- Cross-Platform: Works on Windows, macOS, and Linux
- Well-Tested: Comprehensive Pest test suite included
- Production-Ready: Proper exception handling and logging support
Features
Core Functionality
- Text Extraction - Extract plain text from PDFs with optional page ranges
- HTML Conversion - Convert PDFs to HTML while preserving layout
- Image Extraction - Extract all embedded images from PDFs
- Metadata Retrieval - Get PDF properties (author, title, page count, etc.)
Advanced Features
- Page Range Support - Extract specific pages (e.g., "1-5", "3-10")
- Input Validation - Automatic file existence and PDF format validation
- Secure Execution - Uses Laravel Process facade for safe command execution
- Custom Exceptions - Specific exceptions for different error scenarios
- File Management - Option to keep or auto-delete temporary files
- Cross-Platform - Proper path handling for all operating systems
System Requirements
Required Software
- PHP: 8.2 or higher
- Laravel: 10.0 or higher
- Poppler Utilities: All binaries must be installed and accessible
Poppler Binaries
The package requires the following Poppler command-line tools:
- pdftotext - Text extraction
- pdftohtml - HTML conversion
- pdfinfo - Metadata retrieval
- pdfimages - Image extraction
Dependencies
Installing Poppler Utilities
Ubuntu/Debian
sudo apt-get update
sudo apt-get install poppler-utils
Verify installation:
pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v
macOS
Using Homebrew:
brew install poppler
Verify installation:
which pdftotext
which pdftohtml
which pdfinfo
which pdfimages
Windows
- Download Poppler for Windows from GitHub Releases
- Extract the archive to a permanent location (e.g., C:\Program Files\poppler)
- Add the bin directory to your system PATH:
  - Right-click "This PC" → Properties → Advanced system settings
  - Environment Variables → System variables → Path → Edit
  - Add: C:\Program Files\poppler\Library\bin
- Restart your terminal/IDE
Verify installation:
pdftotext -v
pdftohtml -v
pdfinfo -v
pdfimages -v
Laravel Dependencies
This package uses the following Laravel features:
- Illuminate\Support\Facades\Process - For secure command execution
- Illuminate\Support\ServiceProvider - For package registration
- Illuminate\Support\Facades\Facade - For the PdfReader facade
All dependencies are included in Laravel 10+.
Installation
Step 1: Package Location
This package is located at:
packages/shibashish/pdf-reader
It's already configured in your main composer.json under autoload-dev.
Step 2: Publish Configuration
Publish the package configuration file to your Laravel application:
php artisan vendor:publish --tag=pdf-reader-config
This creates config/pdf-reader.php with default settings.
Step 3: Configure Binary Paths (Optional)
If Poppler binaries are not in your system PATH, specify full paths in .env:
PDFTOTEXT_BINARY=/usr/bin/pdftotext
PDFTOHTML_BINARY=/usr/bin/pdftohtml
PDFINFO_BINARY=/usr/bin/pdfinfo
PDFIMAGES_BINARY=/usr/bin/pdfimages
Windows Example:
PDFTOTEXT_BINARY="C:\Program Files\poppler\Library\bin\pdftotext.exe"
PDFTOHTML_BINARY="C:\Program Files\poppler\Library\bin\pdftohtml.exe"
PDFINFO_BINARY="C:\Program Files\poppler\Library\bin\pdfinfo.exe"
PDFIMAGES_BINARY="C:\Program Files\poppler\Library\bin\pdfimages.exe"
Step 4: Create Storage Directories
The package auto-creates these directories when needed, but you can create them manually:
mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}
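The same layout can be created from PHP. The sketch below is a rough equivalent of `mkdir -p`; `ensureOutputDirectories` is a hypothetical helper written for illustration, not a function the package exposes, and the package's own directory handling may differ.

```php
<?php

/**
 * Hypothetical helper: create the texts/htmls/images folders under $base,
 * roughly what `mkdir -p` does on the command line.
 *
 * @return list<string> the directories that exist after the call
 */
function ensureOutputDirectories(string $base): array
{
    $directories = [];

    foreach (['texts', 'htmls', 'images'] as $name) {
        $dir = rtrim($base, '/\\') . DIRECTORY_SEPARATOR . $name;

        // The second is_dir() check covers the case where another process
        // creates the directory between the first check and mkdir().
        if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
            throw new RuntimeException("Unable to create directory: {$dir}");
        }

        $directories[] = $dir;
    }

    return $directories;
}

// Demonstration against a scratch folder in the system temp directory.
$demoBase = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'pdf-reader-demo';
print_r(ensureOutputDirectories($demoBase));
```

In a Laravel application the base path would presumably be `storage_path('app/public/pdf-reader')`.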
Configuration
Configuration File
The published config/pdf-reader.php file contains:
<?php

return [
    // Path to pdftotext binary
    'pdftotext_binary' => env('PDFTOTEXT_BINARY', 'pdftotext'),

    // Path to pdftohtml binary
    'pdftohtml_binary' => env('PDFTOHTML_BINARY', 'pdftohtml'),

    // Path to pdfinfo binary
    'pdfinfo_binary' => env('PDFINFO_BINARY', 'pdfinfo'),

    // Path to pdfimages binary
    'pdfimages_binary' => env('PDFIMAGES_BINARY', 'pdfimages'),
];
Configuration Options
| Key | Default | Description |
|---|---|---|
| pdftotext_binary | pdftotext | Path to pdftotext executable |
| pdftohtml_binary | pdftohtml | Path to pdftohtml executable |
| pdfinfo_binary | pdfinfo | Path to pdfinfo executable |
| pdfimages_binary | pdfimages | Path to pdfimages executable |
Note: If binaries are in your system PATH, you can use just the binary name. Otherwise, provide the full absolute path.
Usage Guide
Import the Facade
use Shibashish\PdfReader\Facades\PdfReader;
Text Extraction
Basic Text Extraction
Extract all text from a PDF:
$text = PdfReader::extractText('/path/to/document.pdf');

echo $text; // Plain text content
Extract Specific Pages
Extract text from pages 1 to 5:
$text = PdfReader::extractText('/path/to/document.pdf', pages: '1-5');
Extract text from a single page:
$text = PdfReader::extractText('/path/to/document.pdf', pages: '3');
Keep Output File
By default, temporary files are deleted. To keep them:
$text = PdfReader::extractText(
    '/path/to/document.pdf',
    keepFile: true
);

// File saved to: storage/app/public/pdf-reader/texts/pdf-text-{timestamp}.txt
Method Signature
public function extractText(
    string $pdfPath,          // Path to PDF file
    bool $keepFile = false,   // Keep temporary file?
    ?string $pages = null     // Page range (e.g., "1-5")
): ?string
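The `pages` argument is a string such as "1-5" or "3". Poppler's command-line tools select pages with `-f` (first page) and `-l` (last page), so a range has to be translated at some point. How the package does this is not shown here; the sketch below is one plausible approach, and `pageRangeToFlags` is a hypothetical name.

```php
<?php

/**
 * Hypothetical helper (not part of the package): turn a page range such as
 * "1-5" or "3" into Poppler's -f (first page) and -l (last page) flags.
 *
 * @return list<string>
 */
function pageRangeToFlags(?string $pages): array
{
    if ($pages === null || trim($pages) === '') {
        return []; // no range given: Poppler processes every page
    }

    // Accept "N" or "N-M" only; anything else is rejected.
    if (!preg_match('/^\s*(\d+)\s*(?:-\s*(\d+))?\s*$/', $pages, $m)) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }

    $first = (int) $m[1];
    $last = isset($m[2]) && $m[2] !== '' ? (int) $m[2] : $first;

    if ($first < 1 || $last < $first) {
        throw new InvalidArgumentException("Invalid page range: {$pages}");
    }

    return ['-f', (string) $first, '-l', (string) $last];
}

echo implode(' ', pageRangeToFlags('1-5')), PHP_EOL; // -f 1 -l 5
```

Rejecting malformed ranges before a command is built keeps the failure close to the caller rather than surfacing as a Poppler error message.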
HTML Conversion
Basic HTML Conversion
Convert entire PDF to HTML:
$html = PdfReader::extractHtml('/path/to/document.pdf');
Convert Specific Pages
$html = PdfReader::extractHtml('/path/to/document.pdf', pages: '1-3');
Keep Output File
$html = PdfReader::extractHtml(
    '/path/to/document.pdf',
    keepFile: true
);

// File saved to: storage/app/public/pdf-reader/htmls/pdf-html-{timestamp}.html
Method Signature
public function extractHtml(
    string $pdfPath,
    bool $keepFile = false,
    ?string $pages = null
): ?string
Image Extraction
Extract All Images
$images = PdfReader::extractImages('/path/to/document.pdf');

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/temp/file.jpg',
//         'data' => <binary image data>
//     ],
//     [
//         'name' => 'pdf-img-123456789-001.png',
//         'path' => '/full/path/to/temp/file.png',
//         'data' => <binary image data>
//     ]
// ]
Keep Image Files
$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

// Returns array:
// [
//     [
//         'name' => 'pdf-img-123456789-000.jpg',
//         'path' => '/full/path/to/storage/app/public/pdf-reader/images/pdf-img-123456789-000.jpg'
//     ]
// ]
Extract from Specific Pages
$images = PdfReader::extractImages('/path/to/document.pdf', pages: '1-5');
Save Images to Custom Location
$images = PdfReader::extractImages('/path/to/document.pdf', keepFiles: true);

foreach ($images as $image) {
    // Copy to custom location
    copy($image['path'], public_path('images/' . $image['name']));
}
Method Signature
public function extractImages(
    string $pdfPath,
    bool $keepFiles = false,
    ?string $pages = null
): array
Metadata Retrieval
Get PDF Information
$info = PdfReader::getInfo('/path/to/document.pdf');

print_r($info);
// Array
// (
//     [Title] => Sample Document
//     [Author] => John Doe
//     [Creator] => Microsoft Word
//     [Producer] => Adobe PDF Library
//     [CreationDate] => Mon Dec 9 10:30:45 2024 IST
//     [ModDate] => Mon Dec 9 11:00:00 2024 IST
//     [Tagged] => no
//     [UserProperties] => no
//     [Suspects] => no
//     [Form] => none
//     [JavaScript] => no
//     [Pages] => 25
//     [Encrypted] => no
//     [Page size] => 612 x 792 pts (letter)
//     [Page rot] => 0
//     [File size] => 1234567 bytes
//     [Optimized] => no
//     [PDF version] => 1.7
// )
Access Specific Metadata
$info = PdfReader::getInfo('/path/to/document.pdf');

$pageCount = $info['Pages'] ?? 0;
$author = $info['Author'] ?? 'Unknown';
$title = $info['Title'] ?? 'Untitled';
Method Signature
public function getInfo(string $pdfPath): array
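The array above corresponds to the `Key: value` lines that `pdfinfo` prints. As an illustration of how such output can be turned into an array, here is a minimal sketch; `parsePdfInfo` is a hypothetical function and the package's own parser may differ.

```php
<?php

/**
 * Hypothetical parser (the package's own implementation may differ):
 * convert raw `pdfinfo` output into an associative array.
 *
 * @return array<string, string>
 */
function parsePdfInfo(string $output): array
{
    $info = [];

    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only, so values such as
        // "Mon Dec 9 10:30:45 2024 IST" keep their own colons.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue; // skip blank or malformed lines
        }

        $key = trim($parts[0]);
        if ($key !== '') {
            $info[$key] = trim($parts[1]);
        }
    }

    return $info;
}

$sample = "Title:          Sample Document\n"
    . "Pages:          25\n"
    . "CreationDate:   Mon Dec 9 10:30:45 2024 IST\n";

$info = parsePdfInfo($sample);
echo $info['Pages'], PHP_EOL; // 25
```

Every value comes back as a string, so a page count needs an explicit `(int)` cast before arithmetic.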
Exception Handling
The package throws specific exceptions for different error scenarios.
Exception Hierarchy
Exception
└── PdfReaderException (base)
    ├── InvalidPdfException
    └── BinaryNotFoundException
InvalidPdfException
Thrown when:
- File doesn't exist
- File is not readable
- File is not a valid PDF
use Shibashish\PdfReader\Exceptions\InvalidPdfException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (InvalidPdfException $e) {
    echo $e->getMessage();
    // "The file '/path/to/file.pdf' does not exist."
    // "The file '/path/to/file.pdf' is not readable."
    // "The file '/path/to/file.pdf' is not a valid PDF."
}
BinaryNotFoundException
Thrown when a required Poppler binary is not found:
use Shibashish\PdfReader\Exceptions\BinaryNotFoundException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (BinaryNotFoundException $e) {
    echo $e->getMessage();
    // "The required binary 'pdftotext' was not found or is not executable."
}
PdfReaderException
Thrown for general extraction errors:
use Shibashish\PdfReader\Exceptions\PdfReaderException;

try {
    $text = PdfReader::extractText('/path/to/file.pdf');
} catch (PdfReaderException $e) {
    echo $e->getMessage();
    // "Failed to extract text: [error details]"
}
Complete Exception Handling
use Illuminate\Support\Facades\Log;
use Shibashish\PdfReader\Facades\PdfReader;
use Shibashish\PdfReader\Exceptions\{
    InvalidPdfException,
    BinaryNotFoundException,
    PdfReaderException
};

try {
    $text = PdfReader::extractText($pdfPath);
} catch (InvalidPdfException $e) {
    // Handle invalid file
    Log::error('Invalid PDF file', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Invalid PDF file'], 400);
} catch (BinaryNotFoundException $e) {
    // Handle missing binary
    Log::critical('PDF binary not found', ['error' => $e->getMessage()]);
    return response()->json(['error' => 'Server configuration error'], 500);
} catch (PdfReaderException $e) {
    // Handle extraction error
    Log::error('PDF extraction failed', ['path' => $pdfPath, 'error' => $e->getMessage()]);
    return response()->json(['error' => 'Failed to process PDF'], 500);
}
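The order of the catch blocks matters: PHP uses the first block whose type matches, so the specific exceptions must come before the base class. The sketch below shows this with stand-in classes. The `Demo*` names are hypothetical, and it assumes that both specific exceptions extend `PdfReaderException`, as the hierarchy above suggests.

```php
<?php

// Stand-in classes mirroring the documented hierarchy; the real ones live in
// the Shibashish\PdfReader\Exceptions namespace.
class DemoPdfReaderException extends Exception {}
class DemoInvalidPdfException extends DemoPdfReaderException {}
class DemoBinaryNotFoundException extends DemoPdfReaderException {}

/** Run an operation and map any failure to an HTTP status code. */
function statusFor(callable $operation): int
{
    try {
        $operation();
        return 200;
    } catch (DemoInvalidPdfException $e) {
        return 400; // the caller supplied a bad file
    } catch (DemoBinaryNotFoundException $e) {
        return 500; // the server is misconfigured
    } catch (DemoPdfReaderException $e) {
        return 500; // any other extraction failure
    }
}

echo statusFor(fn () => throw new DemoInvalidPdfException('not a PDF')), PHP_EOL; // 400
```

If the base-class catch were listed first, it would swallow every subclass and the 400 branch would never run.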
Testing
The package includes comprehensive Pest tests.
Run Package Tests
From your Laravel project root:
# Run only PDF Reader tests
php artisan test --filter=PdfReader

# Run all tests
php artisan test
Test Coverage
The test suite covers:
- Text extraction with validation
- HTML conversion with page ranges
- Metadata retrieval and parsing
- Image extraction
- Exception handling (invalid files, missing binaries)
- Directory creation
- Cross-platform path handling
Example Test Output
PASS Tests\Feature\PdfReaderTest
✓ extract text runs correct command
✓ extract text with page range
✓ get info returns parsed data
✓ throws exception if file not found
✓ throws exception if not a pdf
✓ creates output directory
✓ extract images returns array
Tests: 7 passed (13 assertions)
Duration: 1.14s
Architecture
Package Structure
packages/shibashish/pdf-reader/
├── config/
│   └── pdf-reader.php                  # Configuration file
├── src/
│   ├── Exceptions/
│   │   ├── PdfReaderException.php
│   │   ├── InvalidPdfException.php
│   │   └── BinaryNotFoundException.php
│   ├── Facades/
│   │   └── PdfReader.php               # Laravel facade
│   ├── PdfReaderService.php            # Main service class
│   └── PdfReaderServiceProvider.php
├── tests/
│   └── PdfReaderTest.php               # Pest tests
├── composer.json
└── README.md
Service Provider
The PdfReaderServiceProvider registers the service as a singleton:
$this->app->singleton('pdf-reader', function () {
    return new PdfReaderService;
});
Facade
The PdfReader facade provides static access:
PdfReader::extractText($path);

// Resolves to:
app('pdf-reader')->extractText($path);
Service Class
PdfReaderService handles all PDF operations:
- Input validation
- Command building
- Process execution
- Error handling
- Output parsing
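To make the command-building step concrete, here is a minimal sketch of assembling a `pdftotext` invocation as an argument array. `buildPdftotextCommand` is a hypothetical function and the package's real builder is not shown in this README. The argument order follows `pdftotext [options] PDF-file [text-file]`.

```php
<?php

/**
 * Hypothetical command builder: returns the pdftotext invocation as a list
 * of arguments instead of a single shell string.
 *
 * @return list<string>
 */
function buildPdftotextCommand(
    string $binary,
    string $pdfPath,
    string $outputPath,
    ?int $firstPage = null,
    ?int $lastPage = null
): array {
    $command = [$binary];

    if ($firstPage !== null) {
        array_push($command, '-f', (string) $firstPage);
    }
    if ($lastPage !== null) {
        array_push($command, '-l', (string) $lastPage);
    }

    // Options first, then the input PDF, then the output text file.
    $command[] = $pdfPath;
    $command[] = $outputPath;

    return $command;
}

$command = buildPdftotextCommand('pdftotext', '/tmp/my file.pdf', '/tmp/out.txt', 1, 5);
print_r($command);
```

Keeping each argument as its own array element means a path containing spaces or quotes needs no escaping when the array is handed to a runner that bypasses the shell, such as PHP's `proc_open` with an array command.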
Troubleshooting
Binary Not Found
Error: BinaryNotFoundException: The required binary 'pdftotext' was not found
Solutions:
- Verify Poppler is installed: which pdftotext (Linux/Mac) or where pdftotext (Windows)
- Add binary paths to .env: PDFTOTEXT_BINARY=/usr/bin/pdftotext
- Ensure binaries are in system PATH
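The PATH check can also be done from PHP, which helps when the web server's environment differs from your shell's. The following is a diagnostic sketch; `findBinary` is a hypothetical helper and not part of the package.

```php
<?php

/**
 * Hypothetical helper: locate an executable by scanning a PATH-style string.
 * Returns the full path, or null when nothing executable is found.
 */
function findBinary(string $name, ?string $pathEnv = null): ?string
{
    // Fall back to the current process's PATH when none is supplied.
    $pathEnv ??= (string) getenv('PATH');

    // On Windows the executables carry an .exe suffix.
    $suffixes = PHP_OS_FAMILY === 'Windows' ? ['', '.exe'] : [''];

    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($suffixes as $suffix) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }

    return null;
}

foreach (['pdftotext', 'pdftohtml', 'pdfinfo', 'pdfimages'] as $binary) {
    echo $binary, ': ', findBinary($binary) ?? 'not found', PHP_EOL;
}
```

Running this through the same PHP runtime that serves the application shows which of the four binaries that process can actually see.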
Permission Denied
Error: InvalidPdfException: The file is not readable
Solutions:
- Check file permissions: ls -la /path/to/file.pdf
- Ensure web server user has read access: chmod 644 /path/to/file.pdf
Invalid PDF
Error: InvalidPdfException: The file is not a valid PDF
Solutions:
- Verify file is actually a PDF: file /path/to/file.pdf
- Check file isn't corrupted
- Ensure file has proper PDF header (%PDF-)
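The header check in the last step is easy to reproduce by hand. The sketch below reads the first five bytes of a file and compares them with the `%PDF-` signature; `looksLikePdf` is a hypothetical function, and the package's own validation may be stricter or implemented differently.

```php
<?php

/**
 * Hypothetical check (not the package's actual validator): report whether a
 * file exists, is readable, and begins with the "%PDF-" signature.
 */
function looksLikePdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    $header = fread($handle, 5); // "%PDF-" is five bytes long
    fclose($handle);

    return $header === '%PDF-';
}

// Demonstration with throwaway files in the system temp directory.
$pdfFile = tempnam(sys_get_temp_dir(), 'pdf');
$textFile = tempnam(sys_get_temp_dir(), 'txt');
file_put_contents($pdfFile, "%PDF-1.7\n");
file_put_contents($textFile, 'plain text');

var_dump(looksLikePdf($pdfFile));  // bool(true)
var_dump(looksLikePdf($textFile)); // bool(false)

unlink($pdfFile);
unlink($textFile);
```

A matching header only shows that the file claims to be a PDF; a truncated or corrupted document can still fail during extraction.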
Output Directory Not Created
Error: Permission issues with storage/app/public/pdf-reader
Solutions:
- Ensure storage directory is writable:
  chmod -R 775 storage
  chown -R www-data:www-data storage
- Create directories manually:
  mkdir -p storage/app/public/pdf-reader/{texts,htmls,images}
Windows Path Issues
Error: Mixed path separators causing issues
Solution: The package uses DIRECTORY_SEPARATOR for cross-platform compatibility. Ensure you're using the latest version.
Output Files
Storage Locations
When keepFile: true or keepFiles: true, extracted files are saved to:
| Type | Location |
|---|---|
| Text | storage/app/public/pdf-reader/texts/ |
| HTML | storage/app/public/pdf-reader/htmls/ |
| Images | storage/app/public/pdf-reader/images/ |
File Naming Convention
- Text: pdf-text-{timestamp}.txt
- HTML: pdf-html-{timestamp}.html
- Images: pdf-img-{timestamp}-{number}.{ext}
Accessing Saved Files
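// Text file
$text = PdfReader::extractText($path, keepFile: true);

// The saved name embeds the timestamp taken during extraction, so time() here is
// only a placeholder and may not match; substitute the actual file name.
$filePath = storage_path('app/public/pdf-reader/texts/pdf-text-' . time() . '.txt');

// Make publicly accessible (requires the public storage symlink: php artisan storage:link)
$url = asset('storage/pdf-reader/texts/pdf-text-' . time() . '.txt');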
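Because `extractText` returns the text rather than the saved path, one way to find a kept file is to look for the newest file that matches the naming convention. The sketch below does this with `glob`; `latestKeptFile` is a hypothetical helper, not part of the package.

```php
<?php

/**
 * Hypothetical helper: return the most recently modified file in $dir whose
 * name matches $pattern (for example "pdf-text-*.txt"), or null if none exist.
 */
function latestKeptFile(string $dir, string $pattern): ?string
{
    $files = glob(rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $pattern) ?: [];
    if ($files === []) {
        return null;
    }

    // Sort newest first by modification time.
    usort($files, fn (string $a, string $b): int => filemtime($b) <=> filemtime($a));

    return $files[0];
}

// Prints NULL when the directory holds no matching file.
var_dump(latestKeptFile(sys_get_temp_dir(), 'pdf-text-*.txt'));
```

In a Laravel application the directory would presumably be `storage_path('app/public/pdf-reader/texts')`. Concurrent requests can each write a file, so the newest file is not guaranteed to belong to the current request.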
Best Practices
1. Always Handle Exceptions
try {
    $result = PdfReader::extractText($path);
} catch (PdfReaderException $e) {
    // Log and handle appropriately
}
2. Validate Input Before Processing
if (!file_exists($path)) {
    throw new \InvalidArgumentException('File not found');
}

$text = PdfReader::extractText($path);
3. Clean Up Temporary Files
// Default behavior - auto-cleanup
$text = PdfReader::extractText($path); // Temp file deleted

// Or manually manage
$text = PdfReader::extractText($path, keepFile: true);
// Process the file...
// Then delete manually if needed
4. Use Page Ranges for Large PDFs
// Extract first 10 pages only
$text = PdfReader::extractText($largePdf, pages: '1-10');
5. Configure Binaries in Environment
# Development
PDFTOTEXT_BINARY=pdftotext

# Production (absolute paths)
PDFTOTEXT_BINARY=/usr/bin/pdftotext
License
MIT License
Copyright (c) 2024 Shibashish
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.