falkemedia / pdf-extractor
This package automates the generation of an SQLite database that you can use to do a full-text search on a PDF.
Requires
- php: ^7.1
- ext-imagick: *
- ext-sqlite3: *
- intervention/image: ^2.5
- spatie/pdf-to-text: ^1.3.0
Requires (Dev)
- phpunit/phpunit: ^7.0
Last update: 2025-02-13 00:13:48 UTC
README
This package automates the generation of an SQLite database that you can use to run a full-text search on a PDF. In other words, you take your PDF, use this tool to generate a database, and then query the database rather than the PDF for any text search.
This tool also generates thumbnails that you can use to display your search results however you like.
This is heavily inspired by spatie/pdf-to-image and depends on spatie/pdf-to-text.
Installation
You can install the package via composer:
composer require falkemedia/pdf-extractor
This package requires ImageMagick and the imagick PHP extension to be installed.
Instructions for macOS Catalina + PHP 7.3:
brew install imagemagick
pecl install imagick
If you run into any errors with ImageMagick, I suggest reading through this guide.
Also, behind the scenes this package leverages pdftotext. On a Mac you can install the binary using Homebrew:
brew install poppler
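Before running the extractor, it can help to confirm that the prerequisites listed above are actually present. The following is a minimal sketch and not part of the package: the function name `missingPrerequisites` is hypothetical, and the `command -v` lookup assumes a Unix-like shell (macOS or Linux).

```php
<?php

/**
 * Hypothetical helper (not part of falkemedia/pdf-extractor).
 * Returns the names of all PHP extensions and binaries that could not be found.
 */
function missingPrerequisites(array $extensions, string $binary): array
{
    $missing = [];

    // Check each required PHP extension by name.
    foreach ($extensions as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = 'ext-' . $extension;
        }
    }

    // `command -v` exits with 0 when the binary is found on the PATH.
    // Assumption: a POSIX shell is available.
    exec('command -v ' . escapeshellarg($binary) . ' 2>/dev/null', $output, $exitCode);
    if ($exitCode !== 0) {
        $missing[] = $binary;
    }

    return $missing;
}

$missing = missingPrerequisites(['imagick', 'sqlite3'], 'pdftotext');

echo $missing === []
    ? "All prerequisites found.\n"
    : 'Missing: ' . implode(', ', $missing) . "\n";
```

If the script reports `ext-imagick` or `pdftotext` as missing, the install commands above are the place to start.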
Usage
examples/extract_pdf_data.php
<?php

namespace falkemedia\PdfExtractor\Examples;

use falkemedia\PdfExtractor\Extractor;

require 'vendor/autoload.php';

// Load PDF
$extractor = new Extractor();
$extractor->load('/path/to/a/pdf/file.pdf');

// Generate thumbnails
$extractor
    ->setMaxThumbnailHeight(600)
    ->setMaxThumbnailWidth(480)
    ->setQuality(75)
    ->generateThumbnails();

// Store full-text information
$extractor->generateTextDatabase();
Once you have a saved SQLite database, you can run full-text queries against it, for example:
SELECT * FROM pages WHERE body MATCH '*YOUR_SEARCH_QUERY*'
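The query above can also be run from PHP with the `SQLite3` class that `ext-sqlite3` provides. The sketch below is an illustration under stated assumptions, not the package's own code: it builds a small in-memory FTS4 table as a stand-in for the generated database (the real schema may differ, and your SQLite build must have FTS enabled), and the helper `searchPages` is a hypothetical name. Only the table name `pages` and the column `body` come from the example query.

```php
<?php

/**
 * Hypothetical helper: runs a prefix full-text search on the `pages` table.
 * The search term is bound as a parameter instead of being concatenated into the SQL.
 */
function searchPages(SQLite3 $db, string $term): array
{
    $stmt = $db->prepare('SELECT rowid AS id, body FROM pages WHERE body MATCH :query');
    // A trailing * turns the term into an FTS prefix query.
    $stmt->bindValue(':query', $term . '*', SQLITE3_TEXT);

    $result = $stmt->execute();
    $rows = [];
    while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
        $rows[] = $row;
    }

    return $rows;
}

// Stand-in for the generated database. Assumption: an FTS4 table with one
// `body` column; in practice you would open the file the package created.
$db = new SQLite3(':memory:');
$db->exec('CREATE VIRTUAL TABLE pages USING fts4(body)');

$insert = $db->prepare('INSERT INTO pages (body) VALUES (:body)');
foreach (['Invoice for consulting services', 'Shipping and delivery terms'] as $body) {
    $insert->bindValue(':body', $body, SQLITE3_TEXT);
    $insert->execute();
    $insert->reset();
}

$hits = searchPages($db, 'deliv');

foreach ($hits as $hit) {
    echo $hit['id'] . ': ' . $hit['body'] . "\n";
}
```

Binding the term keeps user input out of the SQL string, but characters that have special meaning in FTS query syntax (such as quotes) may still need to be stripped or escaped before they are passed in.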
Testing
composer test
Changelog
Please see CHANGELOG for more information on what has changed recently.
Contributing
Please see CONTRIBUTING for details.
Security
If you discover any security-related issues, please email tg@falkemedia.de instead of using the issue tracker.
Credits
License
The MIT License (MIT). Please see License File for more information.
PHP Package Boilerplate
This package was generated using the PHP Package Boilerplate.