falkemedia/pdf-extractor

This package automates the generation of an SQLite database that you can use to do a full-text search on a PDF.

0.0.3 2020-08-12 13:30 UTC

This package is auto-updated.

Last update: 2024-04-12 21:47:49 UTC


README


This package automates the generation of an SQLite database that you can use to do a full-text search on a PDF. In other words, you take your PDF, use this tool to generate a database, and then query the database rather than the PDF for any text search.

This tool also generates thumbnails that you can use to display your search results however you like.

This package is heavily inspired by spatie/pdf-to-image and depends on spatie/pdf-to-text.

Installation

You can install the package via composer:

composer require falkemedia/pdf-extractor

This package requires ImageMagick and the imagick PHP extension.
Instructions for macOS Catalina + PHP 7.3:

brew install imagemagick 
pecl install imagick

If you run into any errors with ImageMagick, I suggest reading through this guide.

Also, behind the scenes this package leverages pdftotext. On a Mac you can install the binary using Homebrew:

brew install poppler
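
Before running the extractor, it can help to confirm that both requirements are actually available. The following is a minimal sketch, not part of the package: the function name missingRequirements is hypothetical, and the check for pdftotext assumes a POSIX shell where `command -v` is available and that shell_exec is not disabled.

```php
<?php

// Hypothetical pre-flight check (not provided by falkemedia/pdf-extractor).
// Returns a list of human-readable names of requirements that appear to be missing.
function missingRequirements(): array
{
    $missing = [];

    // The imagick PHP extension is installed via `pecl install imagick`.
    if (!extension_loaded('imagick')) {
        $missing[] = 'imagick PHP extension';
    }

    // Assumption: a POSIX shell, where `command -v` prints the path of a binary
    // found on the PATH and prints nothing otherwise.
    $path = shell_exec('command -v pdftotext');
    if (!is_string($path) || trim($path) === '') {
        $missing[] = 'pdftotext binary (poppler)';
    }

    return $missing;
}

foreach (missingRequirements() as $item) {
    echo "Missing: {$item}" . PHP_EOL;
}
```

If the script prints nothing, both the extension and the binary were found.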

Usage

examples/extract_pdf_data.php

<?php

namespace falkemedia\PdfExtractor\Examples;

use falkemedia\PdfExtractor\Extractor;

require 'vendor/autoload.php';

// Load PDF
$extractor = new Extractor();
$extractor->load('/path/to/a/pdf/file.pdf');

// Generate thumbnails
$extractor
    ->setMaxThumbnailHeight(600)
    ->setMaxThumbnailWidth(480)
    ->setQuality(75)
    ->generateThumbnails();

// Store the full-text data
$extractor->generateTextDatabase();

Once you have a saved SQLite database, you can run full-text queries such as:

SELECT * FROM pages WHERE body MATCH "*YOUR_SEARCH_QUERY*"
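
The query above can also be run from PHP through PDO. The following is a minimal sketch under stated assumptions: the helper name searchPages and the database path are hypothetical, and it assumes that pages is an SQLite full-text (FTS) table with a body column, as the query above suggests. The actual schema generated by the package may contain further columns.

```php
<?php

// Hypothetical helper (not part of the package): runs a full-text query
// against the `pages` table and returns the matching rows.
// Assumption: `pages` is an SQLite FTS table with a `body` column.
function searchPages(PDO $db, string $term): array
{
    // Bind the search term as a parameter instead of concatenating it into the SQL.
    $statement = $db->prepare('SELECT * FROM pages WHERE body MATCH :term');
    $statement->execute([':term' => $term]);

    return $statement->fetchAll(PDO::FETCH_ASSOC);
}

// Usage sketch; the path below is a placeholder for your generated database:
// $db = new PDO('sqlite:/path/to/your/database.sqlite');
// $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// foreach (searchPages($db, 'YOUR_SEARCH_QUERY') as $page) {
//     print_r($page);
// }
```

Binding the term keeps user input out of the SQL string, which matters when the search box is exposed to visitors.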

Testing

composer test

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email tg@falkemedia.de instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.

PHP Package Boilerplate

This package was generated using the PHP Package Boilerplate.