divido/pdf-to-img

There is no license information available for the latest version (0.0.4) of this package.

Library to convert PDFs into images.

This package's canonical repository appears to be gone and the package has been frozen as a result.

0.0.4 2019-06-26 14:32 UTC

README

This library helps to convert PDF documents to images.

Dependencies

For ImageMagick support (the convert tool), you will need ImageMagick and Ghostscript installed. On macOS, you can most easily install these with the following command:

$ brew install imagemagick gs

For PDFtoPPM (Poppler) support, you will need the Poppler suite installed. On macOS, this can most easily be installed with the following command:

$ brew install poppler

These instructions are valid as of writing (2019-08-05); if they do not work for you, please update this documentation and submit a PR.

You'll also need to install the PHP dependencies with Composer:

$ composer install

Testing

This library has a test suite. You will need the appropriate dependencies installed as discussed above, and you can then run the tests with:

$ vendor/bin/phpunit

If you haven't installed the dependencies, you can expect some weird test failures. Please check that you have installed them correctly before reporting any issues.
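One way to rule out missing dependencies before running the suite is to look for the binaries on your PATH. The sketch below is not part of this library: findBinary() is a hypothetical helper, and the binary names (convert, gs, pdftoppm) are assumed to be the ones your ImageMagick, Ghostscript and Poppler installs provide.

```php
<?php
// Hypothetical helper, not part of this library: return the full path of an
// executable found on PATH, or null when it cannot be found.
function findBinary(string $name): ?string
{
    $path = getenv('PATH');
    if ($path === false || $path === '') {
        return null;
    }

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue;
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// convert and gs back the ImageMagick engine; pdftoppm backs the Poppler engine.
foreach (['convert', 'gs', 'pdftoppm'] as $binary) {
    $found = findBinary($binary);
    echo $binary . ': ' . ($found ?? 'not found on PATH') . PHP_EOL;
}
```

If a binary is reported as missing, install it as described under Dependencies before reporting a test failure.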

PDF Sources

A source object contains the PDF data in one form or another. All source constructors require the PDF source (in various forms, documented below) and a filename to use for the saved PDF/image prefixes.

The bundled source options for the converter are:

1. Buffer

This source is available for when the entire PDF has been read into a string variable. E.g.

$pdf_source = file_get_contents('example.pdf');
$source = new Buffer($pdf_source, 'example.pdf');

2. BufferBase64

This source is available for when the entire PDF has been read into a string variable in Base64 encoding. E.g.

$pdf_source = base64_encode(file_get_contents('example.pdf'));
$source = new BufferBase64($pdf_source, 'example.pdf');
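When the Base64 string comes from outside your application, it can help to check it before building the source. The following is a minimal sketch, not part of this library: looksLikeBase64Pdf() is a hypothetical helper, and it assumes the PDF begins with the usual "%PDF-" marker.

```php
<?php
// Hypothetical pre-flight check, not part of this library.
function looksLikeBase64Pdf(string $encoded): bool
{
    // In strict mode base64_decode() returns false when the input contains
    // characters outside the Base64 alphabet.
    $decoded = base64_decode($encoded, true);
    if ($decoded === false) {
        return false;
    }

    // PDF files normally start with the "%PDF-" marker.
    return strncmp($decoded, '%PDF-', 5) === 0;
}

// Stand-in bytes so the sketch runs without a real PDF on disk.
$encoded = base64_encode("%PDF-1.4\n% stand-in for real PDF bytes\n");

var_dump(looksLikeBase64Pdf($encoded));
var_dump(looksLikeBase64Pdf('not base64!'));
```

Only a string that passes this check would then be handed to BufferBase64 as shown above.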

3. Stream

This source is available for when the PDF is available from a PSR-7 stream. This is useful if the PDF has been downloaded using Guzzle. E.g.

// Assume the file has been downloaded from S3.
$response = $s3->getObject([
    'Bucket' => 'your-bucket',
    'Key' => '/folder/example.pdf',
]);

// $response['Body'] is now a Guzzle stream which implements the PSR-7 StreamInterface
$source = new Stream($response['Body'], 'example.pdf');

4. FileResource

This source is available for when the PDF is available from a file pointer. E.g.

$fp = fopen('example.pdf', 'r');
$source = new FileResource($fp, 'example.pdf');

Conversion Engines

A conversion engine performs the conversion of a PDF to a sequence of images. All engines allow for option setting.

The bundled conversion engines are:

1. ConvertBinaryEngine

The ConvertBinaryEngine uses the convert binary on your system (provided by ImageMagick) to perform the conversion.

Note that some of the options may not be available on your system, depending on the version of ImageMagick/convert you have installed.

$engine = EngineFactory::GetEngine('convert-binary');

// Optionally set arguments. @see https://www.imagemagick.org/script/convert.php for CLI options.
$engine->withOptions([
    '-quality' => '100',
]);

2. PpmToPdfBinaryEngine

The PpmToPdfBinaryEngine uses the pdftoppm binary on your system (provided by Poppler) to perform the conversion.

Note that some of the options may not be available on your system, depending on the version of Poppler/pdftoppm you have installed. Also note that you should not set the image type in the options, as this is handled higher up in the wrapper.

$engine = EngineFactory::GetEngine('pdftoppm-binary');

// Optionally set arguments. @see http://manpages.ubuntu.com/manpages/yakkety/man1/pdftoppm.1.html for CLI options.
$engine->withOptions([
    'r' => '150', 
]);

// Do not set the image type...
$engine->withOptions([
    '-png' => '', // Don't do this...!
]);
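If the options are built dynamically, one way to follow the rule above is to filter out image type switches before calling withOptions(). This is a sketch under assumptions: withoutImageTypeOptions() is a hypothetical helper, not part of this library, and the list of switches (png, jpeg, jpegcmyk, tiff, mono, gray) is assumed from common pdftoppm versions, so check the man page for yours.

```php
<?php
// Hypothetical helper, not part of this library: drop any option that selects
// the output image type, with or without a leading dash.
function withoutImageTypeOptions(array $options): array
{
    // Assumed list of pdftoppm output-format switches; verify against your version.
    $imageTypeFlags = ['png', 'jpeg', 'jpegcmyk', 'tiff', 'mono', 'gray'];

    $kept = [];
    foreach ($options as $flag => $value) {
        $name = ltrim((string) $flag, '-');
        if (in_array($name, $imageTypeFlags, true)) {
            continue;
        }
        $kept[$flag] = $value;
    }

    return $kept;
}

$options = withoutImageTypeOptions([
    'r' => '150',
    '-png' => '',
]);

print_r($options);
```

The filtered array can then be passed to $engine->withOptions() as in the first example.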

Output

The converter will return an Output object, which has the following methods:

getPath()
This method returns the path on disk where the images and original PDF have been saved. E.g.

echo $output->getPath();
// string(14) /tmp/j2io0caMA

getOriginalPdf($withPath = false)
This method returns the original PDF filename. Optionally it returns the full path to the file. E.g.

echo $output->getOriginalPdf();
// string(11) example.pdf

// Or with full path
echo $output->getOriginalPdf(true);
// string(26) /tmp/j2io0caMA/example.pdf

getSubsetPdf($withPath = false)
This method returns the subset PDF filename. This may be null if no subset PDF was created. A subset PDF is only created if the conversion was performed on a subset of pages. Optionally it returns the full path to the file. E.g.

echo $output->getSubsetPdf();
// string(18) example-subset.pdf

// Or with full path
echo $output->getSubsetPdf(true);
// string(33) /tmp/j2io0caMA/example-subset.pdf


getGeneratedImages($withPath = false)
This method returns all the generated image filenames. Optionally it returns the full path to the images. E.g.

// The return value is an array, so use var_dump() rather than echo.
var_dump($output->getGeneratedImages());
// array(2) [
//     string(13) example-1.jpg
//     string(13) example-2.jpg
// ]

// Or with full path
var_dump($output->getGeneratedImages(true));
// array(2) [
//     string(28) /tmp/j2io0caMA/example-1.jpg
//     string(28) /tmp/j2io0caMA/example-2.jpg
// ]
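If you need to know which page each image came from, the filenames can be indexed by page number. The sketch below is not part of this library: imagesByPage() is a hypothetical helper, and it assumes the <prefix>-<page_num>.<ext> naming shown in the examples above.

```php
<?php
// Hypothetical helper, not part of this library: index image filenames by the
// page number embedded in names such as example-2.jpg.
function imagesByPage(array $filenames): array
{
    $byPage = [];
    foreach ($filenames as $filename) {
        // Capture the digits between the last dash and the file extension.
        if (preg_match('/-(\d+)\.[A-Za-z0-9]+$/', $filename, $matches) === 1) {
            $byPage[(int) $matches[1]] = $filename;
        }
    }

    // Sort numerically by page so that page 10 comes after page 2.
    ksort($byPage);

    return $byPage;
}

// Stand-in for the array returned by $output->getGeneratedImages().
$images = ['example-10.jpg', 'example-2.jpg', 'example-1.jpg'];

foreach (imagesByPage($images) as $page => $file) {
    echo "page {$page}: {$file}" . PHP_EOL;
}
```

The same helper works on the full paths returned by getGeneratedImages(true), since only the end of each name is inspected.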

Converter

The Converter class ties the source and the engine together and runs the conversion process.

This class will perform the conversion of the PDF into images and, if the page specification is set, it will create another PDF with just the specified pages. The following methods are available:

process($imageType, $pages = [])
Run the conversion and save the images in the given file type. Optionally specify the pages you want saved. E.g.

$converter = new Converter($source, $engine); // $source and $engine are described above.

// This will convert the entire PDF into JPEGs
$output = $converter->process("jpg");

// This will convert only pages 2 & 4 of the PDF into PNGs
// Note that this will also create a subset PDF with just pages 2 & 4
$output = $converter->process("png", [2, 4]);

Putting it all together

This example assumes the following:

- The PDF has been read into a variable as a raw string from a locally saved PDF (example.pdf).
- We are using the pdftoppm binary to do the conversion.
- We want to save the PDF as credit-agreement.pdf.
- We want our images to be JPEGs.
- We want our images to be saved as credit-agreement-<page_num>.jpg.

use DividoFinancialServices\PdfToImg\EngineFactory;
use DividoFinancialServices\PdfToImg\Converter;
use DividoFinancialServices\PdfToImg\Sources\Buffer;

// Load a PDF into a string
$buffer = new Buffer(file_get_contents('example.pdf'), 'credit-agreement.pdf');

// Create the conversion engine type. In this example we are using the pdftoppm binary.
$engine = EngineFactory::GetEngine('pdftoppm-binary');

// Create a Converter with the source PDF and conversion engine. 
$converter = new Converter($buffer, $engine);

// Do the conversion (saving images to JPEG)
$output = $converter->process("jpg", [2,4,]);

// 2 images (pages 2 & 4) are now saved in a temp folder on the disk. 

// Get the list of image filenames on disk
$images = $output->getGeneratedImages();

// A subset PDF has been created because the pages were specified. The two-page PDF:
$subsetPdf = $output->getSubsetPdf();

// Do something with your images (upload to S3, etc.)
// When finished, perform a clean up to free up the disk space

$converter->cleanUp();
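If the work you do with the images can throw (a failed upload, for instance), the clean up step may be skipped and the temp folder left behind. A try/finally block is one way to guard against that. The sketch below uses a hypothetical runThenCleanUp() helper and stand-in closures so that it runs on its own; with the library you would pass closures that use $output and call $converter->cleanUp().

```php
<?php
// Hypothetical helper, not part of this library: run $work and always run
// $cleanUp afterwards, even when $work throws.
function runThenCleanUp(callable $work, callable $cleanUp)
{
    try {
        return $work();
    } finally {
        $cleanUp();
    }
}

// Stand-in closures that record what happened.
$log = [];

try {
    runThenCleanUp(
        function () use (&$log) {
            $log[] = 'work';
            throw new RuntimeException('upload failed');
        },
        function () use (&$log) {
            // With the library, this is where $converter->cleanUp() would go.
            $log[] = 'cleanup';
        }
    );
} catch (RuntimeException $e) {
    $log[] = 'caught: ' . $e->getMessage();
}

print_r($log);
```

The finally block runs before the exception reaches the outer catch, so the clean up happens whether or not the work succeeds.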