cloudbluedigital / pdftohtml
PDF to HTML converter with PHP using Poppler-utils
Requires
- illuminate/config: ~5
- symfony/filesystem: ^4.2
- symfony/process: ^4.2
- thesoftwarefanatics/php-html-parser: ^1.8.0
Requires (Dev)
- phpunit/phpunit: ~4
- satooshi/php-coveralls: ^1.0
This package is not auto-updated.
Last update: 2025-03-03 20:06:26 UTC
README
PDF to HTML PHP Class
A simple class for converting PDF files into HTML documents. This package was forked from the original maintainer's repository. Because the original has since been abandoned, I've migrated the package and ported it so that it can be used in PHP 7.1+ environments.
Installation
composer require garrensweet/pdftohtml-php
Or add this package to the require section of your composer.json
{
    "require": {
        "garrensweet/pdftohtml-php": "^2.1.0"
    }
}
Requirements
You must install the poppler-utils package on your system. You must also make sure that the user who owns the poppler-utils binaries matches your Nginx user; otherwise you will not be able to access this package.
Before instantiating the Pdf class, you will need to tell the library where your binaries are located. Without this, the default fallback will be used (which is likely incorrect for most people) and you will receive a generic error. You may do this by using the Config::set method.
Note: The Config class is the same repository implementation that Laravel uses.
\Gswits\PdfToHtml\Config::set('pdftohtml.bin', '/usr/local/bin/pdftohtml');
\Gswits\PdfToHtml\Config::set('pdfinfo.bin', '/usr/local/bin/pdfinfo');
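Because a wrong path only surfaces as a generic error, it can help to check each binary before handing it to Config::set. The following is a minimal sketch and not part of the library: the helper name requireExecutable is hypothetical, and the Config::set calls appear only in comments so that the snippet runs without the package installed.

```php
<?php
/**
 * Hypothetical helper (not part of the library): return $path unchanged if it
 * points to an executable file, otherwise throw with a specific message.
 */
function requireExecutable(string $path): string
{
    if (!is_file($path)) {
        throw new RuntimeException("Binary not found: {$path}");
    }
    if (!is_executable($path)) {
        throw new RuntimeException("Binary is not executable by the current user: {$path}");
    }
    return $path;
}

// Intended use with the calls shown above (commented out to keep the sketch standalone):
// \Gswits\PdfToHtml\Config::set('pdftohtml.bin', requireExecutable('/usr/local/bin/pdftohtml'));
// \Gswits\PdfToHtml\Config::set('pdfinfo.bin', requireExecutable('/usr/local/bin/pdfinfo'));
```

Run under the same user as your web server, this check also reveals the ownership mismatch described in the requirements.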
Usage
Having set up your poppler-utils package and provided the binary locations to the library, you can proceed with the following:
WARNING! If you're not working in an environment that automatically loads Composer's autoloader, you will need to do so manually by adding include 'vendor/autoload.php'; at the top of your file. If you're in Laravel, you do not need this.
An example use case follows:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';
// initiate
$pdf = new Gswits\PdfToHtml\Pdf('file.pdf');
// convert to html string
$html = $pdf->html();
// convert a specific page to html string
$page = $pdf->html(3);
// convert to html and return it as a Dom object (https://github.com/thesoftwarefanatics/php-html-parser)
$dom = $pdf->getDom();
// check how many pages your pdf has
$total_pages = $pdf->getPages();
// if your pdf has more than one page, this changes the current page to page 3
$dom->goToPage(3);
// then you can do as you please with that dom, e.g. find any element you want
$paragraphs = $dom->find('body > p');
?>
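To process every page rather than a single one, the calls above can be combined in a loop. This is a sketch under assumptions: it assumes getPages() returns the page count as an integer and that pages are numbered from 1. The helper renderAllPages is hypothetical and takes the renderer as a callable so that the snippet runs without the package.

```php
<?php
/**
 * Hypothetical helper: call $render once per page (1-based) and collect the
 * results in an array keyed by page number.
 */
function renderAllPages(int $totalPages, callable $render): array
{
    $pages = [];
    for ($page = 1; $page <= $totalPages; $page++) {
        $pages[$page] = $render($page);
    }
    return $pages;
}

// With the library this might look like (commented out to keep the sketch standalone):
// $pdf   = new Gswits\PdfToHtml\Pdf('file.pdf');
// $pages = renderAllPages($pdf->getPages(), function (int $n) use ($pdf) {
//     return $pdf->html($n);
// });
```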
Passing options to getDom
By default, getDom() will extract all of the images contained in the PDF. If you do not wish to keep the images, you can specify this option prior to calling $pdf->html() to generate your HTML document.
<?php
$pdfDom = $pdf->getDom(['ignoreImages' => true]);
Available Options
Additionally, you may pass several arguments to the Pdf constructor. These arguments are passed as flags to the underlying pdftohtml binary. You can view the man page for a full list of options.
- singlePage, default: false
- imageJpeg, default: false
- ignoreImages, default: false
- zoom, default: 1.5
- noFrames, default: true
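To illustrate how such options could end up as command-line flags, here is a rough sketch. It is not the library's implementation: the function buildPdftohtmlFlags is hypothetical, and the pairing of each option with a pdftohtml flag (-s, -fmt, -i, -zoom, -noframes) is an assumption based on the pdftohtml man page. Only the option names and defaults come from the list above.

```php
<?php
/**
 * Hypothetical mapping from the option names listed above to pdftohtml flags.
 * The defaults mirror the list; the flag chosen for each option is an assumption.
 */
function buildPdftohtmlFlags(array $options = []): array
{
    // Caller-supplied options override the documented defaults.
    $options = array_merge([
        'singlePage'   => false,
        'imageJpeg'    => false,
        'ignoreImages' => false,
        'zoom'         => 1.5,
        'noFrames'     => true,
    ], $options);

    $flags = [];
    if ($options['singlePage']) {
        $flags[] = '-s';                   // single document containing all pages
    }
    if ($options['imageJpeg']) {
        array_push($flags, '-fmt', 'jpg'); // image output format
    }
    if ($options['ignoreImages']) {
        $flags[] = '-i';                   // ignore images
    }
    array_push($flags, '-zoom', (string) $options['zoom']);
    if ($options['noFrames']) {
        $flags[] = '-noframes';            // do not generate frames
    }
    return $flags;
}
```

Returning the flags as a list of separate arguments, rather than one concatenated string, avoids shell-quoting problems if they are later handed to a process runner.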
Usage note for Windows Users
For those who need this package on Windows, there is a way. First, download the latest poppler-utils binary for Windows from http://blog.alivate.com.au/poppler-windows/.
After downloading it, extract the archive. It contains a directory called bin, which holds the binaries we need. Then change your code like this:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';
use Gswits\PdfToHtml\Config;
// change pdftohtml bin location
Config::set('pdftohtml.bin', 'C:/poppler-0.37/bin/pdftohtml.exe');
// change pdfinfo bin location
Config::set('pdfinfo.bin', 'C:/poppler-0.37/bin/pdfinfo.exe');
// initiate
$pdf = new Gswits\PdfToHtml\Pdf('file.pdf');
// convert to html and return it as a Dom object (https://github.com/thesoftwarefanatics/php-html-parser)
$dom = $pdf->getDom();
// check how many pages your pdf has
$total_pages = $pdf->getPages();
// if your pdf has more than one page, this changes the current page to page 3
$dom->goToPage(3);
// then you can do as you please with that dom, e.g. find any element you want
$paragraphs = $dom->find('body > p');
?>
Usage note for OS X Users
Thanks to @kaleidoscopique for trying this package on OS X and getting it to run.
1. Install brew
Brew is a popular package manager for OS X (similar in style to aptitude): http://brew.sh/
2. Install poppler
brew install poppler
3. Verify the path of pdfinfo and pdftohtml
$ which pdfinfo
/usr/local/bin/pdfinfo
$ which pdftohtml
/usr/local/bin/pdftohtml
4. Whatever the paths are, use Gswits\PdfToHtml\Config::set to set them in your PHP code. Use the same paths as the ones given by the which command:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';
// change pdftohtml bin location
\Gswits\PdfToHtml\Config::set('pdftohtml.bin', '/usr/local/bin/pdftohtml');
// change pdfinfo bin location
\Gswits\PdfToHtml\Config::set('pdfinfo.bin', '/usr/local/bin/pdfinfo');
// initiate
$pdf = new Gswits\PdfToHtml\Pdf('file.pdf');
// convert to html string
$html = $pdf->html();
?>
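If you would rather not hard-code the output of which, the lookup can be approximated in PHP. This is a minimal sketch, not part of the library: findOnPath is a hypothetical helper that scans the directories in the PATH environment variable, and the Config::set calls are shown in comments only.

```php
<?php
/**
 * Hypothetical helper: look for an executable called $name in each directory
 * of $pathEnv (defaults to the PATH environment variable), similar to `which`.
 * Returns the first match, or null if none is found.
 */
function findOnPath(string $name, ?string $pathEnv = null): ?string
{
    $pathEnv = $pathEnv ?? (string) getenv('PATH');
    foreach (explode(PATH_SEPARATOR, $pathEnv) as $dir) {
        if ($dir === '') {
            continue; // skip empty PATH entries
        }
        $candidate = rtrim($dir, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Intended use (commented out so the sketch runs without the package);
// check for null before passing the result on:
// $bin = findOnPath('pdftohtml');
// if ($bin !== null) {
//     \Gswits\PdfToHtml\Config::set('pdftohtml.bin', $bin);
// }
```

The symfony/process package, which this package already requires, ships an ExecutableFinder class that serves a similar purpose.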
Feedback & Contribute
Open an issue to suggest an improvement or report a bug. I'm happy to help solve other people's problems. Thanks :+1: