bryank-ac / pdftohtml-php
PDF to HTML converter with PHP using Poppler-utils
Requires
- php: ^7.1.3
- illuminate/config: 5.6.*|5.7.*|5.8.*
- illuminate/filesystem: 5.6.*|5.7.*|5.8.*
- symfony/process: ^4.2
- thesoftwarefanatics/php-html-parser: ^1.8.1
Requires (Dev)
- php-coveralls/php-coveralls: ^2.1.0
- phpunit/phpunit: ^7.5|^8.0
README
A simple class for converting PDF files into HTML documents. This package was forked from the original maintainer's repository. Because the original has since been abandoned, I decided to migrate the package and port it so that it can be used in PHP 7.2+ environments.
Inspiration from garrensweet
PDF to HTML PHP Class
This class lets you use PHP and poppler-utils to convert your PDF files to HTML files.
Important Notes
Please read the usage section below: this package has been substantially upgraded and several things in it have changed.
Installation
From your application's directory, run this command to add the package to your app:
composer require bryank-ac/pdftohtml-php
Or add this package to the require section of your composer.json:
{ "require": { "bryank-ac/pdftohtml-php": "~2" } }
Requirements
- Poppler-Utils
- Ubuntu: install it with apt
sudo apt-get install poppler-utils
- macOS: install it with brew (see the OS/X notes section below)
brew install poppler
- PHP Configuration with shell access enabled
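The shell access requirement can be checked before any conversion is attempted. symfony/process, listed in the requirements above, starts external programs through proc_open, so a minimal sketch of such a check could look like the following. The helper name isFunctionUsable is hypothetical and not part of this package.

```php
<?php
// Hypothetical helper (not part of this package): reports whether a PHP
// function exists and is not listed in the disable_functions ini setting.
function isFunctionUsable(string $name, ?string $disabled = null): bool
{
    // Fall back to the live ini value when no list is passed in.
    $disabled = $disabled ?? (string) ini_get('disable_functions');

    // Split the comma-separated list and drop empty entries.
    $blocked = array_filter(array_map('trim', explode(',', $disabled)), 'strlen');

    return function_exists($name) && !in_array($name, $blocked, true);
}

if (!isFunctionUsable('proc_open')) {
    trigger_error('proc_open is unavailable; Poppler binaries cannot be started.', E_USER_WARNING);
}
```

Passing the disabled list as a second argument is only there to make the helper easy to try out; in normal use the single-argument form reads the current PHP configuration.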
Usage
Here is an example.
<?php
// if you are using composer, just use this
// not needed if your framework is already autoloading
include 'vendor/autoload.php';

// initiate
$pdf = new AccuCloud\PdfToHtml\Pdf('file.pdf');

// convert to html string
$html = $pdf->html();

// convert a specific page to html string
$page = $pdf->html(3);

// convert to html and return it as a Dom object (https://github.com/paquettg/php-html-parser)
$dom = $pdf->getDom();

// get the total number of pages in your pdf
$total_pages = $pdf->getPages();

// if your pdf has more than one page, use this to change the current page to page 3
$dom->goToPage(3);

// then you can do as you please with that dom and find any element you want
$paragraphs = $dom->find('body > p');

// change pdftohtml bin location
\AccuCloud\PdfToHtml\Config::set('pdftohtml.bin', '/usr/local/bin/pdftohtml');

// change pdfinfo bin location
\AccuCloud\PdfToHtml\Config::set('pdfinfo.bin', '/usr/local/bin/pdfinfo');
?>
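To convert every page individually, the page count and the per-page conversion from the example can be combined in a loop. The sketch below uses a hypothetical helper, collectPages, and a stand-in renderer so that it runs without Poppler installed. It assumes pages are numbered from 1, which the example suggests but does not state.

```php
<?php
// Hypothetical helper: calls $renderPage for pages 1..$totalPages and
// returns the results keyed by page number (1-based numbering is assumed).
function collectPages(int $totalPages, callable $renderPage): array
{
    $pages = [];
    for ($page = 1; $page <= $totalPages; $page++) {
        $pages[$page] = $renderPage($page);
    }

    return $pages;
}

// Stand-in renderer so the sketch runs on its own.
$pages = collectPages(3, function (int $page): string {
    return "<p>page {$page}</p>";
});
```

With the package loaded, the wiring would presumably be collectPages($pdf->getPages(), function ($n) use ($pdf) { return $pdf->html($n); }).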
Passing options to getDom
By default, getDom() extracts all images and creates an HTML file per page. You can pass options when extracting HTML:
<?php $pdfDom = $pdf->getDom(['ignoreImages' => true]);
Available Options
- singlePage, default: false
- imageJpeg, default: false
- ignoreImages, default: false
- zoom, default: 1.5
- noFrames, default: true
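Passing only some options presumably leaves the rest at their defaults. How the package merges them internally is not shown here, so the following is only an illustrative sketch: the constant DEFAULT_OPTIONS and the helper resolveOptions are hypothetical names, and rejecting unknown keys is an added safeguard rather than documented behaviour.

```php
<?php
// Defaults copied from the list of available options.
const DEFAULT_OPTIONS = [
    'singlePage'   => false,
    'imageJpeg'    => false,
    'ignoreImages' => false,
    'zoom'         => 1.5,
    'noFrames'     => true,
];

// Hypothetical helper: overlays caller options on the defaults and rejects
// unknown keys so that typos do not pass silently.
function resolveOptions(array $options): array
{
    $unknown = array_diff_key($options, DEFAULT_OPTIONS);
    if ($unknown !== []) {
        throw new InvalidArgumentException(
            'Unknown option(s): ' . implode(', ', array_keys($unknown))
        );
    }

    return array_merge(DEFAULT_OPTIONS, $options);
}

$resolved = resolveOptions(['ignoreImages' => true, 'zoom' => 2.0]);
```

Here $resolved keeps singlePage, imageJpeg and noFrames at their defaults while ignoreImages and zoom take the caller's values.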
Usage note for Windows Users
For those who need this package on Windows, there is a way. First, download the latest poppler-utils binary for Windows from http://blog.alivate.com.au/poppler-windows/.
After downloading it, extract it. There will be a directory called bin, which is the one we need. Then change your code like this:
<?php
// if you are using composer, just use this
// not needed if your framework is already autoloading
include 'vendor/autoload.php';

use AccuCloud\PdfToHtml\Config;

// change pdftohtml bin location
Config::set('pdftohtml.bin', 'C:/poppler-0.37/bin/pdftohtml.exe');

// change pdfinfo bin location
Config::set('pdfinfo.bin', 'C:/poppler-0.37/bin/pdfinfo.exe');

// initiate
$pdf = new AccuCloud\PdfToHtml\Pdf('file.pdf');

// convert to html and return it as a Dom object (https://github.com/paquettg/php-html-parser)
// note: html() returns a string, so getDom() is used here
$dom = $pdf->getDom();

// get the total number of pages in your pdf
$total_pages = $pdf->getPages();

// if your pdf has more than one page, use this to change the current page to page 3
$dom->goToPage(3);

// then you can do as you please with that dom and find any element you want
$paragraphs = $dom->find('body > p');
?>
Usage note for OS/X Users
Thanks to @kaleidoscopique for trying this package and getting it to run on OS/X.
1. Install brew
Brew is a popular package manager for OS/X (aptitude style): http://brew.sh/
2. Install poppler
brew install poppler
3. Verify the path of pdfinfo and pdftohtml
$ which pdfinfo
/usr/local/bin/pdfinfo
$ which pdftohtml
/usr/local/bin/pdftohtml
4. Whatever the paths are, use AccuCloud\PdfToHtml\Config::set to set them in your PHP code. Use the same paths as the ones given by the which command:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';

// change pdftohtml bin location
\AccuCloud\PdfToHtml\Config::set('pdftohtml.bin', '/usr/local/bin/pdftohtml');

// change pdfinfo bin location
\AccuCloud\PdfToHtml\Config::set('pdfinfo.bin', '/usr/local/bin/pdfinfo');

// initiate
$pdf = new AccuCloud\PdfToHtml\Pdf('file.pdf');

// convert to html string
$html = $pdf->html();
?>
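The path lookup in step 3 can also be done from PHP instead of being hardcoded. The sketch below is a rough equivalent of the which command; findBinary is a hypothetical helper, not something this package provides.

```php
<?php
// Hypothetical helper: searches a PATH-style string for an executable file,
// roughly what `which` does. Returns null when nothing is found.
function findBinary(string $name, ?string $path = null): ?string
{
    // Fall back to the PATH environment variable when no list is passed in.
    $path = $path ?? (string) getenv('PATH');

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue;
        }
        $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

$pdftohtml = findBinary('pdftohtml');
$pdfinfo   = findBinary('pdfinfo');
```

When the results are not null, they could be passed to Config::set('pdftohtml.bin', $pdftohtml) and Config::set('pdfinfo.bin', $pdfinfo) as in the example above. On Windows the file names would need the .exe suffix.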
Feedback & Contribute
Open an issue for improvements or anything buggy. I love to help and solve other people's problems. Thanks 👍