tranvantri / pdf-to-html
This PHP class converts PDF files to HTML using poppler-utils.
Requires
- php: >=8.0
- ext-dom: *
- ext-fileinfo: *
- ext-json: *
- pelago/emogrifier: ^v7.2
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.22
- laravel/framework: ^10.0
- mockery/mockery: ~1.0
- orchestra/testbench: ^v8.17.0
- phpstan/phpstan: ^0.12.14
- phpunit/phpunit: ^9.6
This package is auto-updated.
Last update: 2025-04-13 19:38:05 UTC
README
Thanks
Big thanks to Mochamad Gufron (mgufrone)! This package is based on his package (https://github.com/mgufrone/pdf-to-html).
Important Notes
Please see the Usage section below for how to use the package.
Installation
composer require tranvantri/pdf-to-html
Requirements
1. Install Poppler-Utils
Debian/Ubuntu
sudo apt-get install poppler-utils
Mac OS X
brew install poppler
Windows
To use this package on Windows, first download the latest poppler-utils binary for Windows from http://blog.alivate.com.au/poppler-windows/.
After downloading the archive, extract it.
2. Find out where the utilities are located
Debian/Ubuntu
$ whereis pdftohtml
pdftohtml: /usr/bin/pdftohtml
$ whereis pdfinfo
pdfinfo: /usr/bin/pdfinfo
Mac OS X
$ which pdfinfo
/usr/local/bin/pdfinfo
$ which pdftohtml
/usr/local/bin/pdftohtml
Windows
Go to the extracted directory. It contains a directory called bin, which holds the utilities. This is the one we need.
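The lookup above can also be done from PHP. The following is a minimal sketch, assuming the binaries live in one of the directories shown in the whereis/which output. The helper findPopplerBinary() is hypothetical and not part of this package.

```php
<?php

/**
 * Return the first executable file named $binary found in $directories,
 * or null when none of the candidate directories contains it.
 * (Hypothetical helper, not part of the package.)
 */
function findPopplerBinary(string $binary, array $directories): ?string
{
    foreach ($directories as $directory) {
        // Normalise a trailing slash so both "/usr/bin" and "/usr/bin/" work.
        $candidate = rtrim($directory, '/\\') . DIRECTORY_SEPARATOR . $binary;
        if (is_file($candidate) && is_executable($candidate)) {
            return $candidate;
        }
    }

    return null;
}

// Directories taken from the whereis/which output above; adjust as needed.
$directories = ['/usr/bin', '/usr/local/bin'];

$pdftohtmlPath = findPopplerBinary('pdftohtml', $directories);
$pdfinfoPath   = findPopplerBinary('pdfinfo', $directories);

if ($pdftohtmlPath === null || $pdfinfoPath === null) {
    echo 'poppler-utils binaries were not found in the listed directories' . PHP_EOL;
}
```

On Windows, pass the extracted bin directory and the file names with the .exe suffix (pdftohtml.exe, pdfinfo.exe) instead.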
3. Configure PHP with shell access enabled
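A quick way to check this requirement is to see whether the process-execution functions are disabled in php.ini. This README does not say which function the package uses to call the binaries, so the list below (exec, shell_exec, proc_open) is an assumption covering the common ones. isFunctionUsable() is a hypothetical helper.

```php
<?php

/**
 * Report whether $function exists and is not listed in the
 * disable_functions directive of php.ini.
 * (Hypothetical helper, not part of the package.)
 */
function isFunctionUsable(string $function): bool
{
    // disable_functions is a comma-separated list; it is empty when nothing is disabled.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

    return function_exists($function) && !in_array($function, $disabled, true);
}

// Assumed candidates; the package may rely on only one of them.
foreach (['exec', 'shell_exec', 'proc_open'] as $function) {
    echo $function . ': ' . (isFunctionUsable($function) ? 'available' : 'disabled') . PHP_EOL;
}
```

If a function is reported as disabled, remove it from disable_functions in php.ini and restart PHP.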
Usage
Example:
<?php
// if you are using composer, just use this
include 'vendor/autoload.php';

// initiate
$pdf = new \TvT\PdfToHtml\Pdf('test.pdf', [
    'pdftohtml_path' => '/usr/bin/pdftohtml',
    'pdfinfo_path' => '/usr/bin/pdfinfo'
]);

// example for windows
// $pdf = new \TvT\PdfToHtml\Pdf('test.pdf', [
//     'pdftohtml_path' => '/path/to/poppler/bin/pdftohtml.exe',
//     'pdfinfo_path' => '/path/to/poppler/bin/pdfinfo.exe'
// ]);

// get pdf info
$pdfInfo = $pdf->getInfo();

// get the number of pages
$countPages = $pdf->countPages();

// get content from one page
$contentFirstPage = $pdf->getHtml()->getPage(1);

// get content from all pages and loop over them
foreach ($pdf->getHtml()->getAllPages() as $page) {
    echo $page . '<br/>';
}
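Instead of echoing the pages, you may want to store each one in its own file. The sketch below assumes that getAllPages() yields values that can be cast to HTML strings, as the echo in the example suggests. savePages() and the page-<n>.html naming are hypothetical and not part of this package.

```php
<?php

/**
 * Write each page's HTML to "<directory>/page-<n>.html" and return the written paths.
 * $pages is any iterable of HTML strings, numbered from 1 in iteration order.
 * (Hypothetical helper, not part of the package.)
 */
function savePages(iterable $pages, string $directory): array
{
    // Create the target directory if it does not exist yet.
    if (!is_dir($directory) && !mkdir($directory, 0775, true) && !is_dir($directory)) {
        throw new RuntimeException("Cannot create directory: {$directory}");
    }

    $written = [];
    $number  = 0;
    foreach ($pages as $html) {
        $number++;
        $path = $directory . DIRECTORY_SEPARATOR . "page-{$number}.html";
        if (file_put_contents($path, (string) $html) === false) {
            throw new RuntimeException("Cannot write file: {$path}");
        }
        $written[] = $path;
    }

    return $written;
}

// With the package this might be called as follows (assumption about the return type):
// $files = savePages($pdf->getHtml()->getAllPages(), __DIR__ . '/output');
```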
Full list of settings:
<?php
$fullSettings = [
    'pdftohtml_path' => '/usr/bin/pdftohtml', // path to pdftohtml
    'pdfinfo_path' => '/usr/bin/pdfinfo', // path to pdfinfo

    'generate' => [ // settings for generating html
        'singlePage' => false, // we want separate pages
        'imageJpeg' => false, // we want png image
        'ignoreImages' => false, // we need images
        'zoom' => 1.5, // scale pdf
        'noFrames' => false, // we want separate pages
    ],

    'clearAfter' => true, // auto clear output dir (if removeOutputDir==false then output dir will remain)
    'removeOutputDir' => true, // remove output dir
    'outputDir' => '/tmp/'.uniqid(), // output dir

    'html' => [ // settings for processing html
        'inlineCss' => true, // replaces css classes with inline css rules
        'inlineImages' => true, // looks for images in html and replaces the src attribute with base64-encoded data
        'onlyContent' => true, // takes only the body content from the html
    ]
];
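When only a few values need to change, one option is to start from the full list above and overlay the differences with PHP's array_replace_recursive(). This is a sketch only. The README does not state whether the package fills in missing keys itself, so the full array is built explicitly here. The override values are illustrative.

```php
<?php

// Values copied from the full settings list above.
$base = [
    'pdftohtml_path' => '/usr/bin/pdftohtml',
    'pdfinfo_path'   => '/usr/bin/pdfinfo',
    'generate' => [
        'singlePage'   => false,
        'imageJpeg'    => false,
        'ignoreImages' => false,
        'zoom'         => 1.5,
        'noFrames'     => false,
    ],
    'clearAfter'      => true,
    'removeOutputDir' => true,
    'outputDir'       => '/tmp/' . uniqid(),
    'html' => [
        'inlineCss'    => true,
        'inlineImages' => true,
        'onlyContent'  => true,
    ],
];

// Only the values that should differ (illustrative choices).
$overrides = [
    'generate' => ['zoom' => 2.0, 'ignoreImages' => true],
    'html'     => ['inlineImages' => false],
];

// array_replace_recursive() keeps every key from $base and replaces
// only the keys present in $overrides, at any nesting depth.
$settings = array_replace_recursive($base, $overrides);

// $pdf = new \TvT\PdfToHtml\Pdf('test.pdf', $settings);
```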