joest8/pdfinterpreter

This class is designed to convert multiple PDF files, whether image-based or text-based, into an array of data.The class uses user-defined templates containing regular expressions to control the data extraction process, allowing for customized and flexible output.

v1.0 2023-11-05 19:04 UTC

This package is auto-updated.

Last update: 2025-07-06 00:07:20 UTC


README

Introduction

This class is designed to convert multiple PDF files, whether image-based or text-based, into an array of data. The class uses user-defined templates containing regular expressions to control the data extraction process, allowing for customized and flexible output.

Table of Contents

This README is divided into several sections:

Installation

composer require joest8/pdfinterpreter

Console Applications

To use this class, you'll need to install the following applications:

  1. Poppler (necessary to convert pdf to text and get information about number of pages in file)
  2. Tesseract (necessary to read and interpret png file)
  3. ImageMagick (necessary to convert pdf->png)

Make sure you have a package-manager installed on your system.

Automated installation

Run the following code from the source folder to autoinstall all dependencies and tesseract language files:

php install/install_dependencies.php

Manual installation with homebrew

If homebrew is installed run the following commands to install the Homebrew packages:

brew install poppler tesseract imagemagick

Manual installation of Tesseract Language Files

You also need to install the required Tesseract language files. You can check the available languages at: https://github.com/tesseract-ocr/tessdata_best/

Download the necessary language files and place them in the appropriate directory. To find the directory use:

tesseract --list-langs

Usage

Create Object

<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);

Get Sample Output

Using the get_sample_output-Method will allow you to get a sample of a text output without any interpretation of patterns.

<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);

print_r($pdf->get_sample_output());

Set new template

Using the add_new_template-Method will help you to create a new template. For more informations about the demanded parameters read the DocBloc of the method.

<?php
require_once '../vendor/autoload.php';

use PdfInterpreter\PdfInterpreter;

//get path from terminal: 'echo $PATH'
$path_env = "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/homebrew/bin:/opt/homebrew/bin";
$pdf = new PdfInterpreter($path_env);

$pdf->add_new_template("sample","Sample","/[Cc]ompany[\W]?[Aa][Bb][Cc]/","1","eng");

Add pattern to template

Using the add_pattern_to_template-Method will help you to add a new pattern to an existing template. For more informations about the demanded parameters read the DocBloc of the method.

$pdf->add_pattern_to_template("sample","invoice_no","/INVOICE # *([\d]*)/","1");
$pdf->add_pattern_to_template("sample","date","/INVOICE DATE *([\d]{2}.[\d]{2}.[\d]{4})/","1");
$pdf->add_pattern_to_template("sample","positions","/([\d]{1,4}) *(.*?) *([\d]{1,8},[\d]{2}) *([\d]{1,8},[\d]{2})/m","a",true,['pieces','item','price','amount']);

Get Template

Using the get_template-Method will return the entire template. For more informations about the demanded parameters read the DocBloc of the method.

print_r($pdf->get_template("sample"));

Delete Template

Using the delete_template-Method will delete the entire template. For more informations about the demanded parameters read the DocBloc of the method.

print_r($pdf->delete_template("sample"));

Convert Files from Folder

Using the convert_folder-Method will convert all files from a folder into an array of data. For more informations about the demanded parameters read the DocBloc of the method.

print_r(print_r($pdf->convert_folder("/../docs/",true,false,ocr_lang: "eng")));

Convert File

Using the convert_file-Method will convert a single file into an array of data. For more informations about the demanded parameters read the DocBloc of the method.

print_r($pdf->convert_file("/../docs/sample-bill.pdf",true,false));