gruz/gvc-pdf2pdf

Overlay text from Google Vision Cloud (OCR) over the original PDF-as-image

dev-master 2021-09-21 22:33 UTC

This package is auto-updated.

Last update: 2024-04-22 04:14:48 UTC


README

When needed

You have a PDF as image. You have recognized the text by Google vision cloud (gvc) OCR service. And now want to apply the text to the pdf to make it searchable.

Note! This too makes sense only for PDFs recognized with GVC. For images recognized by GVC you can use better toosls like [https://github.com/dinosauria123/gcv2hocr/]. You may also try to use [https://github.com/mah-jp/pdf4search]. I failed to learn Japaneese.

Problems

It's not possible to use the same font and the same size of text so the text is added approximately to the same place, where it's on image. Do not expect 100% good resut.

Quick Examples

In exmaples folder you can find in pair of files (PDF-as-image and generater by Google vision cloud JSON) and a couple of out files (for production with hidden text and for debug with visible text).

To get the idea see a video demonstrating results [https://streamable.com/r2gyk9]

Installation

Surely you need PHP.

To use in your project:

composer require gruz/gvc-pdf2pdf

OR

git clone git@github.com:gruz/gcv-pdf2pdf.git
cd gcv-pdf2pdf
composer install

Usage

Check the code, usage is obvious. For quick start:

use GvcPdf2Pdf\PdfTextApply;

$jsonFile = __DIR__ .'/01-in-google-vision-cloud.json';
$pdfSourceFile = __DIR__ .'/01-in-pdf-as-image.pdf';

$pdfOutFile1 = __DIR__ .'/01-out-pdf-with-text-hidden.pdf';
$pdf = new PdfTextApply($jsonFile, $pdfSourceFile, $pdfOutFile1);
$pdf->watermark->text = 'gruz.ml';
$pdf->run();

Running example

Singe file

Go to the project folder if not there and run

php example/example.php

This will (renerate) example/01-out-pdf-with-text-hidden.pdf and example/01-out-pdf-with-text-visible-for-debug.pdf files.

Folder

Go to the project folder if not there and run

php example/example-folder.php

It uses files pairs placed in example/in/ and generates output to example/out/

As a result you get regenerated out pdfs.