nilgems/laravel-textract

A Laravel package for extracting text from files such as DOC, Excel, image, PDF, and more. This package was inspired by the npm package "textract".

v1.1 2022-06-13 08:35 UTC

This package is auto-updated.

Last update: 2024-04-10 22:34:35 UTC



Laravel Textract

A Laravel package for extracting text from files such as DOC, Excel, image, PDF, and more.

Versions and compatibility

Supported file formats

The following file formats are currently supported. To work with all of them, the required extensions must be installed on your server. The package checks the MIME type of the file content before running the extraction.

  • HTML
  • TEXT
  • DOC
  • DOCX
  • XLS, XLSX, XLSM, XLTX, XLTM, XLT
  • CSV
  • PDF
  • Image
    • jpeg
    • png
    • gif
  • ODT
  • ODS
  • RTF
  • PPTX (NEW)
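
The MIME check mentioned above can be illustrated with PHP's fileinfo extension. This is a minimal sketch and not the package's actual implementation: the names detectMimeType, isAllowedMimeType, and ALLOWED_MIME_TYPES are hypothetical, and the allow-list covers only a few of the formats listed.

```php
<?php
// Sketch only: the package's real MIME check is not shown in this README.

// Hypothetical allow-list for illustration; not the package's own list.
const ALLOWED_MIME_TYPES = [
    'text/plain',
    'text/html',
    'application/pdf',
    'image/png',
    'image/jpeg',
    'image/gif',
];

/**
 * Detect the MIME type from the file content (not from the file extension).
 */
function detectMimeType(string $filePath): string
{
    if (!is_file($filePath) || !is_readable($filePath)) {
        throw new InvalidArgumentException("File not readable: {$filePath}");
    }

    // Requires ext-fileinfo.
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($filePath);

    if ($mime === false) {
        throw new RuntimeException("Could not detect MIME type: {$filePath}");
    }

    return $mime;
}

/**
 * Return true when the detected MIME type is on the allow-list.
 */
function isAllowedMimeType(string $filePath): bool
{
    return in_array(detectMimeType($filePath), ALLOWED_MIME_TYPES, true);
}
```

Because the type is read from the content, a text file renamed to example.png would still be reported as text rather than as an image.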

We are working hard to make this Laravel plugin useful. If you find an issue, please post it in the discussions.

Installation

composer require nilgems/laravel-textract

Once installed you can do stuff like this:

# Run the extractor
$output = Textract::run('/path/to/file.extension');

# Display the extracted text
echo $output->text;

# Display the extracted text word count
echo $output->word_count;

# Display the extracted text with direct string conversion
echo (string) $output;

Run the extractor on any supported file:

Textract::run(string $file_path, [string $job_id],[TesseractOcrOptions $extra_data]);

| Option | Type | Default value | Required | Description |
|---|---|---|---|---|
| $file_path | String | No default value | Yes | Absolute path of the file to extract text from. |
| $job_id | String | NULL | No | Extraction job ID. If left blank, the plugin creates the ID automatically. |
| $extra_data | TesseractOcrOptions | NULL | No | Extra parameters. When extracting an image file, you can specify languages and more through the Nilgems\PhpTextract\ExtractorService\Ocr\Contracts\TesseractOcrOptions parameter. |

Configuration

  • You can add the provider in app.php under the config folder of your Laravel project. This is optional; the package automatically loads the service provider in your application.
    'providers' => [
      ...
      Nilgems\PhpTextract\Providers\ServiceProvider::class,
      ...
    ]
    
  • Add the alias in app.php under the config folder of your Laravel project. This is optional; the package automatically loads the facade in your application.
    'aliases' => [
      ...
      'Textract' => Nilgems\PhpTextract\Textract::class,
      ...
    ]
    
  • To publish the config file, run:
    php artisan vendor:publish --tag=textract
    

Example

Example 1:

You can extract text from any supported file format.

For better performance, it is recommended to run the extractor in a Laravel queue job.

PHP restricts execution time and memory through the max_execution_time and memory_limit options in php.ini. If the file is large, the process may be killed when it exceeds either limit. You can use a queue (database or Redis) or Laravel Horizon to run the extraction in the background.

........
Route::get('/textract', function(){
    return Textract::run('/path/to/image/example.png');
});
........
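
The php.ini limits described in Example 1 can be inspected at runtime to decide whether a file should go to a queue. The sketch below is an assumption-laden illustration and not part of the package: iniSizeToBytes and shouldRunInBackground are hypothetical helpers, and the 25% threshold is an arbitrary heuristic. It only considers memory_limit, not max_execution_time.

```php
<?php
// Sketch only: these helpers are hypothetical and not part of laravel-textract.

/**
 * Convert a php.ini shorthand size such as "128M" or "1G" to bytes.
 * Returns -1 for "-1", which PHP uses to mean "no limit".
 */
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return -1;
    }

    $number = (int) $value;                    // "128M" casts to 128
    $unit = strtoupper(substr($value, -1));    // last character is the unit, if any

    switch ($unit) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number;                    // plain byte count
    }
}

/**
 * Heuristic (an assumption, not a package rule): queue the extraction when
 * the file is larger than a fraction of the configured memory limit.
 */
function shouldRunInBackground(int $fileSizeBytes, int $memoryLimitBytes, float $fraction = 0.25): bool
{
    if ($memoryLimitBytes === -1) {
        return false; // no memory limit configured
    }

    return $fileSizeBytes > $memoryLimitBytes * $fraction;
}

// Usage idea (hypothetical job class, shown as a comment only):
// $limit = iniSizeToBytes((string) ini_get('memory_limit'));
// if (shouldRunInBackground(filesize($path), $limit)) {
//     ExtractTextJob::dispatch($path);   // hypothetical queued job
// }
```

Inside a queued job the extraction call itself would stay the same; only the place where it runs changes.
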

Example 2:

To get better extraction output from an image file, you can specify the languages it contains.

........
Route::get('/textract', function(){
    return Textract::run('/path/to/image/example.png', null, [
      'lang' => ['eng', 'jpn', 'spa']
    ]);
});
........

Dependencies

  • To enable the image extraction feature you need to install Tesseract OCR
  • To enable the PDF extraction feature you need to install pdftotext
  • To work properly, your server must have the following PHP extensions installed:
    • ext-fileinfo
    • ext-zip
    • ext-gd or ext-imagick
    • ext-xml
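
The dependency list above can be verified with a small preflight script before using the package. This is a sketch under stated assumptions: missingPhpExtensions and binaryOnPath are hypothetical helper names, and the script assumes the tesseract and pdftotext binaries are expected on the PATH.

```php
<?php
// Sketch only: a hypothetical preflight check, not shipped with the package.

/**
 * Return the names of required PHP extensions that are not loaded.
 */
function missingPhpExtensions(): array
{
    $missing = [];

    foreach (['fileinfo', 'zip', 'xml'] as $extension) {
        if (!extension_loaded($extension)) {
            $missing[] = "ext-{$extension}";
        }
    }

    // Either gd or imagick satisfies the image requirement.
    if (!extension_loaded('gd') && !extension_loaded('imagick')) {
        $missing[] = 'ext-gd or ext-imagick';
    }

    return $missing;
}

/**
 * Check whether an executable with the given name exists on the PATH.
 */
function binaryOnPath(string $binary): bool
{
    $path = getenv('PATH');
    if ($path === false || $path === '') {
        return false;
    }

    // On Windows, executables usually carry a suffix.
    $suffixes = PHP_OS_FAMILY === 'Windows' ? ['.exe', '.bat', '.cmd', ''] : [''];

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($suffixes as $suffix) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $binary . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return true;
            }
        }
    }

    return false;
}

// Usage idea:
// print_r(missingPhpExtensions());
// var_dump(binaryOnPath('tesseract'), binaryOnPath('pdftotext'));
```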

Tesseract OCR Installation

Ubuntu

  • Update the system: sudo apt update
  • Add Tesseract OCR 5 PPA to your system: sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
  • Install Tesseract on Ubuntu 20.04 | 18.04: sudo apt install -y tesseract-ocr
  • Once the installation is complete, update your system: sudo apt update
  • Verify the installation: tesseract --version

Windows

  • There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
  • Choco installation: choco install capture2text --version 5.0

Note: Recent versions of Capture2Text have stopped shipping the tesseract binary.

PdfToText Installation

Ubuntu

  • Update the system: sudo apt update
  • Install PdfToText on Ubuntu 20.04 | 18.04: sudo apt-get install poppler-utils
  • Verify the installation: pdftotext -v

Windows

License

MIT
