Laravel Textract

A Laravel package for extracting text from files such as DOC, Excel, image, PDF and more.

Versions and compatibility

Laravel 10 or higher is required.
PHP 8.2 or higher is required.

Supported file formats

The following file formats are currently supported. To work with all of them, you need to install the required extensions on your server. The package checks the MIME type of the file content before running the extraction.

HTML
TEXT
DOC
DOCX
XLS, XLSX, XLSM, XLTX, XLTM, XLT
CSV
PDF
Image
- jpeg
- png
- gif
ODT
ODS
RTF
PPTX (NEW)
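
The content-based MIME check mentioned above can be approximated in plain PHP with ext-fileinfo. This is a minimal sketch, not the package's own code: `detectMimeType()`, `isAllowedMimeType()` and the `EXAMPLE_ALLOWED_MIME_TYPES` list are hypothetical names used for illustration, and the list covers only a few of the formats above.

```php
<?php
// Hypothetical helper (not part of the package): detect a file's MIME type
// from its content, not its extension, using ext-fileinfo.
function detectMimeType(string $path): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("File not readable: {$path}");
    }

    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);

    if ($mime === false) {
        throw new RuntimeException("Could not detect MIME type for: {$path}");
    }

    return $mime;
}

// Illustrative allow-list only; the package's real list may differ.
const EXAMPLE_ALLOWED_MIME_TYPES = [
    'text/plain',
    'text/html',
    'application/pdf',
    'image/png',
    'image/jpeg',
    'image/gif',
];

// True when the detected type is on the illustrative allow-list.
function isAllowedMimeType(string $path): bool
{
    return in_array(detectMimeType($path), EXAMPLE_ALLOWED_MIME_TYPES, true);
}
```

Calling `isAllowedMimeType('/path/to/file.extension')` before starting an extraction lets you reject a renamed or mislabeled file early, because the decision is based on the file content rather than its name.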

We are working hard to make this Laravel package useful. If you find an issue, please post it in the discussions.

Installation

```
composer require nilgems/laravel-textract
```

Once installed, you can use it like this:

```
# Run the extractor
$output = Textract::run('/path/to/file.extension');

# Display the extracted text
echo $output->text;

# Display the extracted text word count
echo $output->word_count;

# Display the extracted text with direct string conversion
echo (string) $output;
```

Run the extractor on any supported file:

```
Textract::run(string $file_path, [string $job_id], [TesseractOcrOptions $extra_data]);
```

| Option | Type | Default value | Required | Description |
| --- | --- | --- | --- | --- |
| `$file_path` | `String` | No default value | Yes | Absolute path of the file to extract text from. |
| `$job_id` | `String` | `NULL` | No | Extraction job ID. If omitted, the package creates an ID automatically. |
| `$extra_data` | `TesseractOcrOptions` | `NULL` | No | Extra options. When extracting an image file, you can specify languages and more through `Nilgems\PhpTextract\ExtractorService\Ocr\Contracts\TesseractOcrOptions`. |

Configuration

You can add the service provider in app.php under the config folder of your Laravel project. This is optional; the package automatically loads the service provider in your application.
```
'providers' => [
  ...
  Nilgems\PhpTextract\Providers\ServiceProvider::class,
  ...
]
```
You can add the alias in app.php under the config folder of your Laravel project. This is optional; the package automatically loads the facade in your application.
```
'aliases' => [
  ...
  'Textract' => Nilgems\PhpTextract\Textract::class,
  ...
]
```

To publish the config file, run:

```
php artisan vendor:publish --tag=textract
```

Example

Example 1:

You can extract text from any supported file format.

It is recommended to run the extractor in a Laravel queue job for better performance.

PHP limits execution time and memory through the max_execution_time and memory_limit options in php.ini. If the file is large, the process may be killed when it exceeds these limits. You can use a queue (database or Redis) or Laravel Horizon to run the extraction in the background.

```
// ...
Route::get('/textract', function(){
    return Textract::run('/path/to/image/example.png');
});
// ...
```
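
To judge whether a file is large enough to hit the php.ini limits described above, you can compare its size against memory_limit. The following is a rough sketch in plain PHP: `iniSizeToBytes()` and `shouldQueue()` are hypothetical helpers, and the 25% threshold is an arbitrary assumption, not a rule used by the package.

```php
<?php
// Convert a php.ini shorthand size such as "128M" or "1G" to bytes.
// "-1" means "no limit" and is returned as PHP_INT_MAX.
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '-1') {
        return PHP_INT_MAX;
    }

    $number = (int) $value;                  // "128M" -> 128
    $unit = strtoupper(substr($value, -1));  // "128M" -> "M"

    switch ($unit) {
        case 'G':
            return $number * 1024 ** 3;
        case 'M':
            return $number * 1024 ** 2;
        case 'K':
            return $number * 1024;
        default:
            return $number;                  // plain byte count
    }
}

// Assumed rule of thumb: hand the file to a queue worker when it is larger
// than a fraction of memory_limit. The default fraction is a guess.
function shouldQueue(int $fileSize, string $memoryLimit, float $fraction = 0.25): bool
{
    return $fileSize > iniSizeToBytes($memoryLimit) * $fraction;
}
```

A call such as `shouldQueue(filesize($path), ini_get('memory_limit'))` could then decide between running the extraction inline and dispatching it to a queue.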

Example 2:

To get better extraction output from an image file, you can specify the languages it contains:

```
// ...
Route::get('/textract', function(){
    return Textract::run('/path/to/image/example.png', null, [
      'lang' => ['eng', 'jpn', 'spa']
    ]);
});
// ...
```

Dependencies

To enable the image extraction feature, you need to install Tesseract OCR.
To enable the PDF extraction feature, you need to install pdftotext.
To work properly, your server must have the following PHP extensions installed:
- ext-fileinfo
- ext-zip
- ext-gd or ext-imagick
- ext-xml
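
A quick way to confirm that these extensions are present is to ask the PHP runtime directly. The sketch below uses the built-in `extension_loaded()` function; `missingExtensions()` is a hypothetical helper name. Run it with the same PHP binary that serves your application, since the CLI and web server configurations can differ.

```php
<?php
// Return the names from $required that are not loaded in this PHP runtime.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        fn (string $name): bool => !extension_loaded($name)
    ));
}

// Extension names taken from the list above.
$missing = missingExtensions(['fileinfo', 'zip', 'xml']);

// gd and imagick are alternatives: one of them is enough.
if (!extension_loaded('gd') && !extension_loaded('imagick')) {
    $missing[] = 'gd or imagick';
}

echo $missing === []
    ? "All required extensions are loaded.\n"
    : 'Missing extensions: ' . implode(', ', $missing) . "\n";
```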

Tesseract OCR Installation

Ubuntu

Update the system: sudo apt update
Add Tesseract OCR 5 PPA to your system: sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
Install Tesseract on Ubuntu 20.04 | 18.04: sudo apt install -y tesseract-ocr
Once the installation is complete, update your system: sudo apt update
Verify the installation: tesseract --version
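
You can also check from PHP whether the external binaries are reachable, which helps when the web server runs with a different PATH than your shell. This is a minimal sketch, assuming the binaries are named `tesseract` and `pdftotext`; `findExecutable()` is a hypothetical helper, and how the package itself locates them may differ.

```php
<?php
// Look for an executable on the PATH without spawning a shell.
// Returns the full path, or null when nothing is found.
function findExecutable(string $name): ?string
{
    $path = getenv('PATH');
    if ($path === false || $path === '') {
        return null;
    }

    // On Windows, executables usually carry an .exe suffix.
    $suffixes = PHP_OS_FAMILY === 'Windows' ? ['.exe', ''] : [''];

    foreach (explode(PATH_SEPARATOR, $path) as $dir) {
        foreach ($suffixes as $suffix) {
            $candidate = rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }

    return null;
}

// Report where each binary was found, if at all.
foreach (['tesseract', 'pdftotext'] as $binary) {
    $found = findExecutable($binary);
    echo $binary . ': ' . ($found ?? 'not found on PATH') . "\n";
}
```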

Windows

There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
Choco installation: choco install capture2text --version 5.0

Note: Recent versions of Capture2Text no longer ship the tesseract binary.

PdfToText Installation

Ubuntu

Update the system: sudo apt update
Install PdfToText on Ubuntu 20.04 | 18.04: sudo apt-get install poppler-utils
Verify the installation: pdftotext -v

Windows

pdftotext is distributed as part of Poppler, and Poppler is not yet available for Windows. You can still install and use it through the Windows Subsystem for Linux (WSL). Alternatively, you can install Laravel Homestead in your project and use Vagrant virtualization to run the project on an Ubuntu virtual server.