nilgems / laravel-textract
A Laravel package to extract text from files such as DOC, Excel, image, PDF and more. The package was inspired by the npm package "textract".
Requires
- php: ^7.4|^8.0
- ext-fileinfo: *
- ext-gd: *
- ext-xml: *
- ext-zip: *
- html2text/html2text: ^4.3
- laravel/framework: ^8.0|^9.0
- lywzx/php-epub: ^0.1.2
- phpoffice/phpspreadsheet: ^1.23
- phpoffice/phpword: ^0.18
- stechstudio/laravel-php-cs-fixer: ^3.1
- symfony/process: ^6.1
- thiagoalessio/tesseract_ocr: ^2.12
Requires (Dev)
- friendsofphp/php-cs-fixer: ^v3.8.0
- phpunit/phpunit: ^9.5
This package is auto-updated.
Last update: 2024-11-10 23:36:18 UTC
README
Laravel Textract
A Laravel package to extract text from files such as DOC, Excel, image, PDF and more.
Versions and compatibility
- Laravel 10 or higher is required.
- PHP 8.2 or higher is required.
Supported file formats
The following file formats are currently supported. To work with all of them, install the required extensions on your server. The package checks the MIME type of the file content before running an extractor.
- HTML
- TEXT
- DOC
- DOCX
- XLS, XLSX, XLSM, XLTX, XLTM, XLT
- CSV
- Image
  - jpeg
  - png
  - gif
- ODT
- ODS
- RTF
- PPTX (NEW)
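The content-based MIME check mentioned above can be illustrated in plain PHP with ext-fileinfo. This is only a sketch of the general technique, not the package's own code: `detectMimeType()`, `isSupportedMime()` and the `EXAMPLE_SUPPORTED_MIMES` list are hypothetical names chosen for illustration, and the real list of accepted MIME types may differ.

```php
<?php
// Sketch only: detect a file's MIME type from its content (not its extension).
// All names below are hypothetical helpers, not part of laravel-textract.

// Illustrative allow-list; the package's actual list may differ.
const EXAMPLE_SUPPORTED_MIMES = [
    'text/plain',
    'text/html',
    'image/jpeg',
    'image/png',
    'image/gif',
];

function detectMimeType(string $path): string
{
    if (!is_file($path) || !is_readable($path)) {
        throw new InvalidArgumentException("File is not readable: {$path}");
    }

    // FILEINFO_MIME_TYPE returns only the type, e.g. "image/png".
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);

    if ($mime === false) {
        throw new RuntimeException("Could not detect MIME type of: {$path}");
    }

    return $mime;
}

function isSupportedMime(string $mime): bool
{
    return in_array($mime, EXAMPLE_SUPPORTED_MIMES, true);
}

// Usage: write a small text file and inspect it.
$sample = tempnam(sys_get_temp_dir(), 'mime');
file_put_contents($sample, "Plain text sample for MIME detection.\n");

$mime = detectMimeType($sample);
echo $mime, ' supported: ', isSupportedMime($mime) ? 'yes' : 'no', PHP_EOL;

unlink($sample);
```

Checking the content rather than the file name means a renamed file (for example a PNG saved as `.txt`) is still routed by what it actually contains.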
We are working hard to make this Laravel package useful. If you find an issue, please post it in the discussions.
Installation
composer require nilgems/laravel-textract
Once installed you can do stuff like this:
# Run the extractor
$output = Textract::run('/path/to/file.extension');
# Display the extracted text
echo $output->text;
# Display the extracted text word count
echo $output->word_count;
# Display the extracted text with direct string conversion
echo (string) $output;
Run the extractor on any supported file:
Textract::run(string $file_path, [string $job_id],[TesseractOcrOptions $extra_data]);
Configuration
- You can add the provider in app.php under the config folder of your Laravel project. This is optional; the package automatically loads the service provider in your application.
'providers' => [ ... Nilgems\PhpTextract\Providers\ServiceProvider::class, ... ]
- Add the alias in app.php under the config folder of your Laravel project. This is also optional; the package automatically loads the facade in your application.
'aliases' => [ ... 'Textract' => Nilgems\PhpTextract\Textract::class, ... ]
- To publish the config file, run:
php artisan vendor:publish --tag=textract
Example
Example 1:
You can extract text from any supported file format.
For better performance, it is recommended to run the extractor in a Laravel queue job. PHP restricts execution time and memory through the max_execution_time and memory_limit options in the php.ini file. If the file is large, the process may be killed when it exceeds one of these limits. You can use a queue (database/redis) or Laravel Horizon to run the process in the background.
........
Route::get('/textract', function () {
    return Textract::run('/path/to/image/example.png');
});
........
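As a rough illustration of the php.ini limits described in Example 1, the sketch below reads memory_limit and max_execution_time and applies a simple rule of thumb to decide whether a file should go to a queue. The helpers `iniSizeToBytes()` and `shouldQueueExtraction()` are hypothetical, and the 25% threshold is an arbitrary assumption, not something the package defines.

```php
<?php
// Sketch only: hypothetical helpers, not part of laravel-textract.

// Convert a php.ini size value such as "128M" or "1G" to bytes.
// Returns -1 for "-1" (unlimited) and 0 for an empty value.
function iniSizeToBytes(string $value): int
{
    $value = trim($value);
    if ($value === '') {
        return 0;
    }
    if ($value === '-1') {
        return -1;
    }

    $number = (int) $value; // (int) "128M" yields 128
    switch (strtolower(substr($value, -1))) {
        case 'g':
            return $number * 1024 ** 3;
        case 'm':
            return $number * 1024 ** 2;
        case 'k':
            return $number * 1024;
        default:
            return $number;
    }
}

// Assumed rule of thumb: queue the job when the file is larger than
// $ratio of the memory limit. Unlimited or unknown limits never queue.
function shouldQueueExtraction(int $fileSizeBytes, string $memoryLimit, float $ratio = 0.25): bool
{
    $limit = iniSizeToBytes($memoryLimit);
    if ($limit <= 0) {
        return false;
    }

    return $fileSizeBytes > $limit * $ratio;
}

// Usage: inspect the limits of the current PHP process.
$memoryLimit = (string) ini_get('memory_limit');
$maxTime = (int) ini_get('max_execution_time'); // 0 means no time limit
printf("memory_limit=%s max_execution_time=%d\n", $memoryLimit, $maxTime);
```

In a real application the result of such a check would typically decide between calling the extractor inline and dispatching a queued job; the threshold should be tuned to the files you actually process.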
Example 2:
To improve extraction from an image file, you can specify the languages it contains.
........
Route::get('/textract', function () {
    return Textract::run('/path/to/image/example.png', null, [
        'lang' => ['eng', 'jpn', 'spa']
    ]);
});
........
Dependencies
- To enable the image extraction feature you need to install Tesseract OCR
- To enable the PDF extraction feature you need to install pdftotext
- For the package to work properly, your server must have the following PHP extensions installed:
  - ext-fileinfo
  - ext-zip
  - ext-gd or ext-imagick
  - ext-xml
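The requirements above can be checked from PHP before running an extraction. The following is a minimal sketch, assuming the external tools are installed under the binary names `tesseract` and `pdftotext` and are reachable through PATH; `missingExtensions()` and `findExecutable()` are hypothetical helpers, not package API.

```php
<?php
// Sketch only: hypothetical helpers, not part of laravel-textract.

// Return the names of required PHP extensions that are not loaded.
function missingExtensions(array $required): array
{
    return array_values(array_filter(
        $required,
        fn (string $ext): bool => !extension_loaded($ext)
    ));
}

// Look for an executable in the directories listed in PATH.
// Returns the full path, or null when it is not found.
function findExecutable(string $name): ?string
{
    $suffixes = PHP_OS_FAMILY === 'Windows' ? ['.exe', '.bat', '.cmd', ''] : [''];
    $dirs = explode(PATH_SEPARATOR, (string) getenv('PATH'));

    foreach ($dirs as $dir) {
        if ($dir === '') {
            continue;
        }
        foreach ($suffixes as $suffix) {
            $candidate = $dir . DIRECTORY_SEPARATOR . $name . $suffix;
            if (is_file($candidate) && is_executable($candidate)) {
                return $candidate;
            }
        }
    }

    return null;
}

// Usage: report on extensions and external binaries.
$missing = missingExtensions(['fileinfo', 'zip', 'xml']);
if (!extension_loaded('gd') && !extension_loaded('imagick')) {
    $missing[] = 'gd or imagick';
}
echo 'Missing extensions: ', $missing === [] ? 'none' : implode(', ', $missing), PHP_EOL;

foreach (['tesseract', 'pdftotext'] as $binary) {
    printf("%-10s %s\n", $binary, findExecutable($binary) ?? 'not found');
}
```

Running a check like this once at deploy time gives a clearer error than a failed extraction later on.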
Tesseract OCR Installation
Ubuntu
- Update the system:
sudo apt update
- Add Tesseract OCR 5 PPA to your system:
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
- Install Tesseract on Ubuntu 20.04 | 18.04:
sudo apt install -y tesseract-ocr
- Once installation is complete update your system:
sudo apt update
- Verify the installation:
tesseract --version
Windows
- There are many ways to install Tesseract OCR on your system, but if you just want something quick to get up and running, I recommend installing the Capture2Text package with Chocolatey.
- Choco installation:
choco install capture2text --version 5.0
Note: Recent versions of Capture2Text stopped shipping the tesseract binary.
PdfToText Installation
Ubuntu
- Update the system:
sudo apt update
- Install PdfToText on Ubuntu 20.04 | 18.04:
sudo apt-get install poppler-utils
- Verify the installation:
pdftotext -v
Windows
- pdftotext is provided by Poppler, and Poppler is not yet available for Windows. You can, however, install and use it through the Windows Subsystem for Linux (WSL). Alternatively, you can install Laravel Homestead in your project and run it on an Ubuntu virtual server using Vagrant.