xatham / text-extraction
Easy text extraction for many different file types
0.0.2
2021-09-25 19:25 UTC
Requires
- php: >=7.4
- ext-fileinfo: *
- ext-imagick: *
- league/flysystem: ^2.0
- phpoffice/phpspreadsheet: ^1.15
- phpoffice/phpword: ^0.17.0 | ^0.18.2
- shuchkin/simplexlsx: ^0.8.19
- smalot/pdfparser: ^0.17.1
- symfony/finder: ^5.2
- thiagoalessio/tesseract_ocr: ^2.9
Requires (Dev)
- friendsofphp/php-cs-fixer: ^2.17
- phpmd/phpmd: ^2.9
- phpspec/prophecy-phpunit: ^2.0
- phpstan/phpstan: ^0.12.62
- phpunit/phpunit: ^9.5
This package is auto-updated.
Last update: 2025-03-26 03:01:22 UTC
README
text-extraction
About
This PHP library lets you extract plain text from various document types.
The MIME types currently supported for extraction are:
text/plain
text/csv
application/vnd.ms-excel
application/vnd.oasis.opendocument.text
application/pdf
application/msword
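Before handing a file to the extractor, it can be useful to check whether its MIME type is in the supported list. The following is a minimal sketch, not part of the library: `SUPPORTED_MIME_TYPES` and `isSupportedForExtraction()` are hypothetical names introduced for illustration, and the detection relies only on PHP's `finfo` class from `ext-fileinfo`, which the package already requires. How the library itself detects MIME types internally is not documented here, so its results may differ.

```php
<?php

declare(strict_types=1);

// MIME types listed in this README as supported for extraction.
// The constant name is an assumption made for this example.
const SUPPORTED_MIME_TYPES = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/vnd.oasis.opendocument.text',
    'application/pdf',
    'application/msword',
];

/**
 * Hypothetical helper (not provided by the library): detects the MIME type
 * of a file with ext-fileinfo and checks it against the list above.
 */
function isSupportedForExtraction(string $path): bool
{
    // Missing paths and directories can never be extracted.
    if (!is_file($path)) {
        return false;
    }

    // FILEINFO_MIME_TYPE returns only the type, e.g. "application/pdf".
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mimeType = $finfo->file($path);

    return $mimeType !== false
        && in_array($mimeType, SUPPORTED_MIME_TYPES, true);
}
```

A call such as `isSupportedForExtraction(__DIR__ . '/examples/sample.pdf')` could then guard the extraction step, so that unsupported files are skipped early instead of being passed to the extractor.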
Install
composer require xatham/text-extraction
Usage
/**
 * Extracting only pdf files, without ocr capturing
 */
$textExtractor = (new TextExtractionBuilder())->buildTextExtractor(
    [
        'withOcr' => false,
        'validMimeTypes' => ['application/pdf'],
    ],
);

$target = dirname(__DIR__) . '/examples/sample.pdf';
$plainTextDocument = $textExtractor->extractByFilePath($target);
if ($plainTextDocument === null) {
    exit('Could not extract any data');
}

$texts = $plainTextDocument->getTextItems();
foreach ($texts as $text) {
    var_dump($text);
}
License
text-extraction is licensed under MIT.