xatham/text-extraction

Easy text extraction for many different file types

0.0.2 2021-09-25 19:25 UTC

This package is auto-updated.

Last update: 2024-03-26 00:49:26 UTC



text-extraction

About

This PHP library lets you extract plain text from various document types.

The following MIME types are currently supported for extraction:

text/plain

text/csv

application/vnd.ms-excel

application/vnd.oasis.opendocument.text

application/pdf

application/msword
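
Because extraction is limited to the MIME types above, it can be useful to check a file's type before handing it to the extractor. The following is a minimal sketch using PHP's built-in fileinfo extension. The `isSupportedFile` helper and the `SUPPORTED_MIME_TYPES` constant are hypothetical names introduced for illustration and are not part of this library, and the library may perform its own detection differently.

```php
<?php
declare(strict_types=1);

// MIME types listed in this README as supported.
const SUPPORTED_MIME_TYPES = [
    'text/plain',
    'text/csv',
    'application/vnd.ms-excel',
    'application/vnd.oasis.opendocument.text',
    'application/pdf',
    'application/msword',
];

/**
 * Hypothetical helper (not part of the library): detect a file's MIME type
 * with the fileinfo extension and compare it against the list above.
 */
function isSupportedFile(string $path): bool
{
    if (!is_file($path)) {
        return false;
    }

    // FILEINFO_MIME_TYPE returns only the type, e.g. "application/pdf".
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mimeType = $finfo->file($path);

    return $mimeType !== false
        && in_array($mimeType, SUPPORTED_MIME_TYPES, true);
}
```

Note that the detected type depends on the file's content and on the installed libmagic database, so some files (CSV in particular) may be reported as `text/plain`.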

Install

composer require xatham/text-extraction

Usage

/**
 * Extract only PDF files, without OCR.
 */
$textExtractor = (new TextExtractionBuilder())->buildTextExtractor(
    [
        'withOcr' => false,
        'validMimeTypes' => ['application/pdf'],
    ],
);

$target = dirname(__DIR__) . '/examples/sample.pdf';
$plainTextDocument = $textExtractor->extractByFilePath($target);
if ($plainTextDocument === null) {
    exit('Could not extract any data');
}
$texts = $plainTextDocument->getTextItems();

foreach ($texts as $text) {
    var_dump($text);
}
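
If a single string is more convenient than a list of items, the results can be joined. The sketch below assumes that the items returned by `getTextItems()` are strings or string-convertible objects, which this README does not confirm. `joinTextItems` is a hypothetical helper and not part of the library.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): join text items into one
 * string, skipping empty entries and anything that is not string-like.
 */
function joinTextItems(iterable $items, string $separator = "\n"): string
{
    $parts = [];
    foreach ($items as $item) {
        $isStringLike = is_string($item)
            || (is_object($item) && method_exists($item, '__toString'));
        if (!$isStringLike) {
            continue; // Assumption: non-string items carry no text.
        }

        $text = trim((string) $item);
        if ($text !== '') {
            $parts[] = $text;
        }
    }

    return implode($separator, $parts);
}
```

With the variables from the usage example, this would be called as `joinTextItems($plainTextDocument->getTextItems())`.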

License

text-extraction is licensed under the MIT License.