teon/text-extraction

Text Extraction Library

This package's canonical repository appears to be gone and the package has been frozen as a result.

v0.2.0 2015-12-01 03:15 UTC

This package is not auto-updated.

Last update: 2024-01-20 14:43:15 UTC


README

PHP library for extracting text from various documents using various drivers and strategies.

Multiple extractors for each file extension/media-type are supported.

Usage

The library has two modes of operation, each with two submodes.

Modes: 1.) Use one of the extraction strategies and let the library do all the work (requires the fileinfo PHP extension); or 2.) Get the extractor(s) for your file type (by explicitly stating either the media type or the file extension) and use them manually.

Submodes: a) operate on a file path, or b) operate on file contents held in a string.
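
Mode 2 needs a media type or a file extension as input. The sketch below shows one way to obtain them with plain PHP, for both a file path (submode a) and a string (submode b). The function names are hypothetical helpers and are not part of this library; only pathinfo() and the finfo class from the fileinfo extension are used.

```php
<?php
// Hypothetical helpers, not part of teon/text-extraction.

/** Lower-cased extension of a path, or '' when the path has none. */
function guessExtension(string $filePath): string
{
    return strtolower(pathinfo($filePath, PATHINFO_EXTENSION));
}

/** Media type of a file on disk (submode a), or null if detection fails. */
function guessMediaTypeFromFile(string $filePath): ?string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type  = $finfo->file($filePath);

    return $type === false ? null : $type;
}

/** Media type of contents held in a string (submode b), or null if detection fails. */
function guessMediaTypeFromString(string $fileContent): ?string
{
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type  = $finfo->buffer($fileContent);

    return $type === false ? null : $type;
}
```

The results can then be passed to getByExtension() or getByMediaType() as shown under "Manual extractor selection mode".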

Installation with composer

composer require teon/text-extraction

General usage

1.) Fully-automatic mode:

// Instantiate
$TextExtraction = new \Teon\Text\Extraction\Extraction();

// Submode a): extract from a file path
$text1 = $TextExtraction->fromFile($filePath);

// Submode b): extract from file contents held in a string
$text2 = $TextExtraction->fromString($fileContent);

2.) Manual extractor selection mode

// Instantiate
$TextExtraction    = new \Teon\Text\Extraction\Extraction();
$ExtractorRegistry = $TextExtraction->getRegistry();

// Get appropriate extractors
$extractors1 = $ExtractorRegistry->getByMediaType($fileMediaType);
$extractors2 = $ExtractorRegistry->getByExtension($fileExtension);

// Do your magic to decide which extractor to use
$Extractor1 = $extractors1[0];
$Extractor2 = $extractors2[0];

// Submode a):
$text1 = $Extractor1->fromFile($filePath);

// Submode b):
$text2 = $Extractor2->fromString($fileContent);
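
The "do your magic" step is left to the caller. One possible approach, sketched below, is to try each candidate extractor in order and keep the first non-empty result. This assumes only that extractors expose fromFile() as in the usage above; whether the real extractors signal failure by throwing or by returning an empty string is not documented here, so the sketch handles both. The function name and the stand-in extractor objects are hypothetical.

```php
<?php
// Hypothetical helper, not part of teon/text-extraction.
// Tries each extractor in order and returns the first non-empty text.
function extractFirstNonEmpty(array $extractors, string $filePath): ?string
{
    foreach ($extractors as $extractor) {
        try {
            $text = $extractor->fromFile($filePath);
        } catch (\Throwable $e) {
            continue; // this extractor failed; try the next one
        }

        if (is_string($text) && trim($text) !== '') {
            return $text;
        }
    }

    return null; // no extractor produced any text
}

// Stand-in extractors for demonstration only (not library classes).
$failing = new class {
    public function fromFile(string $path): string
    {
        throw new \RuntimeException('unsupported file');
    }
};
$empty = new class {
    public function fromFile(string $path): string
    {
        return '';
    }
};
$working = new class {
    public function fromFile(string $path): string
    {
        return "text from $path";
    }
};

$text = extractFirstNonEmpty([$failing, $empty, $working], 'report.pdf');
```

With real extractors, the array returned by getByMediaType() or getByExtension() would take the place of the stand-ins.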

Before using the library, you may adjust its configuration:

// Get default configuration
$config = \Teon\Text\Extraction\Extraction::getDefaultConfiguration();

// Adjust it
$config['strategy']['class'] = "\\My\\Super\\Dooper\\TextExtractionStrategy";

// Instantiate with adjusted configuration
$TextExtraction = new \Teon\Text\Extraction\Extraction($config);

// Start using it
// ...
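
When several settings change at once, the overrides can be merged into the defaults in one step. The sketch below uses array_replace_recursive() on a stand-in defaults array. The key names are borrowed from the Symfony configuration example in this README, and the default values are invented for illustration; the real array comes from getDefaultConfiguration() and may differ.

```php
<?php
// Illustrative defaults only; in real code use
// \Teon\Text\Extraction\Extraction::getDefaultConfiguration().
$defaults = [
    'strategy'  => ['class' => 'ConcatOutput'],
    'extractor' => [
        'pdfocr' => ['enabled' => false, 'command' => 'default-ocr-command'],
    ],
];

// Only the settings that should change.
$overrides = [
    'extractor' => ['pdfocr' => ['enabled' => true]],
];

// Nested keys are replaced individually; untouched keys keep their defaults.
$config = array_replace_recursive($defaults, $overrides);
```

The merged $config would then be passed to the Extraction constructor as shown above.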

Usage in framework: Symfony

Install with composer, as described above:

composer require teon/text-extraction

Adjust configuration settings (app/config/config.yml or parameters.yml):

teon_text_extraction:
    strategy:
        class: ConcatOutput
    extractor:
        pdfocr:
            enabled: true
            command: my-convert-pdf-to-tiff-and-run-tesseract.sh

See the Resources/config/config.yml file for the tunable settings, or print the default configuration.

Register bundle:

/*
 * FILE: app/AppKernel.php
 */
    // ...
    public function registerBundles()
    {
        $bundles = array(
            // ...
            new Teon\Text\Extraction\TeonTextExtractionBundle(),
            // ...
        );

        return $bundles;
    }
    // ...

Use in your controller:

/*
 * FILE:  src/YourApp/Controller/TextExtractionController.php
 */
    // ...
    public function extractAction ()
    {
        // Get service
        $TextExtraction = $this->get('teon_text_extraction');

        // See section "General usage" above
        // ...
    }
    // ...

Usage in framework #2: TODO, patches welcome.