nekulin/php-apache-tika

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

0.3.0 2015-12-13 22:50 UTC

This package is not auto-updated.

Last update: 2024-05-03 16:09:48 UTC


README

This tool provides Apache Tika bindings for PHP, allowing you to extract text and metadata from documents, images and other formats.

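Two modes are supported:

  • App mode: runs the Tika app JAR through the command line
  • Server mode: sends HTTP requests to a running Tika server

Server mode is recommended because it is 5 times faster, but some shared hosts do not allow running processes in the background.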

Features

  • Simple class interface to Apache Tika features:
    • Text and HTML extraction
    • Metadata extraction
    • OCR recognition
  • Standardized metadata for documents
  • Support for local and remote resources
  • No heavyweight library dependencies

Requirements

  • PHP 5.4 or greater
  • Apache Tika 1.7 or greater
  • Oracle Java or OpenJDK
    • Java 6 for Tika up to 1.9
    • Java 7 for Tika 1.10 or greater
  • Tesseract (optional, only needed for OCR)

Installation

Install using composer:

composer require vaites/php-apache-tika

If you want to use OCR you must install Tesseract:

  • Fedora/CentOS: sudo yum install tesseract (use dnf instead of yum on Fedora 22 or greater)
  • Debian/Ubuntu: sudo apt-get install tesseract-ocr
  • Mac OS X: brew install tesseract (using Homebrew)

Usage

Start the Apache Tika server (only needed for server mode), with caution:

java -jar tika-server-1.10.jar

Instantiate the class:

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode 
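Because server mode depends on a running Tika server, one option is to probe the port first and fall back to app mode when nothing is listening. The following is a minimal sketch under stated assumptions, not part of the library: `chooseTikaArguments()` is a hypothetical helper, and the host, port and JAR path are placeholder values. It only builds the argument list for `Client::make()`; it does not call the library itself.

```php
<?php

/**
 * Hypothetical helper (not part of php-apache-tika): returns the arguments
 * for Client::make() depending on whether something answers on $host:$port.
 */
function chooseTikaArguments($host, $port, $appJar, $timeout = 1.0)
{
    // fsockopen() returns false when the TCP connection cannot be opened;
    // the @ operator suppresses the warning it would otherwise emit.
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);

    if ($socket !== false) {
        fclose($socket);
        return array($host, $port);   // server mode: host and port
    }

    return array($appJar);            // app mode fallback: path to the JAR
}

// Placeholder values: adjust them to your setup.
$arguments = chooseTikaArguments('localhost', 9998, '/path/to/tika-app.jar');
echo count($arguments) === 2 ? "server mode\n" : "app mode\n";
```

The result could then be handed over with `call_user_func_array(array('Vaites\ApacheTika\Client', 'make'), $arguments)`. An open port only shows that some process is listening there, not that it is a Tika server, so this check is a convenience rather than a guarantee.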

Use the class to extract the language, metadata, HTML and text from documents:

$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
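The calls above can be combined into one helper that collects everything known about a file. This is only a sketch: `describeDocument()` is a hypothetical function, not part of the library, and it assumes that failures (missing file, unreachable server) surface as exceptions, which this README does not document. It uses only the `getLanguage()`, `getMetadata()` and `getText()` methods shown above, and assumes `getText()` returns a string.

```php
<?php

/**
 * Hypothetical helper (not part of php-apache-tika): gathers the language,
 * metadata and text of one file into a single array.
 *
 * $client is any object exposing getLanguage(), getMetadata() and
 * getText(), such as the client created in this README.
 */
function describeDocument($client, $path)
{
    try {
        return array(
            'ok'       => true,
            'language' => $client->getLanguage($path),
            'metadata' => $client->getMetadata($path),
            // Extracted text often carries leading/trailing whitespace.
            'text'     => trim($client->getText($path)),
        );
    } catch (\Exception $e) {
        // Assumption: errors are reported as exceptions; the concrete
        // exception classes are not specified in this README.
        return array('ok' => false, 'error' => $e->getMessage());
    }
}
```

With a client created as shown earlier, `$info = describeDocument($client, '/path/to/your/document');` yields an array whose `ok` key tells whether extraction succeeded, so the caller can branch without its own try/catch.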

Or use it to extract text from images (OCR):

$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');

Integrations