caseyamcl/phpoaipmh

A PHP OAI-PMH 2.0 Harvester library

v3.3.1 2023-01-27 17:40 UTC

This package is auto-updated.

Last update: 2024-08-24 17:17:01 UTC



A PHP OAI-PMH harvester client library


This library provides an interface to harvest OAI-PMH metadata from any OAI 2.0 compliant endpoint.

Features:

  • PSR-12 Compliant
  • Composer-compatible
  • Unit-tested
  • Prefers Guzzle (v6, v7, or v5) for the HTTP transport layer, but can fall back to cURL; you can also implement your own adapter
  • Easy-to-use iterator that hides all the HTTP junk necessary to get paginated records

Installation Options

Install via Composer by including the following in your composer.json file:

{
    "require": {
        "caseyamcl/phpoaipmh": "^3.0",
        "guzzlehttp/guzzle":   "^7.0"
    }
}

Or, drop the src folder into your application and use a PSR-4 autoloader to include the files.

Note: Guzzle v6 or v7 is recommended, but if you do not wish to use either, you can use any one of the following:

  • Guzzle 5.0 - You can use Guzzle v5 instead of v6 or v7.
  • cURL - This library will fall back to using cURL if Guzzle is not installed.
  • Build your own - You can use a different HTTP client library by passing your own implementation of the Phpoaipmh\HttpAdapter\HttpAdapterInterface to the Phpoaipmh\Client constructor.

Upgrading

There are several backwards-incompatible API improvements in major version changes. See UPGRADE.md for information about how to upgrade your code to use the new version.

Usage

Set up a new endpoint client:

// Quick and easy 'build' method 
$myEndpoint = \Phpoaipmh\Endpoint::build('http://some.service.com/oai');

// Or, create your own client instance and pass it to `Endpoint::__construct()` 
$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$myEndpoint = new \Phpoaipmh\Endpoint($client);

Get basic information:

// Result will be a SimpleXMLElement object
$result = $myEndpoint->identify();
var_dump($result);

// Results will be iterator of SimpleXMLElement objects
$results = $myEndpoint->listMetadataFormats();
foreach($results as $item) {
    var_dump($item);
}

Retrieving records

// Recs will be an iterator of SimpleXMLElement objects
$recs = $myEndpoint->listRecords('someMetaDataFormat');

// The iterator will continue retrieving items across multiple HTTP requests.
// You can keep running this loop through the *entire* collection you
// are harvesting.  All OAI-PMH and HTTP pagination logic is hidden neatly
// behind the iterator API.
foreach($recs as $rec) {
    var_dump($rec);
}

Limiting record retrieval by date/time

Pass instances of DateTimeInterface to Endpoint::listRecords() or Endpoint::listIdentifiers() as the second (from) and third (until) arguments.

To limit only one end of the range, pass null for the other argument.

// Retrieve records from Jan 1, 2018 through October 1, 2018
$recs = $myEndpoint->listRecords('someMetaDataFormat', new \DateTime('2018-01-01'), new \DateTime('2018-10-01'));

foreach($recs as $rec) {
    var_dump($rec);
}

Setting date/time granularity

This library will attempt to retrieve the granularity automatically from the OAI-PMH Identify endpoint, but if you want to set it yourself, you can pass a Granularity constant as the second argument to the Endpoint constructor:

use Phpoaipmh\Client,
    Phpoaipmh\Endpoint,
    Phpoaipmh\Granularity;

$client = new Client('http://some.service.com/oai');
$myEndpoint = new Endpoint($client, Granularity::DATE_AND_TIME);
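Granularity controls how the from/until dates are written into the request. The OAI-PMH 2.0 specification defines two forms, day-level (YYYY-MM-DD) and second-level (YYYY-MM-DDThh:mm:ssZ), both in UTC. The sketch below only illustrates what each form looks like for a given DateTimeInterface; the constant and function names are hypothetical and it does not reproduce the library's internals.

```php
<?php
declare(strict_types=1);

// PHP date() patterns for the two OAI-PMH timestamp forms.
// Hypothetical names; these are NOT the library's Granularity constants.
const OAI_FORMAT_DATE          = 'Y-m-d';           // YYYY-MM-DD
const OAI_FORMAT_DATE_AND_TIME = 'Y-m-d\TH:i:s\Z';  // YYYY-MM-DDThh:mm:ssZ

function formatOaiDate(DateTimeInterface $date, string $format): string
{
    // An '@timestamp' date is always created in UTC, whatever zone the input used
    $utc = new DateTimeImmutable('@' . $date->getTimestamp());

    return $utc->format($format);
}

$from = new DateTimeImmutable('2018-01-01 12:30:00', new DateTimeZone('Europe/Berlin'));

echo formatOaiDate($from, OAI_FORMAT_DATE), "\n";
echo formatOaiDate($from, OAI_FORMAT_DATE_AND_TIME), "\n";
```

With day-level granularity the time of day is dropped, so a harvest window can only be narrowed to whole days.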

Record sets

Some OAI-PMH endpoints sub-divide records into sets.

You can list the record sets available for a given endpoint by calling Endpoint::listSets():

foreach ($myEndpoint->listSets() as $set) {
    var_dump($set);
}

You can specify the set you wish to retrieve by passing the set name as the fourth argument to Endpoint::listIdentifiers() or Endpoint::listRecords():

foreach ($myEndpoint->listRecords('someMetaDataFormat', null, null, 'someSetName') as $record) {
    var_dump($record);
}

Getting total record count

Some endpoints provide a total record count for your query. If the endpoint provides this, you can access the value by calling RecordIterator::getTotalRecordCount().

If the endpoint does not provide this count, then RecordIterator::getTotalRecordCount() returns null.

$iterator = $myEndpoint->listRecords('someMetaDataFormat');
// Use ?? rather than ?: so that a reported count of 0 is not shown as 'unknown'
echo "Total count is " . ($iterator->getTotalRecordCount() ?? 'unknown');
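When the endpoint reports a total, it can be combined with getNumRetrieved() (see Iterator Metadata) to show progress during a long harvest. A minimal sketch; formatProgress() is a hypothetical helper and not part of the library:

```php
<?php
declare(strict_types=1);

/**
 * Build a progress message from the number of records retrieved so far
 * and the (possibly unknown) total reported by the endpoint.
 */
function formatProgress(int $retrieved, ?int $total): string
{
    // No total available (or a non-positive one): report the running count only
    if ($total === null || $total <= 0) {
        return sprintf('%d records retrieved (total unknown)', $retrieved);
    }

    return sprintf('%d of %d records (%.1f%%)', $retrieved, $total, 100 * $retrieved / $total);
}

// Intended use inside the harvesting loop, assuming $iterator came from listRecords():
// echo formatProgress($iterator->getNumRetrieved(), $iterator->getTotalRecordCount()), "\n";
```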

Handling Results

Depending on the verb you use, the library will send back either a SimpleXMLElement or an iterator containing SimpleXMLElement objects.

  • For identify and getRecord, a SimpleXMLElement object is returned
  • For listMetadataFormats, listSets, listIdentifiers, and listRecords, a Phpoaipmh\RecordIterator is returned

The Phpoaipmh\RecordIterator object encapsulates the logic to iterate through paginated sets of records.

Handling Errors

This library will throw different exceptions under different circumstances:

  • HTTP request errors will generate a Phpoaipmh\Exception\HttpException
  • Response body parsing issues (e.g. invalid XML) will generate a Phpoaipmh\Exception\MalformedResponseException
  • OAI-PMH protocol errors (e.g. invalid verb or missing params) will generate a Phpoaipmh\Exception\OaipmhException

All exceptions extend the Phpoaipmh\Exception\BaseOaipmhException class.

Customizing Default Request Options

You can customize the default request options (for example, request timeout) for both cURL and Guzzle clients by building the adapter objects manually.

If you're using Guzzle v6 or v7, you can set default options by building your own Guzzle client and setting parameters in the constructor:

use GuzzleHttp\Client as GuzzleClient;
use Phpoaipmh\Client;
use Phpoaipmh\Endpoint;
use Phpoaipmh\HttpAdapter\GuzzleAdapter;

$guzzle = new GuzzleAdapter(new GuzzleClient([
    'connect_timeout' => 2.0,
    'timeout'         => 10.0
]));

$myEndpoint = new Endpoint(new Client('http://some.service.com/oai', $guzzle));

If you're using cURL, you can set request options by passing them in as an array of key/value items to CurlAdapter::setCurlOpts():

use Phpoaipmh\Client,
    Phpoaipmh\HttpAdapter\CurlAdapter;

$adapter = new CurlAdapter();
$adapter->setCurlOpts([CURLOPT_TIMEOUT => 120]);
$client = new Client('http://some.service.com/oai', $adapter);

$myEndpoint = new Endpoint($client);

If you're using Guzzle v5, you can set default options on the adapter's Guzzle client:

use Phpoaipmh\Client,
    Phpoaipmh\HttpAdapter\GuzzleAdapter;

$adapter = new GuzzleAdapter();
$adapter->getGuzzleClient()->setDefaultOption('timeout', 120);
$client = new Client('http://some.service.com/oai', $adapter);

$myEndpoint = new Endpoint($client);

Dealing with XML Namespaces

Many OAI-PMH XML documents make use of XML Namespaces. For non-XML experts, it can be confusing to implement these in PHP. SitePoint has a brief but excellent overview of how to use Namespaces in SimpleXML.
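As a rough illustration, the sketch below reads a Dublin Core record with SimpleXML. The sample XML and its values are made up for the example; records returned by a real endpoint will differ in content and possibly in metadata format.

```php
<?php
declare(strict_types=1);

// Made-up sample, shaped like an oai_dc record
$xml = <<<'XML'
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:123</identifier>
    <datestamp>2018-01-01</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example title</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>
XML;

$rec = simplexml_load_string($xml);

// Elements in the default (unprefixed) namespace are reachable as plain properties
$identifier = (string) $rec->header->identifier;

// Prefixed elements are only visible through children() with the namespace URI
$dc = $rec->metadata
    ->children('http://www.openarchives.org/OAI/2.0/oai_dc/')->dc
    ->children('http://purl.org/dc/elements/1.1/');
$title = (string) $dc->title;

// XPath queries need the prefix registered before use
$rec->registerXPathNamespace('dc', 'http://purl.org/dc/elements/1.1/');
$creators = array_map('strval', $rec->xpath('//dc:creator'));

echo "$identifier: $title by " . implode('; ', $creators) . "\n";
```

Matching on the namespace URI rather than the prefix is the safer choice, since endpoints are free to pick different prefixes for the same namespace.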

Iterator Metadata

The Phpoaipmh\RecordIterator iterator contains some helper methods:

  • getNumRequests() - Returns the number of HTTP requests made thus far
  • getNumRetrieved() - Returns the number of individual records retrieved
  • reset() - Resets the iterator, which will restart the record retrieval from scratch.

Handling 503 Retry-After Responses

Some OAI-PMH endpoints employ rate-limiting so that you can only make a certain number of requests in a given time period. These endpoints will return an HTTP 503 response with a Retry-After header if your code generates too many HTTP requests too quickly.
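The Retry-After header can carry either a number of seconds or an HTTP date. If you handle retries yourself (for example in a custom HTTP adapter), the wait time has to be derived from that value. A minimal sketch, assuming you already have the raw header string; retryAfterToSeconds() is a hypothetical helper and not part of this library:

```php
<?php
declare(strict_types=1);

/**
 * Convert a Retry-After header value into a number of seconds to wait.
 *
 * @param string $headerValue Raw header value ('' if the header is missing)
 * @param int    $now         Current Unix timestamp, passed in to keep the function testable
 * @param int    $default     Delay to use when the header is missing or unparseable
 */
function retryAfterToSeconds(string $headerValue, int $now, int $default = 5): int
{
    $value = trim($headerValue);
    if ($value === '') {
        return $default;
    }

    // Form 1: delay in seconds, e.g. "120"
    if (ctype_digit($value)) {
        return (int) $value;
    }

    // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
    $timestamp = strtotime($value);
    if ($timestamp !== false) {
        return max(0, $timestamp - $now); // a date in the past means "retry now"
    }

    return $default;
}

// Typical use: sleep(retryAfterToSeconds($headerValue, time()));
```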

Guzzle v6

If you have installed Guzzle v6, then you can use the Guzzle-Retry-Middleware library to automatically handle OAI-PMH endpoint rate limiting rules.

First, include the middleware as a dependency in your app:

composer require caseyamcl/guzzle_retry_middleware

Then, when loading the Phpoaipmh libraries, build a Guzzle client manually, and add the middleware to the stack. Example:

use GuzzleRetry\GuzzleRetryMiddleware;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;

// Set up the Guzzle client with the retry middleware
$stack = HandlerStack::create();
$stack->push(GuzzleRetryMiddleware::factory());
$guzzleClient = new GuzzleClient(['handler' => $stack]);

// Set up the Guzzle adapter and PHP OAI-PMH client
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter($guzzleClient);
$client  = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);

This will create a client that automatically retries requests when OAI-PMH endpoints send 503 rate-limiting responses.

The Retry middleware contains a number of options. Refer to the README for that package for details.

Guzzle v5

If you have installed Guzzle v5, then you can use the Retry-Subscriber to automatically handle OAI-PMH endpoint rate-limiting rules.

First, include the retry-subscriber as a dependency in your composer.json:

{
    "require": {
        "guzzlehttp/retry-subscriber": "~2.0"
    }
}

Then, when loading the Phpoaipmh libraries, instantiate the Guzzle adapter manually, and add the subscriber as indicated in the code below:

// Create a Retry Guzzle Subscriber
$retrySubscriber = new \GuzzleHttp\Subscriber\Retry\RetrySubscriber([
    'delay' => function($numRetries, \GuzzleHttp\Event\AbstractTransferEvent $event) {
        $waitSecs = $event->getResponse()->getHeader('Retry-After') ?: '5';
        return ($waitSecs * 1000) + 1000; // wait one second longer than the server said to
    },
    'filter' => \GuzzleHttp\Subscriber\Retry\RetrySubscriber::createStatusFilter(),
]);

// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->attach($retrySubscriber);

$client  = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);

This will create a client that automatically retries requests when OAI-PMH endpoints send 503 rate-limiting responses.

Sending Arbitrary Query Parameters

If you wish to send arbitrary HTTP query parameters with your requests, you can send them via the \Phpoaipmh\Client class:

$client = new \Phpoaipmh\Client('http://some.service.com/oai');
$client->request('Identify', ['some' => 'extra-param']);

Alternatively, if you wish to send arbitrary parameters while taking advantage of the convenience of the \Phpoaipmh\Endpoint class, you can use the Guzzle Param Middleware library.

First, include the middleware as a dependency in your app:

$ composer require emarref/guzzle-param-middleware

Then, when loading the Phpoaipmh libraries, build a Guzzle client manually, and add the middleware to the stack. Example:

use Emarref\Guzzle\Middleware\ParamMiddleware;
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\HandlerStack;

// Set up the Guzzle stack
$stack = HandlerStack::create();
$stack->push(new ParamMiddleware(['api_key' => 'xyz123']));

// Set up the Guzzle client, adapter, and PHP OAI-PMH client
$guzzleClient = new GuzzleClient(['handler' => $stack]);
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter($guzzleClient);
$client  = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);

This will add the specified query parameters to all requests for the client.

Sending arbitrary query parameters with Guzzle v5

If you are using Guzzle v5, you can use the Guzzle event system:

// Create a function or class to add parameters to a request
$addParamsListener = function(\GuzzleHttp\Event\BeforeEvent $event) {
   $req = $event->getRequest();
   $req->getQuery()->add('api_key', 'xyz123');

   // You could do other things to the request here, too, like adding a header.
   $req->addHeader('Some-Header', 'some-header-value');
};

// Manually create a Guzzle HTTP adapter
$guzzleAdapter = new \Phpoaipmh\HttpAdapter\GuzzleAdapter();
$guzzleAdapter->getGuzzleClient()->getEmitter()->on('before', $addParamsListener);

$client  = new \Phpoaipmh\Client('http://some.service.com/oai', $guzzleAdapter);

Implementation Tips

Harvesting data from an OAI-PMH endpoint can be a time-consuming task, especially when there are lots of records. Typically, this kind of task is done via a CLI script or background process that can run for a long time. It is not normally a good idea to make it part of a web request.
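One way to structure such a script is to write each record to disk as soon as it arrives, so that memory use stays flat and an interrupted run keeps what it already harvested. This is a sketch under that assumption, not a recommendation from the library itself; harvestToFile() and the file name in the usage comment are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Append every record's XML to $path as it is yielded by the iterator.
 *
 * @param iterable $records SimpleXMLElement objects, e.g. the result of listRecords()
 * @return int Number of records written
 */
function harvestToFile(iterable $records, string $path): int
{
    $handle = fopen($path, 'ab'); // append mode: earlier runs are preserved
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }

    $count = 0;
    try {
        foreach ($records as $record) {
            fwrite($handle, $record->asXML() . PHP_EOL);
            $count++;
        }
    } finally {
        // Close the file even if the iterator throws part-way through
        fclose($handle);
    }

    return $count;
}

// Intended use from a CLI script, assuming $myEndpoint was built as shown under Usage:
// $written = harvestToFile($myEndpoint->listRecords('someMetaDataFormat'), 'records.xml.txt');
```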

Credits

License

MIT License; see LICENSE file for details