brick / structured-data
Microdata, RDFa Lite & JSON-LD structured data reader
BenMorel
Requires (Dev)
- php-coveralls/php-coveralls: ^2.0
- phpunit/phpunit: ^8.0 || ^9.0
This package is auto-updated.
Last update: 2024-09-03 00:09:37 UTC
README
A PHP library to read Microdata, RDFa Lite & JSON-LD structured data in HTML pages.
This library is a foundation to read schema.org structured data in brick/schema, but may be used with other vocabularies.
Installation
This library is installable via Composer:
composer require brick/structured-data
Requirements
This library requires PHP 7.2 or later. It makes use of the following extensions:
- dom
- json
- libxml
These extensions are enabled by default, and should be available in most PHP installations.
Project status & release process
This library is under development. It is likely to change fast in the early 0.x
releases. However, the library follows a strict BC break convention:
The current releases are numbered `0.x.y`. When a non-breaking change is introduced (adding new methods, fixing bugs, optimizing existing code, etc.), `y` is incremented.
When a breaking change is introduced, a new `0.x` version cycle is always started.
It is therefore safe to lock your project to a given release cycle, such as `0.1.*`.
If you need to upgrade to a newer release cycle, check the release history for a list of changes introduced by each further `0.x.0` version.
Introduction
The library unifies reading the 3 supported formats (Microdata, RDFa Lite & JSON-LD) under a common interface:
```php
namespace Brick\StructuredData;

use DOMDocument;

interface Reader
{
    /**
     * Reads the items contained in the given document.
     *
     * @param DOMDocument $document The DOM document to read.
     * @param string      $url      The URL the document was retrieved from. This will be used only to resolve relative
     *                              URLs in property values. No attempt will be performed to connect to this URL.
     *
     * @return Item[] The top-level items.
     */
    public function read(DOMDocument $document, string $url) : array;
}
```
There are 3 implementations of this interface, one for each format:
MicrodataReader
RdfaLiteReader
JsonLdReader
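As a minimal sketch of calling one of these readers directly, the snippet below parses an HTML string into a `DOMDocument` and passes it to `JsonLdReader`. The sample markup, the example URL and the `vendor/autoload.php` location are assumptions made for illustration. It also assumes that `JsonLdReader` can be constructed without arguments.

```php
<?php

use Brick\StructuredData\Reader\JsonLdReader;

// Assumption: the package was installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Made-up sample markup containing a single JSON-LD block.
$html = <<<'HTML'
<html>
<head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Product", "name": "Example product"}
</script>
</head>
<body></body>
</html>
HTML;

// Readers work on DOMDocument instances, so the HTML is parsed first.
// libxml warnings about imperfect markup are collected instead of printed.
$previous = libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The URL is only used to resolve relative URLs; it is never fetched.
$reader = new JsonLdReader();
$items = $reader->read($document, 'http://www.example.com/');

// Print the types of each top-level item.
foreach ($items as $item) {
    echo implode(', ', $item->getTypes()), PHP_EOL;
}
```

Swapping in `MicrodataReader` or `RdfaLiteReader` should only require changing the class name, since all three implement the same interface.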
The `read()` method returns the top-level items found in the document. Every `Item` consists of:
- An optional id (`itemid` in Microdata, `resource` in RDFa Lite, `@id` in JSON-LD)
- An array of zero or more types; each type is a URL, for example `http://schema.org/Product`
- An associative array of zero or more properties; each property has a URL as a key, for example `http://schema.org/price`, and maps to an array of one or more values; values can be plain strings, or nested `Item` objects
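Because values can themselves be `Item` objects, walking an item is naturally recursive. The helper below is a sketch of one way to flatten an item into plain PHP arrays. The function name `itemToArray()` and the shape of the returned array are hypothetical and not part of the library. It relies on `getTypes()` and `getProperties()` as used in the Quickstart, and assumes that `Item` also exposes a `getId()` accessor for the optional id.

```php
<?php

use Brick\StructuredData\Item;

// Assumption: the package was installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

/**
 * Hypothetical helper: converts an Item and all of its nested Items
 * into nested arrays of strings.
 */
function itemToArray(Item $item): array
{
    $properties = [];

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            // A value is either a plain string or a nested Item.
            $properties[$name][] = $value instanceof Item
                ? itemToArray($value)
                : $value;
        }
    }

    return [
        'id'         => $item->getId(), // assumed accessor; null when the item has no id
        'types'      => $item->getTypes(),
        'properties' => $properties,
    ];
}
```

The result can then be passed to `json_encode()` or `print_r()` for inspection.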
Quickstart
Here is a working example that reads Microdata from a web page. Just change the URL and give it a try:
```php
use Brick\StructuredData\Reader\MicrodataReader;
use Brick\StructuredData\HTMLReader;
use Brick\StructuredData\Item;

// Let's read Microdata here;
// You could also use RdfaLiteReader, JsonLdReader,
// or even use all of them by chaining them in a ReaderChain
$microdataReader = new MicrodataReader();

// Wrap into HTMLReader to be able to read HTML strings or files directly,
// i.e. without manually converting them to DOMDocument instances first
$htmlReader = new HTMLReader($microdataReader);

// Replace this URL with that of a website you know is using Microdata
$url = 'http://www.example.com/';
$html = file_get_contents($url);

// Read the document and return the top-level items found
// Note: the URL is only required to resolve relative URLs; no attempt will be made to connect to it
$items = $htmlReader->read($html, $url);

// Loop through the top-level items
foreach ($items as $item) {
    echo implode(',', $item->getTypes()), PHP_EOL;

    foreach ($item->getProperties() as $name => $values) {
        foreach ($values as $value) {
            if ($value instanceof Item) {
                // We're only displaying the class name in this example; you would typically
                // recurse through nested Items to get the information you need
                $value = '(' . implode(', ', $value->getTypes()) . ')';
            }

            // If $value is not an Item, then it's a plain string
            echo "  - $name: $value", PHP_EOL;
        }
    }
}
```
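The Quickstart mentions that the readers can be combined in a `ReaderChain`. The following is a sketch of that setup. It assumes that `ReaderChain` lives in the `Brick\StructuredData\Reader` namespace and accepts the readers as constructor arguments. The sample HTML and the `vendor/autoload.php` location are made up for illustration.

```php
<?php

use Brick\StructuredData\HTMLReader;
use Brick\StructuredData\Reader\JsonLdReader;
use Brick\StructuredData\Reader\MicrodataReader;
use Brick\StructuredData\Reader\RdfaLiteReader;
use Brick\StructuredData\Reader\ReaderChain;

// Assumption: the package was installed with Composer next to this script.
require __DIR__ . '/vendor/autoload.php';

// Assumed constructor: one argument per reader to run.
$chain = new ReaderChain(
    new MicrodataReader(),
    new RdfaLiteReader(),
    new JsonLdReader()
);

// HTMLReader accepts HTML strings, as in the Quickstart.
$htmlReader = new HTMLReader($chain);

// Made-up sample page mixing JSON-LD and Microdata.
$html = <<<'HTML'
<html>
<head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Organization", "name": "Example Org"}
</script>
</head>
<body>
<div itemscope itemtype="http://schema.org/Person">
<span itemprop="name">Jane Doe</span>
</div>
</body>
</html>
HTML;

// The URL is only used to resolve relative URLs; it is never fetched.
$items = $htmlReader->read($html, 'http://www.example.com/');

foreach ($items as $item) {
    echo implode(', ', $item->getTypes()), PHP_EOL;
}
```

Chaining is convenient when you do not know in advance which format a page uses, since many sites mix formats on the same page.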
Current limitations
- No support for the `itemref` attribute in `MicrodataReader`
- No support for the `prefix` attribute in `RdfaLiteReader`; only predefined prefixes are supported right now
- No proper support for `@context` in `JsonLdReader`; right now, only strings are accepted in `@context`, and they are considered a vocabulary identifier; this works fine with simple markup like the one used in the examples on schema.org, but may fail with more complex documents.
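To check in advance whether a page stays within the `@context` limitation, a small pre-check can help. The function below is a hypothetical helper, not part of the library. It uses only the dom and json extensions to test whether every top-level JSON-LD object in an HTML string declares `@context` as a plain string. It does not inspect nested objects or `@graph` members.

```php
<?php

/**
 * Hypothetical helper: returns true when every top-level JSON-LD object
 * embedded in the HTML declares its @context as a plain string.
 * Also returns true when the page contains no JSON-LD block at all.
 */
function hasOnlyStringContexts(string $html): bool
{
    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);

        if (! is_array($data)) {
            return false; // invalid JSON, or not an object or array
        }

        // A block may hold a single object or a list of objects.
        $objects = isset($data[0]) ? $data : [$data];

        foreach ($objects as $object) {
            if (! is_array($object) || ! is_string($object['@context'] ?? null)) {
                return false; // missing, object or array @context
            }
        }
    }

    return true;
}
```

A page for which this returns `false` is not necessarily unreadable, but it falls outside the simple case described above and the extracted items deserve a closer look.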
Note about JSON-LD's @context
While `JsonLdReader` should be able to handle a proper context object in the future, its goal will never be to be a fully compliant JSON-LD parser; in particular, it will never attempt to fetch a JSON-LD context referenced by a URL.
This is consistent with how indexing robots typically crawl the web: they do not fetch remote contexts, which relieves them from fetching additional documents to extract structured data from a web page.
The aim of `JsonLdReader`, and the other `Reader` implementations for that matter, is to be able to parse a document with the same capabilities as Google Structured Data Testing Tool or Yandex Structured data validator, no more, no less. These tools do not load external context files.