brick/schema

Schema.org library for PHP

Maintainers

Details

github.com/brick/schema

Source

Issues

Fund package maintenance!
BenMorel

0.1.1 2020-03-27 14:36 UTC

This package is auto-updated.

Last update: 2024-03-29 03:51:56 UTC


README

A PHP library to read schema.org structured data from HTML pages.

Latest Stable Version License

Introduction

This library extracts structured data (Microdata, RDFa Lite & JSON-LD) from HTML pages using brick/structured-data, and maps any schema.org information found to IDE- and static analysis- friendly objects implementing schema.org interfaces.

Installation

This library is installable via Composer:

composer require brick/schema

Requirements

This library requires PHP 7.2 or later. It makes use of the following extensions:

These extensions are enabled by default, and should be available in most PHP installations.

Project status & release process

This library is under development. It is likely to change fast in the early 0.x releases. However, the library follows a strict BC break convention:

The current releases are numbered 0.x.y. When a non-breaking change is introduced (adding new methods, fixing bugs, optimizing existing code, etc.), y is incremented.

When a breaking change is introduced, a new 0.x version cycle is always started.

It is therefore safe to lock your project to a given release cycle, such as 0.1.*.

If you need to upgrade to a newer release cycle, check the release history for a list of changes introduced by each further 0.x.0 version.

Quickstart

First of all, you need to instantiate a SchemaReader:

use Brick\Schema\SchemaReader;

// read all available formats:
$schemaReader = SchemaReader::forAllFormats();

// or read a single format:
// $schemaReader = SchemaReader::forMicrodata();
// $schemaReader = SchemaReader::forRdfaLite();
// $schemaReader = SchemaReader::forJsonLd();

Then, you can proceed to reading an HTML document:

// The URL the document was retrieved from. This will be used only to resolve relative
// URLs in property values. No attempt will be performed to connect to this URL.
$url = 'https://example.com/product/123';

$things = $schemaReader->readHtml($html, $url);               // An HTML document as a string
// or $things = $schemaReader->readHtmlFile($htmlFile, $url); // A path to an HTML file
// or $things = $schemaReader->read($domDocument, $url);      // A DOMDocument instance

This returns an array of Thing instances, which is the base class from which all schema.org objects inherit.

All the objects returned implement Thing, but more importantly, they implement the interface(s) defined in the markup, such as Person or Product. These objects pass instanceof checks, and allow IDE autocompletion and static analysis by listing their available properties.

Schema.org interfaces live under the Brick\Schema\Interfaces namespace:

  • Brick\Schema\Interfaces\Thing
    • Brick\Schema\Interfaces\Person
    • ...

Every property of every interface, for example Person::$birthDate is an instance of SchemaTypeList, which is a container for zero or more values. Each value may be another Thing, or a plain text string.

With that in mind, let's see an example:

use Brick\Schema\Interfaces as Schema;

foreach ($things as $thing) {
    if ($thing instanceof Schema\Product) {
        // Your IDE should now provide autocompletion for available Product properties:
        // category, color, gtin, sku, offers, ...

        foreach ($thing->offers as $offer) {
            // You should always check if the Thing matches the expected type,
            // even if the schema.org property documents a single type (here, Offer).
            // See the Caveats section below for an explanation why.

            if ($offer instanceof Schema\Offer) {
                // Yes! we do have an offer, let's check its price.
                // Don't forget that all properties have zero or more values, let's take the first one:

                $price = $offer->price->getFirstValue();

                // For the same reason as above (see Caveats), always check the type of the value,
                // it could very well be a nested Thing instance, or null if there is no value.

                if (is_string($price)) {
                    echo $price;
                }

                // There are also helper functions in the SchemaTypeList object when you expect a string.
                // For example, this will return the first string value, or null if not found:

                $priceCurrency = $offer->priceCurrency->toString();

                if ($priceCurrency !== null) {
                    echo $priceCurrency;
                }
            }
        }
    }
}

Note: if you're attempting to access a property that's not defined on any of the types the Thing object implements, an exception will be thrown.

Caveats

While the schema.org properties are well defined, they are designed to be quite lenient in terms of what values they accept. While they do document expected types for every property (for example, a Product's offers may only be of the Offer type), in practice this library, in accordance with the schema.org datamodel, accepts any Thing or string in any field.

You should therefore take all documented interface property types with a grain of salt, and always perform checks such as instanceof or is_string() in your code.

To quote the schema.org website:

Conformance

Although it might be helpful for search applications if structured data markup always followed schema.org very strictly, in practice this is unrealistic. Our schemas also continue to evolve in response to feedback, discussion and new applications of the data. Where possible we amend existing definitions incrementally rather than introducing lots of new properties for similar use cases. We have consequently based schema.org on a very flexible datamodel, and take a pragmatic view of conformance.

We expect schema.org properties to be used with new types, both from schema.org and from external extensions. We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string, even if our schemas don't formally document that expectation. In the spirit of "some data is better than none", search engines will often accept this markup and do the best we can. Similarly, some types such as Role and URL can be used with all properties, and we encourage this kind of experimentation amongst data consumers.