osmuhin / html-meta
Parses website metadata such as titles, favicons and others
Requires
- php: >=8.2
- ext-mbstring: *
- guzzlehttp/guzzle: >=6.3
- masterminds/html5: >=2.8
- symfony/dom-crawler: >=6.0
- symfony/mime: *
Requires (Dev)
- mockery/mockery: ^1.6.12
- phpunit/phpunit: ^11.5
- symfony/var-dumper: ^7.1
README
HTML Meta is a PHP package for parsing website metadata, such as titles, favicons, OpenGraph tags and others.
Installation
To install the package via Composer, run:
composer require osmuhin/html-meta
Note
Ensure that the vendor/autoload.php file is required in your code to enable the autoloading mechanism provided by Composer.
Basic usage
Parsing Metadata from URL
use Osmuhin\HtmlMeta\Crawler;

$meta = Crawler::init(url: 'https://google.com')->run();

echo $meta->title; // Google
Parsing Metadata from Raw HTML
Instead of a URL, you can parse metadata from raw HTML by passing it as a string:
$html = <<<END
<html lang="en">
<head>
    <title>Google</title>
    <meta charset="UTF-8">
    <link rel="icon" href="/favicon.ico">
</head>
</html>
END;

$meta = Crawler::init(html: $html, url: 'https://google.com')->run();

$icon = $meta->favicon->icons[0];
echo $icon->url; // https://google.com/favicon.ico
Always pass the url parameter when using raw HTML to resolve relative paths correctly.
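To illustrate why the base URL matters, here is a minimal sketch of resolving a relative path against a base URL in plain PHP. The function name `resolveAgainstBase` is hypothetical and this is not the package's own resolver; it only covers absolute, protocol-relative, root-relative and simple document-relative references, and it does not normalize `..` segments.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration only; not part of osmuhin/html-meta.
function resolveAgainstBase(string $href, string $baseUrl): string
{
    // A reference that already carries a scheme is absolute: return it unchanged.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }

    $base = parse_url($baseUrl);
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative reference: reuse only the scheme of the base URL.
    if (str_starts_with($href, '//')) {
        return $base['scheme'] . ':' . $href;
    }

    // Root-relative reference: append to the origin.
    if (str_starts_with($href, '/')) {
        return $origin . $href;
    }

    // Document-relative reference: resolve against the directory of the base path.
    $path = $base['path'] ?? '/';
    $directory = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $directory . $href;
}

echo resolveAgainstBase('/favicon.ico', 'https://google.com'), PHP_EOL;
// https://google.com/favicon.ico
```

Without a base URL there is no origin to prepend, which is why a relative href such as /favicon.ico cannot be turned into an absolute URL from raw HTML alone.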
Using a Custom Request Object
Under the hood, the Guzzle HTTP library is used to fetch the HTML, so you can create your own request object and pass it as the $request parameter:
$request = new \GuzzleHttp\Psr7\Request('GET', 'https://google.com');

$meta = Crawler::init(request: $request)->run();
All properties of the meta object are described here.
Configuration
You can customize the crawler’s behavior using its configuration methods:
$crawler = Crawler::init(url: 'https://google.com');

$crawler->config
    ->dontProcessUrls()
    ->dontUseTypeConversions()
    ->processUrlsWith('https://yandex.ru')
    ->dontUseDefaultDistributorsConfiguration();
Setting | Description |
---|---|
dontProcessUrls() | Disables the conversion of relative URLs to absolute URLs. |
dontUseTypeConversions() | Disables automatic type conversions (e.g., string to int). For <meta property="og:image:height" content="630">: with type conversions int(630), without type conversions string(3) "630". For <meta property="og:image:height" content="630.5">: with type conversions null, without type conversions string(5) "630.5". |
processUrlsWith(string $url) | Sets a base URL for resolving relative paths (automatically enables URL processing). |
dontUseDefaultDistributorsConfiguration() | Disables the default distributor configuration. |
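The type-conversion behavior described in the table can be reproduced with a short standalone sketch. The function `convertIntAttribute` is a hypothetical name and this is not the package's implementation; it assumes that a strict integer validation is what turns "630" into an int and "630.5" into null.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration only; not part of osmuhin/html-meta.
function convertIntAttribute(string $raw, bool $useTypeConversions = true): int|string|null
{
    if (!$useTypeConversions) {
        // Conversions disabled: the attribute value is passed through as a string.
        return $raw;
    }

    // FILTER_NULL_ON_FAILURE makes filter_var() return null instead of false
    // when the value is not a valid integer.
    return filter_var($raw, FILTER_VALIDATE_INT, FILTER_NULL_ON_FAILURE);
}

var_dump(convertIntAttribute('630'));          // int(630)
var_dump(convertIntAttribute('630.5'));        // NULL
var_dump(convertIntAttribute('630', false));   // string(3) "630"
var_dump(convertIntAttribute('630.5', false)); // string(5) "630.5"
```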
Core concepts
The Crawler object
The main interaction happens through the $crawler object of type \Osmuhin\HtmlMeta\Crawler.
- Initialization: Configure the crawler before calling run().
- Execution: After run() is called, the crawler performs the following steps:
  - fetches the HTML string from the URL (if raw HTML is not provided). If more than one parameter is passed, the priority is: string $html ➡ \GuzzleHttp\Psr7\Request $request ➡ string $url;
  - parses the HTML using the configured xpath:
    $crawler->xpath = '//html|//html/head/link|//html/head/meta|//html/head/title';
    You are free to overwrite the xpath property;
  - passes each found HTML element to the distributor stack. If the element passes a distributor's conditions, its value is written to a DTO (Data Transfer Object) of the type \Osmuhin\HtmlMeta\Contracts\Dto;
  - after the HTML string has been parsed, returns the root DTO \Osmuhin\HtmlMeta\Dto\Meta as output.
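To see which elements the default xpath expression selects, the following sketch evaluates it with PHP's built-in DOM extension. This is an assumption made for a self-contained example: the package itself depends on masterminds/html5 and symfony/dom-crawler, so its internals may differ.

```php
<?php
declare(strict_types=1);

$html = <<<HTML
<html lang="en">
<head>
    <title>Google</title>
    <meta charset="UTF-8">
    <link rel="icon" href="/favicon.ico">
</head>
<body><p>Not selected</p></body>
</html>
HTML;

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);
$nodes = $xpath->query('//html|//html/head/link|//html/head/meta|//html/head/title');

// Gather the tag names of every selected element.
$selected = [];
foreach ($nodes as $node) {
    $selected[] = $node->nodeName;
}

print_r($selected);
```

The expression matches the html, title, meta and link elements and ignores everything in the body. Narrowing it (for example to '//html/head/title') reduces the set of elements that reach the distributors.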
Distributors
A Distributor validates HTML elements and distributes their data into DTOs.
A distributor must implement the \Osmuhin\HtmlMeta\Contracts\Distributor interface, which has two main methods:
public function canHandle(): bool
{
}

public function handle(): void
{
}
canHandle() - Checks whether the distributor can handle the current element. If it returns true, all sub-distributors are polled, and then the handle() method is called.
handle() - Distributes the HTML element's data across DTOs according to its own rules.
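The polling order described above can be sketched with a toy model. All class names here (`ToyDistributor`, `ToyMetaDistributor`, `ToyOpenGraphDistributor`) are hypothetical, elements and the result are plain arrays rather than the package's objects, and the sketch takes the description literally: if canHandle() returns true, sub-distributors are polled first and handle() runs afterwards.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration; not the package's Distributor contract.
abstract class ToyDistributor
{
    /** @var ToyDistributor[] */
    private array $subDistributors = [];

    public function useSubDistributors(ToyDistributor ...$subDistributors): static
    {
        $this->subDistributors = $subDistributors;

        return $this;
    }

    abstract public function canHandle(array $el): bool;

    abstract public function handle(array $el, array &$meta): void;

    public function dispatch(array $el, array &$meta): void
    {
        if (!$this->canHandle($el)) {
            return; // Neither this distributor nor its children see the element.
        }

        foreach ($this->subDistributors as $sub) {
            $sub->dispatch($el, $meta); // Poll the sub-distributors first.
        }

        $this->handle($el, $meta); // Then handle the element itself.
    }
}

final class ToyMetaDistributor extends ToyDistributor
{
    public function canHandle(array $el): bool
    {
        return $el['name'] === 'meta';
    }

    public function handle(array $el, array &$meta): void
    {
        $meta['metaCount'] = ($meta['metaCount'] ?? 0) + 1;
    }
}

final class ToyOpenGraphDistributor extends ToyDistributor
{
    public function canHandle(array $el): bool
    {
        return str_starts_with($el['property'] ?? '', 'og:');
    }

    public function handle(array $el, array &$meta): void
    {
        $meta['openGraph'][$el['property']] = $el['content'] ?? '';
    }
}

$root = (new ToyMetaDistributor())->useSubDistributors(new ToyOpenGraphDistributor());

$meta = [];
$root->dispatch(['name' => 'meta', 'property' => 'og:title', 'content' => 'Google'], $meta);
$root->dispatch(['name' => 'title'], $meta); // Rejected by canHandle(), so ignored.

print_r($meta);
```

Nesting distributors this way means a child only ever sees elements its parent has already accepted, which keeps each canHandle() check small.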
The simplest distributor, TitleDistributor, has the following structure:
class TitleDistributor extends \Osmuhin\HtmlMeta\Distributors\AbstractDistributor
{
    public function canHandle(): bool
    {
        return $this->el->name === 'title';
    }

    public function handle(): void
    {
        $this->meta->title = $this->el->innerText;
    }
}
You are free to replace any distributor with your own, for example:
use Osmuhin\HtmlMeta\Distributors\TitleDistributor;

class MyCustomTitleDistributor extends TitleDistributor
{
    public function handle(): void
    {
        $this->meta->title = 'Prefix for title ' . $this->el->innerText;
    }
}
Then replace the original TitleDistributor in the initial configuration:
$crawler = Crawler::init(url: 'https://google.com');

$crawler->distributor->setSubDistributor(
    MyCustomTitleDistributor::class,
    TitleDistributor::class
);

$meta = $crawler->run();

$meta->title === 'Prefix for title Google';
... or even overwrite the distributors tree completely:
$crawler = Crawler::init(url: 'https://google.com');
$crawler->xpath = '//html/head/title';
$crawler->config->dontUseDefaultDistributorsConfiguration();

$crawler->distributor->useSubDistributors(
    MyCustomTitleDistributor::init($crawler->container)
);

$meta = $crawler->run();
Default distributors configuration
$crawler->distributor->useSubDistributors(
    \Osmuhin\HtmlMeta\Distributors\HtmlDistributor::init(),
    \Osmuhin\HtmlMeta\Distributors\TitleDistributor::init(),
    \Osmuhin\HtmlMeta\Distributors\MetaDistributor::init()->useSubDistributors(
        \Osmuhin\HtmlMeta\Distributors\HttpEquivDistributor::init(),
        \Osmuhin\HtmlMeta\Distributors\TwitterDistributor::init(),
        \Osmuhin\HtmlMeta\Distributors\OpenGraphDistributor::init()
    ),
    \Osmuhin\HtmlMeta\Distributors\LinkDistributor::init()->useSubDistributors(
        \Osmuhin\HtmlMeta\Distributors\LinkRelDistributor::init()->useSubDistributors(
            \Osmuhin\HtmlMeta\Distributors\FaviconDistributor::init()
        )
    )
);
Contributing
Thank you for considering contributing to this package! Please refer to the Contributing Guidelines for more details.
You can contact me or just come say hi in Telegram: @wischerdson
License
This package is open-sourced software licensed under the MIT license.