bitandblack/document-crawler

Extract different parts of an HTML or XML document.

Installs: 4

Dependents: 0

Suggesters: 0

Security: 0

Stars: 0

Watchers: 0

Forks: 0

Open Issues: 0

pkg:composer/bitandblack/document-crawler

dev-main 2025-11-27 21:13 UTC

This package is auto-updated.

Last update: 2025-11-27 21:20:12 UTC


README

PHP from Packagist Latest Stable Version Total Downloads License

Bit&Black Logo

Bit&Black Document Crawler

Extract different parts of an HTML or XML document.

Installation

This library is made for the use with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.

Usage

Using Crawlers to extract parts of a document

The Bit&Black Document Crawler library provides different crawlers, to extract information of a document. There are currently existing:

  • IconsCrawler: Crawl and extract all defined icons in a document, that have been declared with <link rel="icon" ... />.
  • ImagesCrawler: Crawl and extract all defined images in a document, that have been declared with <img ... />.
  • LanguageCodeCrawler: Crawl and extract the language code of a document, that has been declared with <html lang="...">.
  • MetaTagsCrawler: Crawl and extract all defined meta tags in a document, that have been declared with <meta ... />.
  • TitleCrawler: Crawl and extract the title of a document, that has been declared with <title>...</title>.

All those crawlers work the same — they need a Dom Crawler object, that contains the document:

<?php

use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
use Symfony\Component\DomCrawler\Crawler;

$document = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
    </head>
    <body>
        <h1>Hello world</h1>
    </body>
</html>
HTML;

$crawler = new Crawler($document);

$titleCrawler = new TitleCrawler($crawler);
$titleCrawler->crawlContent();

// This will output `Test`.
echo $titleCrawler->getTitle();

You can create a custom Crawler by implementing the CrawlerInterface.

Handling resources

In same cases, resources are getting crawled, which you may want to handle in a specific way. To achieve this, each crawler makes use of a so-called Resource Handler. There are currently existing:

You can create a custom Resource Handler by implementing the ResourceHandlerInterface.

Crawling everything at once

In case you don't want to setup something, there is the HolisticDocumentCrawler, that does all the work for you:

<?php

use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

$holisticDocumentCrawler = new HolisticDocumentCrawler('https://www.bitandblack.com');

// Get all icons:
$icons = $holisticDocumentCrawler->getIcons();

// Get all images:
$images = $holisticDocumentCrawler->getImages();

// Get the language code:
$languageCode = $holisticDocumentCrawler->getLanguageCode();

// Get all meta tags:
$metaTags = $holisticDocumentCrawler->getMetaTags();

// Get the title:
$title = $holisticDocumentCrawler->getTitle();

Help

If you have any questions, feel free to contact us under hello@bitandblack.com.

Further information about Bit&Black can be found under www.bitandblack.com.