bitandblack / document-crawler
Extract different parts of an HTML or XML document.
pkg:composer/bitandblack/document-crawler
Requires
- php: ^8.2
- bitandblack/composer-helper: ^2.0
- bitandblack/pathinfo: ^1.0
- fig/http-message-util: ^1.0
- php-http/discovery: ^1.0
- places2be/locales: ^3.3
- psr/http-client: ^1.0
- symfony/css-selector: 7.0 || ^8.0
- symfony/dom-crawler: ^7.0 || ^8.0
Requires (Dev)
- bitandblack/helpers: ^2.0
- nyholm/psr7: ^1.8
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^11.0
- react/http: ^1.0
- rector/rector: ^2.0
- symfony/http-client: ^7.0 || ^8.0
- symfony/var-dumper: ^7.0 || ^8.0
- symplify/easy-coding-standard: ^13.0
This package is auto-updated.
Last update: 2025-11-27 21:20:12 UTC
README
Bit&Black Document Crawler
Extract different parts of an HTML or XML document.
Installation
This library is made for use with Composer. Add it to your project by running $ composer require bitandblack/document-crawler.
Usage
Using Crawlers to extract parts of a document
The Bit&Black Document Crawler library provides several crawlers that extract information from a document. The following crawlers currently exist:
- IconsCrawler: Crawls and extracts all icons in a document that have been declared with <link rel="icon" ... />.
- ImagesCrawler: Crawls and extracts all images in a document that have been declared with <img ... />.
- LanguageCodeCrawler: Crawls and extracts the language code of a document that has been declared with <html lang="...">.
- MetaTagsCrawler: Crawls and extracts all meta tags in a document that have been declared with <meta ... />.
- TitleCrawler: Crawls and extracts the title of a document that has been declared with <title>...</title>.
All of these crawlers work the same way: each needs a Symfony DomCrawler object that contains the document:
    <?php

    use BitAndBlack\DocumentCrawler\ContentCrawler\TitleCrawler;
    use Symfony\Component\DomCrawler\Crawler;

    $document = <<<HTML
    <!doctype html>
    <html lang="en">
        <head>
            <title>Test</title>
        </head>
        <body>
            <h1>Hello world</h1>
        </body>
    </html>
    HTML;

    $crawler = new Crawler($document);

    $titleCrawler = new TitleCrawler($crawler);
    $titleCrawler->crawlContent();

    // This will output `Test`.
    echo $titleCrawler->getTitle();
You can create a custom Crawler by implementing the CrawlerInterface.
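To make concrete what the crawlers listed above look for, here is a rough, dependency-free sketch that collects the same five kinds of information with PHP's built-in DOM extension. It is not how this library is implemented and uses none of its classes. The function name `extractDocumentParts` and the array keys are hypothetical and exist only for this illustration.

```php
<?php

declare(strict_types=1);

/**
 * Illustrative stand-in, NOT part of bitandblack/document-crawler.
 * Uses only PHP's DOM extension to collect title, language code,
 * icons, images and named meta tags from an HTML string.
 */
function extractDocumentParts(string $html): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely perfectly valid, so collect parser
    // warnings internally instead of letting them be emitted.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);

    // Helper: run an XPath query and return the trimmed node values.
    $values = static function (string $query) use ($xpath): array {
        $result = [];
        foreach ($xpath->query($query) as $node) {
            $result[] = trim($node->nodeValue ?? '');
        }
        return $result;
    };

    // Only meta tags with both a name and a content attribute are kept here.
    $metaTags = [];
    foreach ($xpath->query('//meta[@name and @content]') as $meta) {
        $metaTags[$meta->getAttribute('name')] = $meta->getAttribute('content');
    }

    return [
        'title' => $values('//title')[0] ?? null,
        'languageCode' => $values('/html/@lang')[0] ?? null,
        // Matches rel="icon" exactly; other rel values are ignored in this sketch.
        'icons' => $values('//link[@rel="icon"]/@href'),
        'images' => $values('//img/@src'),
        'metaTags' => $metaTags,
    ];
}

$html = <<<HTML
<!doctype html>
<html lang="en">
    <head>
        <title>Test</title>
        <meta name="description" content="A test page">
        <link rel="icon" href="/favicon.ico">
    </head>
    <body>
        <h1>Hello world</h1>
        <img src="/logo.png" alt="Logo">
    </body>
</html>
HTML;

$parts = extractDocumentParts($html);

echo $parts['title'], PHP_EOL;
echo $parts['languageCode'], PHP_EOL;
```

The sketch returns raw attribute values only. Anything beyond that, such as resolving relative URLs or downloading the referenced files, is outside its scope.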
Handling resources
In some cases, a crawler encounters resources that you may want to handle in a specific way. To achieve this, each crawler makes use of a so-called Resource Handler. The following handlers currently exist:
- The FileSystemDownloadHandler: This one loads resources and writes them to the file system. There are different Downloaders available to fetch resources:
  - The HttpDiscoveryDownloader is the default one and uses whatever HTTP client library your project already provides to download resources.
  - The ReactDownloader needs the react/http library and fetches resources asynchronously.
  - You can also create a custom Downloader by implementing the FileSystemDownloaderInterface.
- The PassiveResourceHandler: This handler does nothing and is the default one.
You can create a custom Resource Handler by implementing the ResourceHandlerInterface.
Crawling everything at once
If you don't want to set anything up yourself, the HolisticDocumentCrawler does all the work for you:
    <?php

    use BitAndBlack\DocumentCrawler\HolisticDocumentCrawler;

    $holisticDocumentCrawler = new HolisticDocumentCrawler('https://www.bitandblack.com');

    // Get all icons:
    $icons = $holisticDocumentCrawler->getIcons();

    // Get all images:
    $images = $holisticDocumentCrawler->getImages();

    // Get the language code:
    $languageCode = $holisticDocumentCrawler->getLanguageCode();

    // Get all meta tags:
    $metaTags = $holisticDocumentCrawler->getMetaTags();

    // Get the title:
    $title = $holisticDocumentCrawler->getTitle();
Help
If you have any questions, feel free to contact us at hello@bitandblack.com.
Further information about Bit&Black can be found at www.bitandblack.com.