Flexible website crawler which stores the results in persistent storage


v2.4.0 2024-02-20 23:27 UTC


The content import process is complicated and unpredictable, especially the crawling phase. The main reason for this complexity is the vast number of possible scenarios:

  • Some sites might run on the insecure HTTP protocol, or they might use an invalid SSL certificate.
  • There might be broken links on some pages.
  • Some source sites might have strange redirects that need to be ignored or followed.
  • For some sites, the crawler needs to obey special limits: requests per second, timeout, etc.
  • Sometimes not all links need to be crawled, but only specific ones: links in the menus, links in the main content, etc.
  • Some site sections should be excluded from crawling, or crawling needs to be restricted to specific sections.
  • Crawling the site might involve multiple domains or protocols.

It is obviously unrealistic to create a crawler that handles every possible case out of the box. That's why we focused on making the crawler flexible: all of the scenarios listed above can be handled through the provided configuration options.
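To make that flexibility concrete, here is a loose sketch of a handler tuned for two of the scenarios above (rate limiting and excluded sections). Note that the override method names below (`getRequestsPerSecond()`, `getIgnoredPathPatterns()`) are hypothetical placeholders, not the library's actual API; consult the reference for the real configuration options.

```php
<?php

namespace App\ContentImport;

use ContextualCode\Crawler\Service\Handler as BaseCrawlerHandler;

class ThrottledCrawlerHandler extends BaseCrawlerHandler
{
    public function getImportIdentifier(): string
    {
        return 'throttled-import';
    }

    public function getDomain(): string
    {
        return 'www.site-to-crawl.com';
    }

    // Hypothetical method: limit the request rate for a slow source site.
    public function getRequestsPerSecond(): int
    {
        return 2;
    }

    // Hypothetical method: regex patterns for site sections to skip.
    public function getIgnoredPathPatterns(): array
    {
        return ['#^/admin/#', '#^/search/#'];
    }
}
```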

This crawler stores its data in persistent storage (a database). It was designed to be used in content imports, but it is a separate component that can be used in any other scenario. Its only purpose is to crawl a site and store its pages' metadata in persistent storage. The metadata of the crawled pages can then be used for any purpose: import, analysis, or any custom functionality.


  1. Require contextualcode/crawler via Composer:

     composer require contextualcode/crawler
  2. Run the migration:

     php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/crawler/src/Resources/config/doctrine_migrations.yaml --no-interaction


This section describes the basic usage concepts. Please check the usage example and reference pages for technical details.

The usage flow is the following:

  1. Implement a crawler handler.

    It should be a PHP class which extends ContextualCode\Crawler\Service\Handler. It has many flexible configuration options, described in the reference. The simplest crawler handler only needs to provide an import identifier and a site domain:

     namespace App\ContentImport;
     use ContextualCode\Crawler\Service\Handler as BaseCrawlerHandler;
     class CrawlerHandler extends BaseCrawlerHandler
     {
         public function getImportIdentifier(): string
         {
             return 'unique-identifier';
         }
         public function getDomain(): string
         {
             return 'www.site-to-crawl.com';
         }
     }
  2. Run the crawler:run command.

    This command requires a single argument: the crawler identifier defined in the previous step. A more detailed description of this command is available in the reference:

     php bin/console crawler:run unique-identifier
      To get live logs, please run the following command in a new terminal:
      tail -f /XXX/var/log/contextualcode-crawler.log
     Running the crawler ...
     Url: http://www.site-to-crawl.com/
     282/282 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 7 secs/7 secs
     All links are processed:
      * 281 valid links
      * 1 invalid links
  3. Analyze and use the crawled pages' metadata.

    The command from the previous step populates ContextualCode\Crawler\Entity\Page entities in the database. They can be used for a content import or any other custom functionality. A detailed explanation of what data is stored in these entities is available in the reference.
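As a sketch of step 3, the Page entities can be read back with a standard Doctrine repository. This assumes a Symfony service with the EntityManager injected; the getter used below (`getUrl()`) is an assumption about the entity's fields, so check the reference for the actual ones.

```php
<?php

namespace App\ContentImport;

use ContextualCode\Crawler\Entity\Page;
use Doctrine\ORM\EntityManagerInterface;

class CrawledPagesAnalyzer
{
    public function __construct(private EntityManagerInterface $em)
    {
    }

    /** @return string[] URLs of all crawled pages. */
    public function listCrawledUrls(): array
    {
        $urls = [];
        foreach ($this->em->getRepository(Page::class)->findAll() as $page) {
            // Hypothetical getter: adjust to the Page entity's real fields.
            $urls[] = $page->getUrl();
        }
        return $urls;
    }
}
```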