Flexible website crawler which stores the results in persistent storage


The content import process is very complicated and unpredictable, especially the crawling phase. The main reason for this complexity is the vast number of different possible scenarios:

  • Some sites might run on the insecure HTTP protocol or use an invalid SSL certificate.
  • Some pages might contain broken links.
  • Some source sites might have strange redirects that need to be ignored or followed.
  • For some sites, the crawler needs to obey special limits: requests per second, timeout, etc.
  • Sometimes not all links need to be crawled, but only specific ones: links in the menus, links in the main content, etc.
  • Some site sections should be excluded from crawling, or crawling should be limited to specific sections.
  • Crawling a site might involve multiple domains/protocols.

It is clearly unrealistic to create a crawler that handles all possible cases out of the box. That is why we focused on creating a flexible crawler: all the listed scenarios can be handled through the provided configuration options.

This crawler stores its data in persistent storage (a database). It was designed to be used in content imports, but it is a separate component that can be used in any other scenario. Its only purpose is to crawl a site and store its pages' metadata in persistent storage. The metadata of the crawled pages can then be used for any purpose: import, analysis, or any custom functionality.

Installation

  1. Require contextualcode/crawler via composer:

     composer require contextualcode/crawler
    
  2. Run the migration:

     php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/crawler/src/Resources/config/doctrine_migrations.yaml --no-interaction
    

Usage

This section describes the basic usage concepts. Please check the usage example and reference pages for technical details.

The usage flow is the following:

  1. Implement a crawler handler.

    It should be a PHP class which extends ContextualCode\Crawler\Service\Handler. It has a lot of flexible configuration options, which are described in the reference. The simplest crawler handler only needs to provide an import identifier and a site domain:

     <?php
    
     namespace App\ContentImport;
    
     use ContextualCode\Crawler\Service\Handler as BaseCrawlerHandler;
    
     class CrawlerHandler extends BaseCrawlerHandler
     {
         public function getImportIdentifier(): string
         {
             return 'unique-identifier';
         }
    
         public function getDomain(): string
         {
             return 'www.site-to-crawl.com';
         }
     }
    
  2. Run the crawler:run command.

    This command requires only one argument: the crawler identifier defined in the previous step. A more detailed description of this command is available in the reference:

     php bin/console crawler:run unique-identifier
    
      To get live logs, please run the following command in a new terminal:
      tail -f /XXX/var/log/contextualcode-crawler.log
    
     Running the crawler ...
     =======================
    
     Url: http://www.site-to-crawl.com/
     Referer:
    
     282/282 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 7 secs/7 secs
    
     All links are processed:
      * 281 valid links
      * 1 invalid links
    
  3. Analyze and use the crawled pages' metadata.

    The command from the previous step populates ContextualCode\Crawler\Entity\Page entities in the database. They can be used for content import or any other custom functionality. A detailed explanation of what data is stored in those entities is available in the reference.
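
    As an illustration only, here is a minimal sketch of a Symfony console command that reads those entities with plain Doctrine calls. The App\Command namespace and the app:crawler:list-pages command name are arbitrary assumptions, not part of this bundle; check the reference for the fields the Page entity actually exposes:

     <?php

     namespace App\Command;

     use ContextualCode\Crawler\Entity\Page;
     use Doctrine\ORM\EntityManagerInterface;
     use Symfony\Component\Console\Attribute\AsCommand;
     use Symfony\Component\Console\Command\Command;
     use Symfony\Component\Console\Input\InputInterface;
     use Symfony\Component\Console\Output\OutputInterface;

     // Hypothetical command name, not provided by the bundle.
     #[AsCommand(name: 'app:crawler:list-pages')]
     class ListCrawledPagesCommand extends Command
     {
         public function __construct(private readonly EntityManagerInterface $entityManager)
         {
             parent::__construct();
         }

         protected function execute(InputInterface $input, OutputInterface $output): int
         {
             // Fetch every Page entity persisted by the crawler:run command.
             $pages = $this->entityManager->getRepository(Page::class)->findAll();

             $output->writeln(sprintf('Crawled pages stored in the database: %d', count($pages)));

             return Command::SUCCESS;
         }
     }

    Running php bin/console app:crawler:list-pages after crawler:run would then print how many pages were stored; any real import or analysis would iterate over the entities instead of just counting them.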