Flexible website crawler which stores the results in persistent storage

Type: symfony-bundle

v2.4.0 2024-02-20 23:27 UTC

README

The content import process is complicated and unpredictable, especially the crawling phase. The main reason for this complexity is the vast number of possible scenarios:

  • Some sites might run on the insecure HTTP protocol, or they might use an invalid SSL certificate.
  • There might be some broken links on some pages.
  • Some source sites might have strange redirects that need to be ignored/followed.
  • For some sites, the crawler needs to obey special limits: requests/second, timeout, etc.
  • Sometimes not all the links need to be crawled, but specific ones: links in the menus, links in the main content, etc.
  • Some site sections should be excluded from crawling, or crawling needs to be limited to specific site sections.
  • Crawling the site might involve multiple domains/protocols.

It is unrealistic to create a crawler that handles all possible cases out of the box. That's why we focused on creating a flexible crawler: it can handle all the listed scenarios through the provided configuration options.
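To make a few of the scenarios listed above concrete (multiple domains and protocols, site sections excluded from crawling), here is a minimal standalone sketch of a URL filter. It is not part of contextualcode/crawler and does not use its configuration API: the function name shouldCrawl and its parameters are hypothetical and only illustrate the kind of decision a configurable crawler has to make for every link it finds.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of contextualcode/crawler.
 * Decides whether a discovered link should be crawled.
 *
 * @param string[] $allowedHosts         lower-case host names that may be crawled
 * @param string[] $excludedPathPrefixes path prefixes (site sections) to skip
 */
function shouldCrawl(string $url, array $allowedHosts, array $excludedPathPrefixes): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        // Malformed URL, or a link without a host (mailto:, relative path, etc.).
        return false;
    }

    // Only plain web protocols are considered crawlable in this sketch.
    $scheme = strtolower($parts['scheme'] ?? '');
    if (!in_array($scheme, ['http', 'https'], true)) {
        return false;
    }

    // Crawling may span several domains, but only the ones explicitly listed.
    if (!in_array(strtolower($parts['host']), $allowedHosts, true)) {
        return false;
    }

    // Skip every section whose path starts with an excluded prefix.
    $path = $parts['path'] ?? '/';
    foreach ($excludedPathPrefixes as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }

    return true;
}
```

With the allowed hosts set to ['www.site-to-crawl.com'] and the excluded prefixes set to ['/admin'], a link to http://www.site-to-crawl.com/news/1 would be accepted, while links under /admin or on any other host would be skipped.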

This crawler stores its data in persistent storage (a database). It was designed for content imports, but it is a separate component that can be used in any other use case. Its only purpose is to crawl the site and store the metadata of its pages in persistent storage. The metadata of the crawled pages can then be used for any purpose: import, analysis, or any custom functionality.

Installation

  1. Require contextualcode/crawler via composer:

     composer require contextualcode/crawler
    
  2. Run the migration:

     php bin/console doctrine:migrations:migrate --configuration=vendor/contextualcode/crawler/src/Resources/config/doctrine_migrations.yaml --no-interaction
    

Usage

This section describes the basic usage concepts. Please check the usage example and reference pages for technical details.

The usage flow is as follows:

  1. Implement a crawler handler.

    It should be a PHP class that extends ContextualCode\Crawler\Service\Handler. It has many configuration options, which are described in the reference. The simplest crawler handler needs to provide an import identifier and a site domain:

     <?php
    
     namespace App\ContentImport;
    
     use ContextualCode\Crawler\Service\Handler as BaseCrawlerHandler;
    
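     class CrawlerHandler extends BaseCrawlerHandler
     {
         // Unique identifier of this import; it is passed to the crawler:run command.
         public function getImportIdentifier(): string
         {
             return 'unique-identifier';
         }
    
         // Domain of the site to crawl.
         public function getDomain(): string
         {
             return 'www.site-to-crawl.com';
         }
     }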
    
  2. Run the crawler:run command.

    This command takes a single argument: the import identifier defined in the previous step. A more detailed description of this command is available in the reference:

     php bin/console crawler:run unique-identifier
    
      To get live logs, please run the following command in a new terminal:
      tail -f /XXX/var/log/contextualcode-crawler.log
    
     Running the crawler ...
     =======================
    
     Url: http://www.site-to-crawl.com/
     Referer:
    
     282/282 [▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓] 100% 7 secs/7 secs
    
     All links are processed:
      * 281 valid links
      * 1 invalid links
    
  3. Analyze and use the crawled pages' metadata.

    The command from the previous step populates ContextualCode\Crawler\Entity\Page entities in the database. They can be used for content import or any other custom functionality. A detailed explanation of the data stored in those entities is available in the reference.
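As a starting point for the last step, the stored pages can be read back through Doctrine. The sketch below is an assumption-laden illustration rather than part of the bundle: the CrawledPageReader class is hypothetical, it assumes PHP 8 and that the Page entity is mapped in the default entity manager, and it deliberately does not touch any Page fields, since those are documented in the reference.

```php
<?php

namespace App\ContentImport;

use ContextualCode\Crawler\Entity\Page;
use Doctrine\ORM\EntityManagerInterface;

/**
 * Hypothetical helper, NOT shipped with contextualcode/crawler.
 * Reads the Page entities populated by the crawler:run command.
 */
class CrawledPageReader
{
    public function __construct(private EntityManagerInterface $entityManager)
    {
    }

    /** Number of Page entities currently stored in the database. */
    public function countPages(): int
    {
        return $this->entityManager->getRepository(Page::class)->count([]);
    }

    /**
     * Iterates over all stored pages one by one instead of loading
     * them all into memory at once.
     *
     * @return iterable<Page>
     */
    public function iteratePages(): iterable
    {
        $dql = sprintf('SELECT p FROM %s p', Page::class);

        return $this->entityManager->createQuery($dql)->toIterable();
    }
}
```

In a Symfony application with autowiring enabled, such a class could be injected into an import command or service; which Page properties to read from each entity depends on the fields described in the reference.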