marioungui/php-component-spider

A PHP package for scraping brand websites

v0.7.2 2024-07-22 21:39 UTC

README


PHP Component Spider scrapes websites for specific components or text that match XPath filters. It uses the PHPScraper library to fetch and process web pages, and the League\Csv library to log the results to CSV files. New XPath filters can be added to cover other scraping needs.
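The core idea (apply an XPath filter to a page, then log each match as a CSV row) can be sketched with PHP's built-in DOM extension and fputcsv. This is a minimal illustration, not the tool's actual code: it deliberately avoids PHPScraper and League\Csv so that it runs without Composer, and the function names findMatches and logMatches, the sample HTML, and the CSV column layout are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Return the trimmed text of every node in $html matched by the XPath $filter.
 * Hypothetical helper built on DOMDocument/DOMXPath, not on PHPScraper.
 */
function findMatches(string $html, string $filter): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($filter);
    if ($nodes === false) {
        return []; // the XPath expression was invalid
    }

    $matches = [];
    foreach ($nodes as $node) {
        $matches[] = trim($node->textContent);
    }
    return $matches;
}

/**
 * Append one CSV row per match: URL, component name, matched text.
 * The column layout is an assumption made for this sketch.
 */
function logMatches(string $csvPath, string $url, string $component, array $matches): void
{
    $handle = fopen($csvPath, 'a');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath for writing");
    }
    foreach ($matches as $text) {
        fputcsv($handle, [$url, $component, $text], ',', '"', '\\');
    }
    fclose($handle);
}

// Usage with an inline HTML snippet instead of a fetched page.
$html = '<div class="hero-banner">Welcome</div><p>Other</p>';
$matches = findMatches($html, "//*[@class='hero-banner']");
logMatches(sys_get_temp_dir() . '/spider-results.csv', 'https://example.com/', 'Hero Banner', $matches);
```

Keeping the matching step separate from the logging step means either one could be swapped for the library the project really uses without touching the other.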

Features

  • Scrape websites for specific components or text based on XPath filters.
  • Log results into CSV files for further analysis.
  • Configurable timeout and maximum redirects.
  • Easy to extend with additional filters.
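To illustrate what a configurable timeout and redirect limit look like in plain PHP, the sketch below builds an HTTP stream context with both settings. How the tool itself passes these values to PHPScraper is not shown in this README, so the helper name buildHttpContext and its default values are assumptions; only the timeout, follow_location, and max_redirects context options are standard PHP.

```php
<?php
declare(strict_types=1);

/**
 * Build an HTTP stream context with a read timeout (in seconds) and a cap on
 * the number of redirects to follow. Hypothetical helper for illustration only.
 *
 * @return resource
 */
function buildHttpContext(float $timeoutSeconds = 10.0, int $maxRedirects = 5)
{
    return stream_context_create([
        'http' => [
            'timeout'         => $timeoutSeconds, // give up on slow responses
            'follow_location' => 1,               // follow Location headers
            'max_redirects'   => $maxRedirects,   // a value of 1 or less follows none
        ],
    ]);
}

// Inspect the options that would be applied to a request.
$context = buildHttpContext(15.0, 3);
print_r(stream_context_get_options($context));
```

A context built this way can be passed as the third argument of file_get_contents to fetch a page under those limits.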

Requirements

  • PHP 8.1 or higher
  • Composer

Build & Run from Source Code

  1. Clone the repository:
git clone https://github.com/marioungui/PHP-Component-Spider.git
  2. Navigate to the project directory:
cd PHP-Component-Spider
  3. Install the dependencies using Composer:
composer install
  4. Build the Phar package:
php -d phar.readonly=0 phar-creator.php
  5. Run the batch file spider.bat.
  6. Follow the on-screen instructions to select the component to search for and the domain to scrape.

Filters

The filters are defined in filters.php and use XPath to identify specific components on web pages. The currently available filters are listed in that file.

Extending with Custom Filters

Extending the tool with new filters is simple:

  1. Open the filters.php file.
  2. Add a new case in the switch statement with your component name or index.
  3. Define the $component and $filter variables with your custom XPath.

Example:

case 'new-component':
case 11:
    $component = "New Component";
    $filter = "//*[@class='new-component-class']";
    break;
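A new filter can be checked against sample markup before running a full crawl. The sketch below wraps the example case in a hypothetical function, resolveFilter, so it can run on its own; the real filters.php may be organized differently, and the sample HTML is invented for the check. It also shows that @class='...' only matches elements whose class attribute is exactly that string, and gives a token-based alternative.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper around the switch shown above: maps a component name
 * or numeric index to its label and XPath filter, or null if unknown.
 */
function resolveFilter(string|int $selection): ?array
{
    switch ($selection) {
        case 'new-component':
        case 11:
            $component = "New Component";
            $filter = "//*[@class='new-component-class']";
            break;
        default:
            return null;
    }
    return ['component' => $component, 'filter' => $filter];
}

// Sample markup: one element with the exact class, one with an extra class.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="new-component-class">A</div>' .
    '<div class="new-component-class wide">B</div>'
);
$xpath = new DOMXPath($doc);

// The exact-match filter finds only the first div.
$resolved = resolveFilter('new-component');
echo $resolved['component'], ': ', $xpath->query($resolved['filter'])->length, "\n"; // New Component: 1

// Matching the class as a whitespace-separated token finds both divs.
$tokenFilter = "//*[contains(concat(' ', normalize-space(@class), ' '), ' new-component-class ')]";
echo 'Token match: ', $xpath->query($tokenFilter)->length, "\n"; // Token match: 2
```

If the target component usually carries several classes, the token-based expression is likely the safer choice for the $filter value.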

Contributing

Feel free to submit issues or pull requests if you have any improvements or new features you'd like to add.

License

This project is licensed under the MIT License.