johndetomal/browser-channel

PHP library for browser-based scraping


github.com/John-detomal/BrowserDriverEngine

pkg:composer/johndetomal/browser-channel



README


🎯 Purpose & Practical Use

This project was created primarily as a learning exercise to understand how scalable web content retrieval systems are designed.

It explores real-world concepts such as:

  • Multi-driver request handling (HTTP, cURL, headless browser)
  • Proxy-aware routing systems
  • Driver pooling and lifecycle management
  • Fallback execution strategies
  • Basic response classification and reliability handling

While the project is not a production scraping product, it is structured so that it can serve as a foundation or base engine for more advanced scraping or automation systems.

💡 Why This Project

Many scraping solutions fail once websites become dynamic or bot-protected.

This engine is designed to improve reliability by combining multiple strategies:

  • Lightweight requests for speed
  • Browser automation for complex pages
  • Intelligent fallback between methods

The goal is simple: maximize success rate while keeping performance efficient.

💼 Real-World Value

This project can serve as the foundation for:

  • Data collection systems
  • Monitoring tools
  • Automation pipelines
  • Custom scraping APIs

It focuses on reliable data acquisition, the most critical layer in any scraping workflow.

🚀 Browser Scraper Engine

A scalable, driver-based web scraping engine designed to reliably retrieve web page content using multiple strategies such as HTTP, cURL, and headless browser automation.

Built to handle dynamic websites, fallback failures, proxy rotation, and caching in a structured and extensible architecture.

✨ Key Features

  • 🧠 Multi-driver system (Curl / HTTP / Browser automation)
  • 🔁 Automatic fallback between scraping strategies
  • 🌐 Proxy rotation support (improves success rate)
  • ⚡ File-based caching system
  • 🧩 Modular architecture (easy to extend)
  • 📊 Driver-level success/failure tracking
  • 🧪 Debug mode for monitoring

๐Ÿ— Architecture Overview

The system uses a driver-based approach to fetch web pages: if one method fails, it automatically retries using alternative drivers.

This improves reliability across different website types and protection levels.
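The fallback idea can be sketched in a few lines. This is an illustrative stand-alone version, not the library's actual code: the drivers are modeled as callables, and `fetchWithFallback` is a hypothetical name.

```php
<?php
// Hypothetical sketch of driver fallback: try each driver in order
// until one returns a usable response. Driver callables and the
// response shape are illustrative, not the library's internals.
function fetchWithFallback(string $url, array $drivers): array
{
    foreach ($drivers as $name => $driver) {
        $response = $driver($url);
        if ($response['status_code'] === 200) {
            // Record which driver ultimately succeeded.
            return $response + ['driver' => $name];
        }
    }
    return ['status_code' => 0, 'content' => null, 'driver' => null];
}

// Stand-in drivers: the lightweight HTTP driver "fails",
// so the engine falls through to cURL.
$drivers = [
    'http'    => fn (string $url) => ['status_code' => 503, 'content' => null],
    'curl'    => fn (string $url) => ['status_code' => 200, 'content' => '<html>...</html>'],
    'browser' => fn (string $url) => ['status_code' => 200, 'content' => '<html>...</html>'],
];

$result = fetchWithFallback('https://example.com', $drivers);
```

Here `$result['driver']` ends up as `'curl'`: the cheapest strategy that succeeds wins, which is the core of the engine's reliability/performance trade-off.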

📦 Installation

1. Install PHP dependencies

composer require johndetomal/browser-channel

2. Install Node.js dependencies (for browser driver)

cd node
npm install

🚀 Running the Browser Engine

Start the Puppeteer service:

node server.js

Default endpoint: http://localhost:3000

โš™๏ธ Basic Usage (Quick Start)

use Browser\Services\Browser\BrowserService;
use Browser\Services\Browser\Enum\BrowserDriver;

$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Curl,
    ]
]);

$response = $scraper->openPage("https://example.com");

echo $response['content'];

📊 Response Format

[
    'content'            => '<html>...</html>',
    'status_code'        => 200,
    'retries'            => 1,
    'process_start_time' => 458252,
    'process_end_time'   => 458828252,
    'message'            => 'success',
    'reason'             => $reason,      // set when a request fails
    'driver'             => $driverType,  // driver that produced the response
]
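A typical way to consume that array is to branch on the status code. The snippet below uses a hard-coded response with the shape shown above; the timing fields are omitted for brevity.

```php
<?php
// Consuming the response array described above (shape taken from the
// README; values are hard-coded here for illustration).
$response = [
    'content'     => '<html>...</html>',
    'status_code' => 200,
    'retries'     => 1,
    'message'     => 'success',
    'driver'      => 'curl',
];

if ($response['status_code'] === 200) {
    // Hand the HTML off to a parser of your choice.
    $html = $response['content'];
} else {
    // Log enough context to diagnose which driver/strategy failed.
    error_log("Scrape failed after {$response['retries']} retries: {$response['message']}");
}
```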

💾 Caching System

'settings' => [
    'cache' => true,
]
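Conceptually, a file-based cache like this keys each URL to a file on disk and reuses it while it is fresh. The sketch below shows the general technique only; the function name, cache directory, and TTL are illustrative, not the library's internals.

```php
<?php
// Minimal sketch of a file-based cache, the general technique behind
// the 'cache' setting. Directory layout and TTL are assumptions.
function cachedFetch(string $url, callable $fetch, string $dir, int $ttl = 3600): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $file = $dir . '/' . sha1($url) . '.html';

    // Cache hit: file exists and is still fresh.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Cache miss: fetch via the provided callable and store the result.
    $content = $fetch($url);
    file_put_contents($file, $content);
    return $content;
}
```

With caching enabled, repeated requests for the same URL skip the network entirely, which matters most for the slow browser-automation driver.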

🧪 Debug Mode

$this->isDebugMode = true;

Debug output includes:

  • Driver used
  • Proxy used
  • Request status
  • Success/failure tracking
  • Response message

๐ŸŒ Proxy Configuration

$scraper->proxies([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);
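The simplest rotation strategy over a proxy list like this is round-robin. The class below is an illustrative sketch of that strategy, not the library's own rotator.

```php
<?php
// Illustrative round-robin rotation over a proxy list in the same
// ['ip' => ..., 'port' => ...] shape used by proxies() above.
final class ProxyRotator
{
    private int $index = 0;

    /** @param array<array{ip: string, port: string}> $proxies */
    public function __construct(private array $proxies)
    {
    }

    // Return the next proxy, wrapping around at the end of the list.
    public function next(): array
    {
        $proxy = $this->proxies[$this->index];
        $this->index = ($this->index + 1) % count($this->proxies);
        return $proxy;
    }
}

$rotator = new ProxyRotator([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);
```

Round-robin spreads requests evenly; a scoring strategy (preferring proxies with recent successes) is a natural extension.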

๐Ÿ” Fallback System

If the primary method fails, the engine automatically switches between:

  • Primary configured driver
  • HTTP driver
  • cURL driver
  • Browser automation driver

This improves reliability across different website structures.

🧩 Driver Strategy

Each driver serves a different purpose:

  • Curl / HTTP → fast, lightweight requests
  • Browser (Puppeteer) → full rendering for JavaScript-heavy sites

📈 Scalability & Architecture

This project is designed for scalability: its modular architecture allows extension without modifying core logic.

You can extend the system by adding:

  • New drivers (e.g. Playwright)
  • Custom proxy strategies
  • Advanced caching layers
  • Enhanced response handling
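One way a new driver could slot into a design like this is behind a small interface, so the fallback loop never needs to know which strategy it is calling. The interface name and shape below are hypothetical, for illustration only.

```php
<?php
// Hypothetical extension point: each strategy implements one small
// interface, so new drivers plug in without touching core logic.
interface DriverInterface
{
    /** @return array{content: ?string, status_code: int} */
    public function fetch(string $url): array;
}

final class PlaywrightDriver implements DriverInterface
{
    public function fetch(string $url): array
    {
        // A real implementation would forward the request to a local
        // Playwright service, much as the Puppeteer driver talks to
        // node/server.js. Stubbed out here.
        return ['content' => null, 'status_code' => 0];
    }
}
```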

📌 Use Cases

This engine acts as a data acquisition layer and can be used for:

  • Web page content collection
  • Data extraction pipelines (with custom parsers)
  • Website monitoring and change detection
  • Automation workflows
  • Research and large-scale data collection

โš ๏ธ Notes

  • Puppeteer requires a working Chrome/Chromium environment
  • Some Linux servers may require additional dependencies
  • Curl/HTTP drivers work without Node.js

๐Ÿค Contributions

This project is open to contributions and improvements.

Developers are welcome to:

  • Add new scraping drivers
  • Improve proxy rotation and scoring logic
  • Enhance caching mechanisms
  • Optimize performance and reliability
  • Suggest architectural improvements

All constructive feedback is appreciated.

โš ๏ธ Limitations

This system is optimized for public and moderately protected websites.

Performance depends on:

  • Website protection level (anti-bot systems)
  • Proxy quality
  • Request patterns and concurrency

Some heavily protected websites may require additional strategies.