johndetomal / browser-channel
PHP library for browser-based scraping
Requires
- php: ^8.4
- amphp/amp: ^3.1
- amphp/byte-stream: ^2.1
- amphp/http-client: ^5.3
- amphp/http-client-cookies: ^2.0
- amphp/socket: ^2.3
- guzzlehttp/guzzle: ^7.10
- illuminate/support: ^13.4
Requires (Dev)
- fakerphp/faker: ^1.24
- mockery/mockery: ^1.6
- phpunit/phpunit: ^12.5
README
Purpose & Practical Use
This project was created primarily as a learning exercise to understand how scalable web content retrieval systems are designed.
It explores real-world concepts such as:
- Multi-driver request handling (HTTP, cURL, headless browser)
- Proxy-aware routing systems
- Driver pooling and lifecycle management
- Fallback execution strategies
- Basic response classification and reliability handling
While the project is not a production scraping product, it is structured so that it can serve as a foundation or base engine for more advanced scraping or automation systems.
Why This Project
Most scraping solutions fail when websites become dynamic or protected.
This engine is designed to improve reliability by combining multiple strategies:
- Lightweight requests for speed
- Browser automation for complex pages
- Intelligent fallback between methods
The goal is simple: maximize success rate while keeping performance efficient.
Real-World Value
This project can serve as the foundation for:
- Data collection systems
- Monitoring tools
- Automation pipelines
- Custom scraping APIs
It focuses on reliable data acquisition, the most critical layer in any scraping workflow.
Browser Scraper Engine
A scalable, driver-based web scraping engine designed to reliably retrieve web page content using multiple strategies such as HTTP, cURL, and headless browser automation.
Built to handle dynamic websites, fallback failures, proxy rotation, and caching in a structured and extensible architecture.
Key Features
- Multi-driver system (cURL / HTTP / browser automation)
- Automatic fallback between scraping strategies
- Proxy rotation support (improves success rate)
- File-based caching system
- Modular architecture (easy to extend)
- Driver-level success/failure tracking
- Debug mode for monitoring
Architecture Overview
The system uses a driver-based approach to fetch web pages.
If one driver fails, the engine automatically retries the request with an alternative driver.
This improves reliability across different website types and protection levels.
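To make the idea concrete, here is a minimal, illustrative sketch of a driver-based fallback loop. The function name and the callable-based "driver" shape are invented for this sketch and are not the package's actual internals:

// Illustrative only: each "driver" is modeled as a callable that returns
// ['content' => ..., 'status_code' => ...] or throws on failure.
function fetchWithFallback(string $url, array $drivers): array
{
    $lastError = null;

    foreach ($drivers as $name => $driver) {
        try {
            $response = $driver($url);

            // Accept any 2xx response; otherwise move on to the next driver.
            if ($response['status_code'] >= 200 && $response['status_code'] < 300) {
                $response['driver'] = $name;
                return $response;
            }
        } catch (Throwable $e) {
            $lastError = $e;
        }
    }

    throw new RuntimeException('All drivers failed for ' . $url, 0, $lastError);
}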
Installation
1. Install PHP dependencies
composer require johndetomal/browser-channel
2. Install Node.js dependencies (for browser driver)
cd node
npm install
Running the Browser Engine
Start the Puppeteer service (from the node directory):
node server.js
Default endpoint: http://localhost:3000
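A minimal way to verify from PHP that the service is listening on its default port. This only checks that the port answers; the service's HTTP routes are not documented here:

// Check that the Puppeteer service is reachable on its default port (3000).
$socket = @fsockopen('localhost', 3000, $errno, $errstr, 2);

if ($socket !== false) {
    echo "Browser engine reachable at http://localhost:3000" . PHP_EOL;
    fclose($socket);
} else {
    echo "Browser engine not reachable: {$errstr} ({$errno})" . PHP_EOL;
}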
Basic Usage (Quick Start)
use Browser\Services\Browser\BrowserService;
use Browser\Services\Browser\Enum\BrowserDriver;

$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Curl,
    ],
]);

$response = $scraper->openPage("https://example.com");

echo $response['content'];
Response Format
[
    'content' => '<html>...</html>',
    'status_code' => 200,
    'retries' => 1,
    'process_start_time' => 458252,
    'process_end_time' => 458828252,
    'message' => 'success',
    'reason' => $reason,
    'driver' => $driverType,
]
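For example, a caller can branch on these fields. The field names are taken from the array above; the possible values of 'message' and 'reason' beyond the 'success' shown here are assumptions:

$response = $scraper->openPage("https://example.com");

if ($response['status_code'] === 200 && $response['message'] === 'success') {
    $html = $response['content'];
    // ... parse or store the HTML
} else {
    // 'reason' describes why the request did not succeed.
    echo "Request failed: " . print_r($response['reason'], true) . PHP_EOL;
}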
Caching System
Enable the file-based cache via the settings array:
'settings' => [
    'cache' => true,
]
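For instance, combined with the constructor from the quick-start example, assuming the cache flag is passed alongside the driver as the fragment above suggests:

$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Curl,
        'cache'  => true, // repeated requests can be served from the file cache
    ],
]);

$response = $scraper->openPage("https://example.com");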
Debug Mode
$this->isDebugMode = true;
Debug output includes:
- Driver used
- Proxy used
- Request status
- Success/failure tracking
- Response message
Proxy Configuration
Register a pool of proxies on the scraper instance; the engine rotates between them:
$scraper->proxies([
    ['ip' => '127.0.0.1', 'port' => '8080'],
    ['ip' => '127.0.0.2', 'port' => '8080'],
]);
Fallback System
If the primary method fails, the engine automatically works through the available drivers in order:
- Primary configured driver
- HTTP driver
- cURL driver
- Browser automation driver
This improves reliability across different website structures.
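Which driver ultimately served the page is visible in the response array (see the response format above), so a caller can tell whether a fallback occurred. The exact type of the 'driver' field (enum case or string) is not documented, so the example just prints it:

$response = $scraper->openPage("https://example.com");

// 'driver' reports which driver produced the final response;
// 'retries' reports how many attempts were needed.
echo "Driver used: " . print_r($response['driver'], true) . PHP_EOL;
echo "Retries: " . $response['retries'] . PHP_EOL;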
Driver Strategy
Each driver serves a different purpose:
- cURL / HTTP: fast, lightweight requests
- Browser (Puppeteer): full rendering for JavaScript-heavy sites
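For a JavaScript-heavy page you would start directly with the browser driver. Note that only the BrowserDriver::Curl case appears in this README; BrowserDriver::Browser below is a placeholder, so check the enum for the actual case name:

// NOTE: BrowserDriver::Browser is a placeholder case name, not confirmed by this README.
$scraper = new BrowserService([
    'settings' => [
        'driver' => BrowserDriver::Browser,
    ],
]);

$response = $scraper->openPage("https://example.com/spa-page");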
Scalability & Architecture
This project is designed for scalability.
It uses a modular architecture that allows extension without modifying core logic.
You can extend the system by adding:
- New drivers (e.g. Playwright; see the sketch after this list)
- Custom proxy strategies
- Advanced caching layers
- Enhanced response handling
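As an illustration of the first extension point, a hypothetical Playwright-backed driver might look like the sketch below. The DriverInterface name, the fetch() signature, and the rendering service URL are invented for this example; the package's actual driver contract may differ.

// Hypothetical sketch: interface name and method signature are placeholders,
// not the package's real driver contract.
interface DriverInterface
{
    public function fetch(string $url): array;
}

class PlaywrightDriver implements DriverInterface
{
    public function __construct(private string $serviceUrl = 'http://localhost:3001') {}

    public function fetch(string $url): array
    {
        // Delegate rendering to an external Playwright service
        // (assumed here to accept a ?url=... query parameter).
        $content = @file_get_contents($this->serviceUrl . '/render?url=' . urlencode($url));

        return [
            'content'     => $content !== false ? $content : '',
            'status_code' => $content !== false ? 200 : 0,
        ];
    }
}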
Use Cases
This engine acts as a data acquisition layer and can be used for:
- Web page content collection
- Data extraction pipelines (with custom parsers)
- Website monitoring and change detection
- Automation workflows
- Research and large-scale data collection
Notes
- Puppeteer requires a working Chrome/Chromium environment
- Some Linux servers may require additional dependencies
- Curl/HTTP drivers work without Node.js
Contributions
This project is open to contributions and improvements.
Developers are welcome to:
- Add new scraping drivers
- Improve proxy rotation and scoring logic
- Enhance caching mechanisms
- Optimize performance and reliability
- Suggest architectural improvements
All constructive feedback is appreciated.
Limitations
This system is optimized for public and moderately protected websites.
Performance depends on:
- Website protection level (anti-bot systems)
- Proxy quality
- Request patterns and concurrency
Some heavily protected websites may require additional strategies.