shammaa / laravel-smart-scraper
Advanced intelligent web scraper for Laravel with caching, rate limiting, middleware, monitoring, and much more. Built on Puppeteer with smart features.
pkg:composer/shammaa/laravel-smart-scraper
Requires
- php: ^8.1
- illuminate/console: ^9.0|^10.0|^11.0|^12.0
- illuminate/queue: ^9.0|^10.0|^11.0|^12.0
- illuminate/support: ^9.0|^10.0|^11.0|^12.0
- symfony/css-selector: ^5.0|^6.0|^7.0
- symfony/dom-crawler: ^5.0|^6.0|^7.0
Requires (Dev)
- orchestra/testbench: ^8.0|^9.0
- phpunit/phpunit: ^10.0
README
Advanced intelligent web scraper for Laravel with caching, rate limiting, middleware, monitoring, and much more. Built on Puppeteer with smart features that make web scraping professional, efficient, and reliable.
🚀 Key Features
- ✅ Intelligent Caching - Automatic caching to avoid redundant requests
- ✅ Rate Limiting - Prevent overwhelming target websites
- ✅ User-Agent Rotation - Rotate user agents automatically to avoid detection
- ✅ Middleware System - Extensible middleware for request manipulation
- ✅ Automatic Retry - Exponential backoff retry logic for failed requests
- ✅ Screenshot & PDF - Capture screenshots and generate PDFs
- ✅ Proxy Support - Full proxy support with authentication
- ✅ Monitoring & Logging - Comprehensive monitoring and logging
- ✅ Schema Validation - Validate extracted data against schemas
- ✅ Concurrent Scraping - Scrape multiple URLs concurrently
- ✅ Queue Support - Process scraping jobs in background
- ✅ Error Handling - Robust error handling and recovery
- ✅ Smart Site Detection - Automatically detect site type and use appropriate selectors
- ✅ Multi-Site Support - Handle multiple websites with different HTML structures intelligently
📦 Installation
Install the package via Composer:
composer require shammaa/laravel-smart-scraper
Publish the configuration file:
php artisan vendor:publish --tag=smart-scraper-config
Install Node.js dependencies (required for Puppeteer):
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Note: Make sure Node.js is installed and available in your PATH. If using NVM, see the NVM Configuration section below.
🎯 Quick Start
1. Create a Scraper
Generate a scraper class using the Artisan command:
php artisan make:scraper ProductScraper
This creates a file at app/Scrapers/ProductScraper.php:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'description' => $crawler->filter('.description')->text(''),
        ];
    }
}
```
2. Use the Scraper
```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(10000)
    ->run();

dd($data);
```
📚 Basic Usage
Simple Scraping
```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')->run();
```
With Options
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(20000)                        // 20 seconds timeout
    ->proxy('ip:port', 'user', 'pass')      // Use proxy
    ->headers(['Accept-Language' => 'en'])  // Custom headers
    ->retry(3, 5)                           // Retry 3 times, wait 5 seconds
    ->cache(false)                          // Disable caching
    ->run();
```
With Parameters
You can pass parameters to the handle() method:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(string $selector = 'h1'): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter($selector)->text(''),
        ];
    }
}
```
Then use it:
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->run(selector: '.product-title');
```
🔧 Advanced Features
Caching
The scraper automatically caches results to avoid redundant requests:
```php
// Enable caching (default)
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(true)
    ->run();

// Disable caching
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(false)
    ->run();
```
Cache TTL can be configured in config/smart-scraper.php:
```php
'cache' => [
    'ttl' => 3600, // 1 hour
],
```
Rate Limiting
Prevent overwhelming target websites with rate limiting:
```php
// Rate limiting is enabled by default
$data = ProductScraper::scrape('https://example.com/product/123')->run();

// Disable rate limiting
$data = ProductScraper::scrape('https://example.com/product/123')
    ->rateLimit(false)
    ->run();
```
Configure rate limits in config/smart-scraper.php:
```php
'rate_limit' => [
    'enabled' => true,
    'max_requests' => 10, // Max 10 requests
    'per_seconds' => 60,  // Per 60 seconds
],
```
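The `max_requests` / `per_seconds` pair describes a windowed limiter: at most N requests within any rolling window of that many seconds. A minimal, language-agnostic sketch of the idea in Python (illustrative only, not the package's implementation):

```python
import time
from collections import deque


class WindowRateLimiter:
    """Allow at most max_requests within any rolling per_seconds window."""

    def __init__(self, max_requests: int, per_seconds: float):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()  # times of recently allowed requests

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

With the defaults above (10 requests per 60 seconds), the 11th request inside a window is rejected until the oldest one ages out.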
User-Agent Rotation
User agents are automatically rotated to avoid detection:
// Rotation is enabled by default $data = ProductScraper::scrape('https://example.com/product/123')->run();
Configure user agents in config/smart-scraper.php:
```php
'user_agent' => [
    'rotation_enabled' => true,
    'agents' => [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        // Add more user agents
    ],
],
```
Proxy Support
Full proxy support with authentication:
```php
// Without authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200')
    ->run();

// With authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200', 'username', 'password')
    ->run();
```
Retry Logic
Automatic retry with exponential backoff:
```php
// Retry 3 times, wait 1 second between attempts
$data = ProductScraper::scrape('https://example.com/product/123')
    ->retry(3, 1)
    ->run();
```
Configure default retry settings in config/smart-scraper.php:
```php
'retry' => [
    'enabled' => true,
    'max_attempts' => 3,
    'initial_delay' => 1,       // seconds
    'max_delay' => 60,          // seconds
    'backoff_multiplier' => 2,  // Exponential backoff
    'retryable_status_codes' => [408, 429, 500, 502, 503, 504],
],
```
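Under exponential backoff with these settings, the wait before each retry grows as `initial_delay * backoff_multiplier^(attempt - 1)`, capped at `max_delay`. A quick illustrative sketch of the resulting schedule (Python, not the package's internals):

```python
def backoff_delays(max_attempts=3, initial_delay=1, backoff_multiplier=2, max_delay=60):
    """Delay in seconds before each attempt: exponential growth, capped at max_delay."""
    return [
        min(initial_delay * backoff_multiplier ** i, max_delay)
        for i in range(max_attempts)
    ]
```

With the defaults the delays are 1s, 2s, 4s; with more attempts the schedule would continue 8, 16, 32, then flatten at the 60-second cap.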
Screenshots
Capture screenshots of web pages:
```php
// Save screenshot to file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true, storage_path('app/screenshots/product.png'))
    ->run();

// Get screenshot as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true)
    ->run();

$screenshotBase64 = $data['screenshot'] ?? null;
```
PDF Generation
Generate PDFs from web pages:
```php
// Save PDF to file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true, storage_path('app/pdfs/product.pdf'))
    ->run();

// Get PDF as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true)
    ->run();

$pdfBase64 = $data['pdf'] ?? null;
```
Custom Headers
Add custom headers to requests:
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->headers([
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept' => 'text/html,application/xhtml+xml',
        'X-Custom-Header' => 'value',
    ])
    ->run();
```
Data Validation
Validate extracted data against a schema:
```php
use Shammaa\LaravelSmartScraper\Services\SchemaValidatorService;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->validate(function ($data) {
        $validator = new SchemaValidatorService();

        return $validator->validate($data, [
            'title' => ['required' => true, 'type' => 'string'],
            'price' => ['required' => true, 'type' => 'string'],
            'description' => ['required' => false, 'type' => 'string'],
        ]);
    })
    ->run();
```
Middleware
Create custom middleware to modify requests:
```php
use Shammaa\LaravelSmartScraper\Contracts\MiddlewareInterface;

class CustomHeaderMiddleware implements MiddlewareInterface
{
    public function handle(array $options): array
    {
        $options['headers']['X-Custom'] = 'value';

        return $options;
    }
}

// Use middleware
$data = ProductScraper::scrape('https://example.com/product/123')
    ->middleware(new CustomHeaderMiddleware())
    ->run();
```
🎨 Artisan Commands
Create a Scraper
php artisan make:scraper ProductScraper
List All Scrapers
php artisan list:scrapers
Test a Scraper
php artisan scraper:test "App\Scrapers\ProductScraper" "https://example.com/product/123"
🧠 Smart Site Detection & Multi-Site Support
The scraper can intelligently detect different websites and automatically use the appropriate selectors for each site. This means you can scrape multiple websites with different HTML structures using the same scraper class!
How It Works
- Automatic Site Detection - The scraper detects the site type from URL patterns or HTML patterns
- Smart Selectors - Uses site-specific selectors if available, falls back to generic selectors
- Fallback System - If a selector fails, it automatically tries the next one
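The fallback behaviour boils down to "the first selector that yields a non-empty result wins". Sketched in Python with a hypothetical `query` callable standing in for the real DOM lookup (illustrative only):

```python
def extract_with_fallback(query, selectors, default=None):
    """Return the first non-empty result among the candidate selectors.

    `query` is any callable mapping a CSS selector to extracted text
    (empty string on a miss); it stands in for the real DOM lookup.
    """
    for selector in selectors:
        value = query(selector)
        if value:
            return value
    return default
```

So a selector list like `['h1', '.title', '[itemprop="name"]']` degrades gracefully: site-specific selectors are simply placed earlier in the list than generic ones.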
Creating a Smart Scraper
Use the --smart flag when creating a scraper:
php artisan make:scraper ProductScraper --smart
This creates a scraper with smart selectors:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $smart = $this->smart();

        return [
            // Smart extraction - tries multiple selectors automatically
            'title' => $smart->extract('title', [
                'h1',
                '.title',
                '[itemprop="name"]',
                'title',
            ]),
            'price' => $smart->extract('price', [
                '.price',
                '[itemprop="price"]',
                '.amount',
                '.cost',
            ]),
            'image' => $smart->extractAttribute('image', [
                'img.main-image',
                '.product-image img',
                '[itemprop="image"]',
                'img',
            ], 'src'),
            'description' => $smart->extract('description', [
                '.description',
                '[itemprop="description"]',
                '.content',
                'p',
            ]),
        ];
    }
}
```
Smart Selector Methods
extract() - Extract text content
Tries multiple selectors until one works:
```php
$title = $smart->extract('title', [
    'h1.product-title',
    'h1',
    '.title',
    '[itemprop="name"]',
], 'Default Title');
```
extractAttribute() - Extract attribute value
Tries multiple selectors to extract an attribute:
```php
$image = $smart->extractAttribute('image', [
    'img.main-image',
    '.product-image img',
    '[itemprop="image"]',
], 'src', 'default.jpg');
```
extractMultiple() - Extract array of values
Extracts multiple elements:
```php
$tags = $smart->extractMultiple('tags', [
    '.tag',
    '.tags a',
    '[itemprop="keywords"]',
], function ($node) {
    return $node->text();
});
```
Site Profiles
Define site profiles in config/smart-scraper.php:
```php
'site_profiles' => [
    'amazon' => [
        'url_patterns' => [
            '/amazon\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#nav-logo' => null,
            '[data-asin]' => null,
        ],
        'selectors' => [
            'title' => [
                '#productTitle',
                'h1.a-size-large',
                'h1',
            ],
            'price' => [
                '.a-price .a-offscreen',
                '#priceblock_dealprice',
                '#priceblock_saleprice',
            ],
        ],
    ],
    'ebay' => [
        'url_patterns' => [
            '/ebay\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#gh-logo' => null,
            '[data-testid="x-item-title-label"]' => null,
        ],
        'selectors' => [
            'title' => [
                'h1[data-testid="x-item-title-label"]',
                'h1.it-ttl',
                'h1',
            ],
            'price' => [
                '.notranslate',
                '.u-flL.condText',
            ],
        ],
    ],
],
```
How Site Detection Works
- URL Pattern Matching - First, tries to match URL patterns
- HTML Pattern Matching - If URL doesn't match, analyzes HTML structure
- Selector Priority - Uses site-specific selectors first, then falls back to generic ones
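The detection order above can be sketched as follows (Python, illustrative only; the profile shape mirrors the config and HTML-marker matching is simplified to a substring check):

```python
import re


def detect_site_type(url, html, profiles):
    """URL patterns take priority; HTML markers are the fallback; None means generic."""
    # Pass 1: match the URL against each profile's URL patterns.
    for name, profile in profiles.items():
        if any(re.search(pattern, url) for pattern in profile.get("url_patterns", [])):
            return name
    # Pass 2: look for each profile's HTML markers in the page source.
    for name, profile in profiles.items():
        if any(marker in html for marker in profile.get("html_patterns", [])):
            return name
    return None  # no profile matched: fall back to generic selectors
```

A URL match is cheap and unambiguous, so it is tried first; the HTML pass covers mirrors or proxied URLs that hide the original domain.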
Example: Scraping Multiple Sites
```php
use App\Scrapers\ProductScraper;

// Works with Amazon
$amazonData = ProductScraper::scrape('https://amazon.com/product/123')->run();

// Works with eBay
$ebayData = ProductScraper::scrape('https://ebay.com/itm/123')->run();

// Works with any e-commerce site
$genericData = ProductScraper::scrape('https://example-shop.com/product/123')->run();
```
The same scraper automatically adapts to each site's structure!
Manual Site Type Detection
You can also manually set or check the site type:
```php
protected function handle(): array
{
    $siteType = $this->getSiteType(); // 'amazon', 'ebay', null, etc.

    if ($siteType === 'amazon') {
        // Amazon-specific logic
    } elseif ($siteType === 'ebay') {
        // eBay-specific logic
    }

    // Or set manually
    $this->setSiteType('custom-site');

    return [];
}
```
🔍 Monitoring & Logging
Monitoring is enabled by default. All scraping activities are logged:
// Logs are automatically created $data = ProductScraper::scrape('https://example.com/product/123')->run();
Check logs in storage/logs/laravel.log:
[2024-01-01 12:00:00] local.INFO: Scraping started {"url":"https://example.com/product/123",...}
[2024-01-01 12:00:02] local.INFO: Scraping completed {"url":"https://example.com/product/123","duration":"2.5s",...}
Configure monitoring in config/smart-scraper.php:
```php
'monitoring' => [
    'enabled' => true,
    'log_channel' => 'stack',
    'track_metrics' => true,
],
```
⚙️ Configuration
All configuration options are available in config/smart-scraper.php:
```php
return [
    'cache' => [
        'driver' => 'file',
        'ttl' => 3600,
        'prefix' => 'smart_scraper',
    ],
    'rate_limit' => [
        'enabled' => true,
        'max_requests' => 10,
        'per_seconds' => 60,
    ],
    'puppeteer' => [
        'node_path' => 'node',
        'script_path' => __DIR__ . '/../resources/js/scraper.js',
        'timeout' => 30000,
        'headless' => true,
    ],
    // ... more options
];
```
🐛 Troubleshooting
NVM Configuration
If you're using Node.js via NVM and running scrapers via scheduled tasks, Node might not be available. To fix this:
- Edit your ~/.bash_profile:

```bash
nano ~/.bash_profile
```
- Add at the top:
```bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"
```
- Reload:

```bash
source ~/.bash_profile
```
Note: It's not recommended to use NVM in production environments.
Common Issues
Issue: Puppeteer execution failed
Solution: Make sure Node.js and Puppeteer dependencies are installed:
```bash
node --version
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
Issue: Rate limit exceeded
Solution: Adjust rate limit settings or disable rate limiting:
->rateLimit(false)
Issue: Data validation failed
Solution: Check your validation schema and ensure data matches expected types.
📝 Examples
Example 1: E-commerce Product Scraper
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.product-title')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'currency' => $crawler->filter('.currency')->text(''),
            'description' => $crawler->filter('.product-description')->text(''),
            'images' => $crawler->filter('.product-images img')->each(function ($node) {
                return $node->attr('src');
            }),
            'rating' => $crawler->filter('.rating')->text(''),
            'reviews_count' => $crawler->filter('.reviews-count')->text(''),
        ];
    }
}

// Usage
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(15000)
    ->retry(3, 2)
    ->run();
```
Example 2: News Article Scraper
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class NewsScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.article-title')->text(''),
            'author' => $crawler->filter('.article-author')->text(''),
            'published_at' => $crawler->filter('.article-date')->attr('datetime'),
            'content' => $crawler->filter('.article-content')->html(),
            'tags' => $crawler->filter('.article-tags a')->each(function ($node) {
                return $node->text();
            }),
            'image' => $crawler->filter('.article-image img')->attr('src'),
        ];
    }
}

// Usage with screenshot
$data = NewsScraper::scrape('https://example.com/news/article-123')
    ->screenshot(true, storage_path('app/screenshots/article.png'))
    ->run();
```
Example 3: Multiple URLs Concurrently
```php
use App\Scrapers\ProductScraper;
use Shammaa\LaravelSmartScraper\Services\ConcurrentScraperService;

$urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
];

$concurrentScraper = new ConcurrentScraperService(maxConcurrent: 5);

$results = $concurrentScraper->scrape($urls, function ($url) {
    return ProductScraper::scrape($url)->run();
});

foreach ($results as $url => $data) {
    echo "Scraped: {$url}\n";
    print_r($data);
}
```
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
This package is open-sourced software licensed under the MIT license.
🙏 Credits
Built with ❤️ by Shadi Shammaa
Made with Laravel Smart Scraper - Professional web scraping made easy! 🚀