shammaa / laravel-smart-scraper
Advanced intelligent web scraper for Laravel with caching, rate limiting, middleware, monitoring, and much more. Built on Puppeteer with smart features.
pkg:composer/shammaa/laravel-smart-scraper
Requires
- php: ^8.1
- illuminate/console: ^9.0|^10.0|^11.0|^12.0
- illuminate/queue: ^9.0|^10.0|^11.0|^12.0
- illuminate/support: ^9.0|^10.0|^11.0|^12.0
- symfony/css-selector: ^5.0|^6.0|^7.0
- symfony/dom-crawler: ^5.0|^6.0|^7.0
Requires (Dev)
- orchestra/testbench: ^8.0|^9.0
- phpunit/phpunit: ^10.0
README
Advanced intelligent web scraper for Laravel with caching, rate limiting, middleware, monitoring, and much more. Built on Puppeteer with smart features that make web scraping professional, efficient, and reliable.
🚀 Key Features
- ✅ Intelligent Caching - Automatic caching to avoid redundant requests
- ✅ Rate Limiting - Prevent overwhelming target websites
- ✅ User-Agent Rotation - Rotate user agents automatically to avoid detection
- ✅ Middleware System - Extensible middleware for request manipulation
- ✅ Automatic Retry - Exponential backoff retry logic for failed requests
- ✅ Screenshot & PDF - Capture screenshots and generate PDFs
- ✅ Proxy Support - Full proxy support with authentication
- ✅ Monitoring & Logging - Comprehensive monitoring and logging
- ✅ Schema Validation - Validate extracted data against schemas
- ✅ Concurrent Scraping - Scrape multiple URLs concurrently
- ✅ Queue Support - Process scraping jobs in background
- ✅ Error Handling - Robust error handling and recovery
- ✅ Smart Site Detection - Automatically detect site type and use appropriate selectors
- ✅ Multi-Site Support - Handle multiple websites with different HTML structures intelligently
📦 Installation
Install the package via Composer:
composer require shammaa/laravel-smart-scraper
Publish the configuration file:
php artisan vendor:publish --tag=smart-scraper-config
Install Node.js dependencies (required for Puppeteer):
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
Note: Make sure Node.js is installed and available in your PATH. If using NVM, see the NVM Configuration section below.
🎯 Quick Start
1. Create a Scraper
Generate a scraper class using the Artisan command:
php artisan make:scraper ProductScraper
This creates a file at app/Scrapers/ProductScraper.php:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'description' => $crawler->filter('.description')->text(''),
        ];
    }
}
```
2. Use the Scraper
```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(10000)
    ->run();

dd($data);
```
📚 Basic Usage
Simple Scraping
```php
use App\Scrapers\ProductScraper;

$data = ProductScraper::scrape('https://example.com/product/123')->run();
```
With Options
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(20000)                        // 20 seconds timeout
    ->proxy('ip:port', 'user', 'pass')      // Use proxy
    ->headers(['Accept-Language' => 'en'])  // Custom headers
    ->retry(3, 5)                           // Retry 3 times, wait 5 seconds
    ->cache(false)                          // Disable caching
    ->run();
```
With Parameters
You can pass parameters to the handle() method:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(string $selector = 'h1'): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter($selector)->text(''),
        ];
    }
}
```
Then use it:
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->run(selector: '.product-title');
```
🔧 Advanced Features
Caching
The scraper automatically caches results to avoid redundant requests:
```php
// Enable caching (default)
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(true)
    ->run();

// Disable caching
$data = ProductScraper::scrape('https://example.com/product/123')
    ->cache(false)
    ->run();
```
Cache TTL can be configured in config/smart-scraper.php:
```php
'cache' => [
    'ttl' => 3600, // 1 hour
],
```
Rate Limiting
Prevent overwhelming target websites with rate limiting:
```php
// Rate limiting is enabled by default
$data = ProductScraper::scrape('https://example.com/product/123')->run();

// Disable rate limiting
$data = ProductScraper::scrape('https://example.com/product/123')
    ->rateLimit(false)
    ->run();
```
Configure rate limits in config/smart-scraper.php:
```php
'rate_limit' => [
    'enabled' => true,
    'max_requests' => 10, // Max 10 requests
    'per_seconds' => 60,  // Per 60 seconds
],
```
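The `max_requests` / `per_seconds` pair describes a windowed limiter: at most N requests within any rolling window of that many seconds. A minimal, language-agnostic sketch of the idea in Python (illustrative only, not the package's implementation):

```python
import time
from collections import deque


class WindowRateLimiter:
    """Allow at most max_requests within any rolling per_seconds window."""

    def __init__(self, max_requests: int, per_seconds: float):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.timestamps = deque()  # times of recently allowed requests

    def allow(self, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        # Evict timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.per_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```

With the defaults above (10 requests per 60 seconds), the 11th request inside a window is rejected until the oldest one ages out.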
User-Agent Rotation
User agents are automatically rotated to avoid detection:
// Rotation is enabled by default $data = ProductScraper::scrape('https://example.com/product/123')->run();
Configure user agents in config/smart-scraper.php:
```php
'user_agent' => [
    'rotation_enabled' => true,
    'agents' => [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
        // Add more user agents
    ],
],
```
Proxy Support
Full proxy support with authentication:
```php
// Without authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200')
    ->run();

// With authentication
$data = ProductScraper::scrape('https://example.com/product/123')
    ->proxy('200.20.14.84:40200', 'username', 'password')
    ->run();
```
Retry Logic
Automatic retry with exponential backoff:
```php
// Retry 3 times, wait 1 second between attempts
$data = ProductScraper::scrape('https://example.com/product/123')
    ->retry(3, 1)
    ->run();
```
Configure default retry settings in config/smart-scraper.php:
```php
'retry' => [
    'enabled' => true,
    'max_attempts' => 3,
    'initial_delay' => 1,       // seconds
    'max_delay' => 60,          // seconds
    'backoff_multiplier' => 2,  // Exponential backoff
    'retryable_status_codes' => [408, 429, 500, 502, 503, 504],
],
```
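Under exponential backoff with these settings, the wait before each retry grows as `initial_delay * backoff_multiplier^(attempt - 1)`, capped at `max_delay`. A quick illustrative sketch of the resulting schedule (Python, not the package's internals):

```python
def backoff_delays(max_attempts=3, initial_delay=1, backoff_multiplier=2, max_delay=60):
    """Delay in seconds before each attempt: exponential growth, capped at max_delay."""
    return [
        min(initial_delay * backoff_multiplier ** i, max_delay)
        for i in range(max_attempts)
    ]
```

With the defaults the delays are 1s, 2s, 4s; with more attempts the schedule would continue 8, 16, 32, then flatten at the 60-second cap.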
Screenshots
Capture screenshots of web pages:
```php
// Save screenshot to file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true, storage_path('app/screenshots/product.png'))
    ->run();

// Get screenshot as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->screenshot(true)
    ->run();

$screenshotBase64 = $data['screenshot'] ?? null;
```
PDF Generation
Generate PDFs from web pages:
```php
// Save PDF to file
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true, storage_path('app/pdfs/product.pdf'))
    ->run();

// Get PDF as base64
$data = ProductScraper::scrape('https://example.com/product/123')
    ->pdf(true)
    ->run();

$pdfBase64 = $data['pdf'] ?? null;
```
Custom Headers
Add custom headers to requests:
```php
$data = ProductScraper::scrape('https://example.com/product/123')
    ->headers([
        'Accept-Language' => 'en-US,en;q=0.9',
        'Accept' => 'text/html,application/xhtml+xml',
        'X-Custom-Header' => 'value',
    ])
    ->run();
```
Data Validation
Validate extracted data against a schema:
```php
use Shammaa\LaravelSmartScraper\Services\SchemaValidatorService;

$data = ProductScraper::scrape('https://example.com/product/123')
    ->validate(function ($data) {
        $validator = new SchemaValidatorService();

        return $validator->validate($data, [
            'title' => ['required' => true, 'type' => 'string'],
            'price' => ['required' => true, 'type' => 'string'],
            'description' => ['required' => false, 'type' => 'string'],
        ]);
    })
    ->run();
```
Middleware
Create custom middleware to modify requests:
```php
use Shammaa\LaravelSmartScraper\Contracts\MiddlewareInterface;

class CustomHeaderMiddleware implements MiddlewareInterface
{
    public function handle(array $options): array
    {
        $options['headers']['X-Custom'] = 'value';

        return $options;
    }
}

// Use middleware
$data = ProductScraper::scrape('https://example.com/product/123')
    ->middleware(new CustomHeaderMiddleware())
    ->run();
```
🎨 Artisan Commands
Create a Scraper
php artisan make:scraper ProductScraper
List All Scrapers
php artisan list:scrapers
Test a Scraper
php artisan scraper:test "App\Scrapers\ProductScraper" "https://example.com/product/123"
🧠 Smart Site Detection & Multi-Site Support
The scraper can intelligently detect different websites and automatically use the appropriate selectors for each site. This means you can scrape multiple websites with different HTML structures using the same scraper class!
How It Works
- Automatic Site Detection - The scraper detects the site type from URL patterns or HTML patterns
- Smart Selectors - Uses site-specific selectors if available, falls back to generic selectors
- Fallback System - If a selector fails, it automatically tries the next one
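The fallback behaviour boils down to "the first selector that yields a non-empty result wins". Sketched in Python with a hypothetical `query` callable standing in for the real DOM lookup (illustrative only):

```python
def extract_with_fallback(query, selectors, default=None):
    """Return the first non-empty result among the candidate selectors.

    `query` is any callable mapping a CSS selector to extracted text
    (empty string on a miss); it stands in for the real DOM lookup.
    """
    for selector in selectors:
        value = query(selector)
        if value:
            return value
    return default
```

So a selector list like `['h1', '.title', '[itemprop="name"]']` degrades gracefully: site-specific selectors are simply placed earlier in the list than generic ones.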
Creating a Smart Scraper
Use the --smart flag when creating a scraper:
php artisan make:scraper ProductScraper --smart
This creates a scraper with smart selectors:
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $smart = $this->smart();

        return [
            // Smart extraction - tries multiple selectors automatically
            'title' => $smart->extract('title', [
                'h1',
                '.title',
                '[itemprop="name"]',
                'title',
            ]),
            'price' => $smart->extract('price', [
                '.price',
                '[itemprop="price"]',
                '.amount',
                '.cost',
            ]),
            'image' => $smart->extractAttribute('image', [
                'img.main-image',
                '.product-image img',
                '[itemprop="image"]',
                'img',
            ], 'src'),
            'description' => $smart->extract('description', [
                '.description',
                '[itemprop="description"]',
                '.content',
                'p',
            ]),
        ];
    }
}
```
Smart Selector Methods
extract() - Extract text content
Tries multiple selectors until one works:
```php
$title = $smart->extract('title', [
    'h1.product-title',
    'h1',
    '.title',
    '[itemprop="name"]',
], 'Default Title');
```
extractAttribute() - Extract attribute value
Tries multiple selectors to extract an attribute:
```php
$image = $smart->extractAttribute('image', [
    'img.main-image',
    '.product-image img',
    '[itemprop="image"]',
], 'src', 'default.jpg');
```
extractMultiple() - Extract array of values
Extracts multiple elements:
```php
$tags = $smart->extractMultiple('tags', [
    '.tag',
    '.tags a',
    '[itemprop="keywords"]',
], function ($node) {
    return $node->text();
});
```
Site Profiles
Define site profiles in config/smart-scraper.php:
```php
'site_profiles' => [
    'amazon' => [
        'url_patterns' => [
            '/amazon\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#nav-logo' => null,
            '[data-asin]' => null,
        ],
        'selectors' => [
            'title' => [
                '#productTitle',
                'h1.a-size-large',
                'h1',
            ],
            'price' => [
                '.a-price .a-offscreen',
                '#priceblock_dealprice',
                '#priceblock_saleprice',
            ],
        ],
    ],
    'ebay' => [
        'url_patterns' => [
            '/ebay\.(com|co\.uk|de|fr|it|es|ca|com\.au)/',
        ],
        'html_patterns' => [
            '#gh-logo' => null,
            '[data-testid="x-item-title-label"]' => null,
        ],
        'selectors' => [
            'title' => [
                'h1[data-testid="x-item-title-label"]',
                'h1.it-ttl',
                'h1',
            ],
            'price' => [
                '.notranslate',
                '.u-flL.condText',
            ],
        ],
    ],
],
```
How Site Detection Works
- URL Pattern Matching - First, tries to match URL patterns
- HTML Pattern Matching - If URL doesn't match, analyzes HTML structure
- Selector Priority - Uses site-specific selectors first, then falls back to generic ones
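The detection order above can be sketched as follows (Python, illustrative only; the profile shape mirrors the config and HTML-marker matching is simplified to a substring check):

```python
import re


def detect_site_type(url, html, profiles):
    """URL patterns take priority; HTML markers are the fallback; None means generic."""
    # Pass 1: match the URL against each profile's URL patterns.
    for name, profile in profiles.items():
        if any(re.search(pattern, url) for pattern in profile.get("url_patterns", [])):
            return name
    # Pass 2: look for each profile's HTML markers in the page source.
    for name, profile in profiles.items():
        if any(marker in html for marker in profile.get("html_patterns", [])):
            return name
    return None  # no profile matched: fall back to generic selectors
```

A URL match is cheap and unambiguous, so it is tried first; the HTML pass covers mirrors or proxied URLs that hide the original domain.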
Example: Scraping Multiple Sites
```php
use App\Scrapers\ProductScraper;

// Works with Amazon
$amazonData = ProductScraper::scrape('https://amazon.com/product/123')->run();

// Works with eBay
$ebayData = ProductScraper::scrape('https://ebay.com/itm/123')->run();

// Works with any e-commerce site
$genericData = ProductScraper::scrape('https://example-shop.com/product/123')->run();
```
The same scraper automatically adapts to each site's structure!
Manual Site Type Detection
You can also manually set or check the site type:
```php
protected function handle(): array
{
    $siteType = $this->getSiteType(); // 'amazon', 'ebay', null, etc.

    if ($siteType === 'amazon') {
        // Amazon-specific logic
    } elseif ($siteType === 'ebay') {
        // eBay-specific logic
    }

    // Or set manually
    $this->setSiteType('custom-site');

    return [];
}
```
🔍 Monitoring & Logging
Monitoring is enabled by default. All scraping activities are logged:
// Logs are automatically created $data = ProductScraper::scrape('https://example.com/product/123')->run();
Check logs in storage/logs/laravel.log:
[2024-01-01 12:00:00] local.INFO: Scraping started {"url":"https://example.com/product/123",...}
[2024-01-01 12:00:02] local.INFO: Scraping completed {"url":"https://example.com/product/123","duration":"2.5s",...}
Configure monitoring in config/smart-scraper.php:
```php
'monitoring' => [
    'enabled' => true,
    'log_channel' => 'stack',
    'track_metrics' => true,
],
```
⚙️ Configuration
All configuration options are available in config/smart-scraper.php:
```php
return [
    'cache' => [
        'driver' => 'file',
        'ttl' => 3600,
        'prefix' => 'smart_scraper',
    ],
    'rate_limit' => [
        'enabled' => true,
        'max_requests' => 10,
        'per_seconds' => 60,
    ],
    'puppeteer' => [
        'node_path' => 'node',
        'script_path' => __DIR__ . '/../resources/js/scraper.js',
        'timeout' => 30000,
        'headless' => true,
    ],
    // ... more options
];
```
🐛 Troubleshooting
NVM Configuration
If you're using Node.js via NVM and running scrapers via scheduled tasks, Node might not be available. To fix this:
- Edit your ~/.bash_profile:

```bash
nano ~/.bash_profile
```
- Add at the top:
```bash
export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh"
[ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion"
```
- Reload:

```bash
source ~/.bash_profile
```
Note: It's not recommended to use NVM in production environments.
Common Issues
Issue: Puppeteer execution failed
Solution: Make sure Node.js and Puppeteer dependencies are installed:
```bash
node --version
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
```
Issue: Rate limit exceeded
Solution: Adjust rate limit settings or disable rate limiting:
->rateLimit(false)
Issue: Data validation failed
Solution: Check your validation schema and ensure data matches expected types.
📝 Examples
Example 1: E-commerce Product Scraper
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class ProductScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.product-title')->text(''),
            'price' => $crawler->filter('.price')->text(''),
            'currency' => $crawler->filter('.currency')->text(''),
            'description' => $crawler->filter('.product-description')->text(''),
            'images' => $crawler->filter('.product-images img')->each(function ($node) {
                return $node->attr('src');
            }),
            'rating' => $crawler->filter('.rating')->text(''),
            'reviews_count' => $crawler->filter('.reviews-count')->text(''),
        ];
    }
}

// Usage
$data = ProductScraper::scrape('https://example.com/product/123')
    ->timeout(15000)
    ->retry(3, 2)
    ->run();
```
Example 2: News Article Scraper
```php
<?php

namespace App\Scrapers;

use Shammaa\LaravelSmartScraper\Scraper;

class NewsScraper extends Scraper
{
    protected function handle(): array
    {
        $crawler = $this->getCrawler();

        return [
            'title' => $crawler->filter('h1.article-title')->text(''),
            'author' => $crawler->filter('.article-author')->text(''),
            'published_at' => $crawler->filter('.article-date')->attr('datetime'),
            'content' => $crawler->filter('.article-content')->html(),
            'tags' => $crawler->filter('.article-tags a')->each(function ($node) {
                return $node->text();
            }),
            'image' => $crawler->filter('.article-image img')->attr('src'),
        ];
    }
}

// Usage with screenshot
$data = NewsScraper::scrape('https://example.com/news/article-123')
    ->screenshot(true, storage_path('app/screenshots/article.png'))
    ->run();
```
Example 3: Multiple URLs Concurrently
```php
use App\Scrapers\ProductScraper;
use Shammaa\LaravelSmartScraper\Services\ConcurrentScraperService;

$urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3',
];

$concurrentScraper = new ConcurrentScraperService(maxConcurrent: 5);

$results = $concurrentScraper->scrape($urls, function ($url) {
    return ProductScraper::scrape($url)->run();
});

foreach ($results as $url => $data) {
    echo "Scraped: {$url}\n";
    print_r($data);
}
```
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
This package is open-sourced software licensed under the MIT license.
🙏 Credits
Built with ❤️ by Shadi Shammaa
Made with Laravel Smart Scraper - Professional web scraping made easy! 🚀