fozbek / scrawler
Simple, schema based scraping tool
pkg:composer/fozbek/scrawler
Requires
- php: ^8.1
- ext-dom: *
- guzzlehttp/guzzle: ~6.0|^7.0.1
- imangazaliev/didom: ^2.0
Requires (Dev)
- phpstan/phpstan: ^0.12.34
- phpunit/phpunit: ^9
- roave/security-advisories: dev-latest
This package is auto-updated.
Last update: 2025-11-16 23:08:57 UTC
A modern, schema-based web scraping library for PHP with powerful transformers and a clean, intuitive syntax. Perfect for both manual use and API integration.
Features
- Intuitive Schema Syntax: Easy to write by hand and by AI
- Built-in Transformers: 20+ transformers for data manipulation (trim, float, int, upper, lower, etc.)
- Flexible Lists: Support for limit and offset
- JSON-Friendly: Perfect for API usage
- Type-Safe: Full PHPStan max level compliance
- Clean Architecture: SOLID principles, no anti-patterns
- Well-Tested: 47 tests, 107 assertions
Installation
composer require fozbek/scrawler
Quick Start
use Scrawler\Bootstrap;
use Scrawler\Scrawler;

// Handle PHP 8.4 deprecation warnings from vendor libraries (optional)
Bootstrap::init();

$scrawler = new Scrawler();

$schema = [
    'title' => 'h1',
    'price' => ['span.price', 'trim|float'],
    'items' => [
        'li' => [
            'text' => [null, 'trim|upper']
        ],
        'limit' => 5
    ]
];

$data = $scrawler->scrape('https://example.com', $schema);
PHP 8.4 Compatibility
If you're running PHP 8.4+, you may see deprecation warnings from vendor libraries (DiDom, Guzzle) related to implicitly nullable parameters. These are harmless but can clutter output. Use Bootstrap::init() to suppress these vendor-specific warnings:
use Scrawler\Bootstrap;

Bootstrap::init(); // Call once at the start of your script
This only suppresses deprecation warnings from vendor code, keeping your own code's warnings intact.
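To illustrate the idea of silencing only vendor deprecations, here is a minimal sketch in plain PHP. It is not the implementation of Bootstrap::init(); the function names isVendorDeprecation and suppressVendorDeprecations are hypothetical, and the "/vendor/" path check is an assumption about how vendor code could be recognized.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide whether an error should be silenced.
 * Only deprecations raised from files under a vendor directory qualify.
 */
function isVendorDeprecation(int $severity, string $file): bool
{
    $isDeprecation = ($severity & (E_DEPRECATED | E_USER_DEPRECATED)) !== 0;
    // Normalize Windows separators so the path check works on every platform.
    $normalized = str_replace('\\', '/', $file);

    return $isDeprecation && str_contains($normalized, '/vendor/');
}

/**
 * Hypothetical installer: registers a handler that swallows vendor
 * deprecations and defers everything else to PHP's standard error handling.
 */
function suppressVendorDeprecations(): void
{
    set_error_handler(
        static function (int $severity, string $message, string $file, int $line): bool {
            // Returning true marks the error as handled; false falls through to PHP.
            return isVendorDeprecation($severity, $file);
        }
    );
}
```

With a handler like this, a deprecation raised from your own src/ directory would still be reported as usual, because the callback returns false for it.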
Schema Syntax
Simple Text Extraction
$schema = [
    'title' => 'h1',
    'description' => '.content p'
];
Attribute Extraction
$schema = [
    'image' => 'img@src',
    'link' => 'a@href',
    'dataId' => 'div@data-id'
];
Extracting attributes from the current element (useful in lists):
$schema = [
    'items' => [
        '.product' => [
            'id' => '@id',           // Get id attribute from .product element
            'data' => '@data-value', // Get data-value attribute
            'name' => '.title'       // Get text from nested .title
        ]
    ]
];
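The "selector@attribute" notation can be read as a selector part and an optional attribute part. The sketch below shows one way such a string could be split; parseSelector is a hypothetical helper and not part of the package, and it does not handle an "@" that appears inside a CSS attribute selector value.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical parser for the "selector@attribute" notation.
 *
 * @return array{selector: ?string, attribute: ?string}
 */
function parseSelector(?string $expression): array
{
    // null stands for "text of the current element" in list schemas.
    if ($expression === null) {
        return ['selector' => null, 'attribute' => null];
    }

    // Split on the first "@" only: left side is the selector, right side the attribute.
    $parts = explode('@', $expression, 2);
    $selector = trim($parts[0]);
    $attribute = isset($parts[1]) ? trim($parts[1]) : null;

    return [
        // An empty selector (as in "@id") targets the current element.
        'selector' => $selector === '' ? null : $selector,
        'attribute' => $attribute,
    ];
}
```

For example, parseSelector('img@src') yields the selector 'img' with the attribute 'src', while parseSelector('@id') yields no selector and the attribute 'id'.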
Transformers
Apply transformations using pipe-separated transformer names:
$schema = [
    'price' => ['span.price', 'trim|float'],
    'name' => ['.product-name', 'trim|upper'],
    'url' => ['a@href', 'urldecode']
];
Available Transformers:
Type Conversions:
- int, float, bool, string
String Operations:
- trim, ltrim, rtrim
- upper, lower, ucfirst, ucwords
- strip_tags
URL/Path:
- basename, dirname
- urlencode, urldecode
Parsing:
- json: decode JSON strings
- timestamp: convert dates to Unix timestamp
Utility:
- abs: absolute value
- md5, sha1: hashing
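A pipe-separated transformer string can be thought of as a chain applied left to right. The following sketch shows that idea in plain PHP; applyTransformers and its small registry are hypothetical, map only a few of the names above to PHP built-ins, and may differ from how the package implements its transformers.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical pipeline: apply pipe-separated transformer names left to right.
 * Only a handful of the listed names are mapped here, each to a PHP built-in.
 */
function applyTransformers(mixed $value, string $pipeline): mixed
{
    $registry = [
        'trim'      => static fn (mixed $v): string => trim((string) $v),
        'upper'     => static fn (mixed $v): string => strtoupper((string) $v),
        'lower'     => static fn (mixed $v): string => strtolower((string) $v),
        'ucwords'   => static fn (mixed $v): string => ucwords((string) $v),
        'int'       => static fn (mixed $v): int => (int) $v,
        'float'     => static fn (mixed $v): float => (float) $v,
        'urldecode' => static fn (mixed $v): string => urldecode((string) $v),
    ];

    foreach (explode('|', $pipeline) as $name) {
        $name = trim($name);
        if (!isset($registry[$name])) {
            throw new InvalidArgumentException("Unknown transformer: {$name}");
        }
        // Each step receives the output of the previous one.
        $value = $registry[$name]($value);
    }

    return $value;
}
```

Order matters: 'trim|float' trims first and converts second. Note that a plain PHP float cast returns 0.0 for a string with a leading currency symbol such as "$59.99"; how the package's own float transformer treats such input is not covered by this sketch.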
Lists (New Syntax)
Simple list:
$schema = [
    'items' => [
        'li' => [
            'text' => null // Current element text
        ]
    ]
];
List with transformers:
$schema = [
    'products' => [
        '.product' => [
            'name' => ['.name', 'trim|ucwords'],
            'price' => ['.price', 'trim|float']
        ]
    ]
];
List with limit and offset:
$schema = [
    'items' => [
        'li' => ['text' => null],
        'limit' => 10, // Keep at most 10 items (counted after the offset)
        'offset' => 5  // Skip the first 5 matches
    ]
];
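The "Scraping Lists with Limits" example (five items, offset 1, limit 3, result 2 to 4) suggests that the offset is applied before the limit. Under that assumption, the behavior matches PHP's array_slice; sliceMatches below is a hypothetical helper for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring the list options: skip $offset matches,
 * then keep at most $limit of the remainder (null means no limit).
 *
 * @param list<mixed> $matches
 * @return list<mixed>
 */
function sliceMatches(array $matches, int $offset = 0, ?int $limit = null): array
{
    // array_slice reindexes list-style arrays, so the result starts at key 0.
    return array_slice($matches, $offset, $limit);
}
```

A limit larger than the number of remaining matches simply returns what is left.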
Old syntax still supported:
$schema = [
    'items' => [
        'list-selector' => 'li',
        'content' => [
            'text' => null
        ]
    ]
];
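Judging from the two examples, the old form appears to map onto the new one by using the 'list-selector' value as the key and the 'content' array as its value. The sketch below performs that rewrite; toNewListSyntax is a hypothetical helper, and the equivalence is inferred from the examples rather than confirmed by the package.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical converter from the old list syntax to the new one.
 *
 * @param array{list-selector: string, content: array<string, mixed>} $old
 * @return array<string, array<string, mixed>>
 */
function toNewListSyntax(array $old): array
{
    if (!isset($old['list-selector'], $old['content'])) {
        throw new InvalidArgumentException('Expected "list-selector" and "content" keys.');
    }

    // The selector becomes the key; the content schema becomes its value.
    return [$old['list-selector'] => $old['content']];
}
```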
Nested Lists
$schema = [
    'categories' => [
        '.category' => [
            'name' => '.category-name',
            'products' => [
                '.product' => [
                    'name' => ['.name', 'trim'],
                    'price' => ['.price', 'trim|float']
                ],
                'limit' => 5
            ]
        ]
    ]
];
Examples
Scraping with Transformers
$html = '
<div class="product">
    <h2> wireless headphones </h2>
    <span class="price"> $59.99 </span>
    <a href="/products/item%20123">Details</a>
</div>
';

$schema = [
    'name' => ['h2', 'trim|ucwords'],
    'price' => ['.price', 'trim|float'],
    'url' => ['a@href', 'urldecode']
];

$result = $scrawler->scrape($html, $schema, true);

// Output:
// [
//     'name' => 'Wireless Headphones',
//     'price' => 59.99,
//     'url' => '/products/item 123'
// ]
Scraping Lists with Limits
$html = '<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li>';

$schema = [
    'items' => [
        'li' => ['text' => null],
        'offset' => 1,
        'limit' => 3
    ]
];

$result = $scrawler->scrape($html, $schema, true);

// Output: ['items' => [['text' => '2'], ['text' => '3'], ['text' => '4']]]
Complex Real-World Example
$schema = [
    'title' => ['h1', 'trim|upper'],
    'author' => '.meta .author',
    'publishedAt' => ['.meta .date', 'timestamp'],
    'content' => ['.content', 'trim|strip_tags'],
    'tags' => [
        '.tag' => [
            'name' => [null, 'trim|lower'],
            'url' => ['a@href', 'urldecode']
        ],
        'limit' => 10
    ]
];
JSON API Usage
The schema syntax is designed to work seamlessly with JSON:
{
    "title": ["h1", "trim|upper"],
    "price": ["span.price", "trim|float"],
    "products": {
        ".product": {
            "name": [".name", "trim"],
            "price": [".price", "trim|float"]
        },
        "limit": 10,
        "offset": 0
    }
}
Note: Callbacks and filtering should be handled by the API consumer after receiving the data.
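As a sketch of that workflow, a JSON schema can be decoded into the same array shape as a hand-written PHP schema, and filtering can then run on the returned data. The JSON string and the result array below are made-up sample data, and the result shape is assumed from the list examples in this README.

```php
<?php
declare(strict_types=1);

// A schema as it might arrive in an API request body (sample data).
$json = '{"title": ["h1", "trim|upper"], "products": {".product": '
    . '{"name": [".name", "trim"], "price": [".price", "trim|float"]}, "limit": 10}}';

// Decode to associative arrays; JSON_THROW_ON_ERROR raises JsonException on bad input.
$schema = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// $schema now has the same shape as a hand-written PHP schema and could be
// passed to $scrawler->scrape($url, $schema).

// Sample result in the shape the list examples show (made-up values).
$result = [
    'title' => 'CATALOG',
    'products' => [
        ['name' => 'Headphones', 'price' => 59.99],
        ['name' => 'Cable', 'price' => 4.5],
    ],
];

// Filtering happens on the consumer side, after scraping.
$affordable = array_values(array_filter(
    $result['products'],
    static fn (array $product): bool => $product['price'] < 10.0
));
```

array_values reindexes the filtered list so that it serializes back to a JSON array rather than an object with numeric keys.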
Custom HTTP Client
use GuzzleHttp\Client;
use Scrawler\Scrawler;

$client = new Client([
    'timeout' => 30,
    'headers' => ['User-Agent' => 'My Bot/1.0'],
    'proxy' => 'http://proxy.example.com:8080'
]);

$scrawler = new Scrawler($client);
Testing
# Run all tests
composer test

# Run specific test
./vendor/bin/phpunit tests/ScrawlerNewSyntaxTest.php

# With coverage
composer coverage
Static Analysis
composer analyse
PHPStan Level: Max (strictest)
Requirements
- PHP 8.1 or higher
- ext-dom
- Guzzle 6.x or 7.0.1+
- DiDom 2.0+
License
MIT License - see LICENSE
Contributing
Contributions welcome! Please ensure:
- All tests pass
- PHPStan analysis passes
- Code follows PSR-12
Author
Fatih Özbek - mail@fatih.dev