watercrawl/php
PHP SDK for WaterCrawl REST APIs
v1.1.0 (2025-04-30 20:04 UTC)
Requires
- php: ^7.4 || ^8.0 || ^8.1 || ^8.2 || ^8.3 || ^8.4
- ext-json: *
- ext-mbstring: *
- guzzlehttp/guzzle: ^7.5
Requires (Dev)
- phpunit/phpunit: ^9.6
- squizlabs/php_codesniffer: ^3.7
README
PHP Client for WaterCrawl REST APIs. This package provides a simple and elegant way to interact with WaterCrawl's web scraping and crawling services.
Installation
You can install the package via Composer:
```bash
composer require watercrawl/php
```
Requirements
- PHP 7.4 or higher
- ext-mbstring
- ext-json
Usage
```php
use WaterCrawl\APIClient;

// Initialize the client
$client = new APIClient('your-api-key');

// Scrape a single URL
$result = $client->scrapeUrl('https://example.com');

// Create a crawl request
$result = $client->createCrawlRequest(
    'https://example.com',
    ['allowed_domains' => ['example.com']],
    ['wait_time' => 1000]
);

// Monitor crawl progress
foreach ($client->monitorCrawlRequest($result['uuid']) as $update) {
    if ($update['type'] === 'result') {
        // Process the result
        print_r($update['data']);
    }
}
```
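All HTTP communication goes through Guzzle, so network failures and error responses surface as exceptions. A minimal defensive sketch, assuming Guzzle's standard `GuzzleException` hierarchy reaches your code unchanged (the SDK may wrap errors in its own exception types; check the source):

```php
<?php

require 'vendor/autoload.php';

use WaterCrawl\APIClient;
use GuzzleHttp\Exception\GuzzleException;

$client = new APIClient('your-api-key');

try {
    $result = $client->scrapeUrl('https://example.com');
    print_r($result);
} catch (GuzzleException $e) {
    // Connection errors and non-2xx responses from Guzzle land here.
    error_log('Scrape failed: ' . $e->getMessage());
}
```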
API Examples
Crawling Operations
List all crawl requests
```php
// Get the first page of requests (default page size: 10)
$requests = $client->getCrawlRequestsList();

// Specify page number and size
$requests = $client->getCrawlRequestsList(2, 20);
```
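To walk through every page, keep requesting pages until one comes back short. This sketch assumes the paginated response exposes its items under a `results` key; that key is an assumption about the payload shape, not a documented contract, so adjust it to match the real response:

```php
$page = 1;
$pageSize = 20;

do {
    $response = $client->getCrawlRequestsList($page, $pageSize);
    // Assumption: items are returned under 'results'.
    $items = $response['results'] ?? [];
    foreach ($items as $request) {
        echo $request['uuid'] . PHP_EOL;
    }
    $page++;
} while (count($items) === $pageSize);
```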
Get a specific crawl request
```php
$request = $client->getCrawlRequest('request-uuid');
```
Create a crawl request
```php
// Simple request with just a URL
$request = $client->createCrawlRequest('https://example.com');

// Advanced request with options
$request = $client->createCrawlRequest(
    'https://example.com',
    [
        'max_depth' => 1,        // maximum depth to crawl
        'page_limit' => 1,       // maximum number of pages to crawl
        'allowed_domains' => [], // domains allowed to be crawled
        'exclude_paths' => [],   // paths to exclude
        'include_paths' => []    // paths to include
    ],
    [
        'exclude_tags' => [],    // tags to exclude from the page
        'include_tags' => [],    // tags to include from the page
        'wait_time' => 1000,     // wait time in milliseconds after page load
        'include_html' => false, // include raw HTML in the result
        'only_main_content' => true, // only the main content of the page
        'include_links' => false,    // include links in the result
        'timeout' => 15000,      // timeout in milliseconds
        'accept_cookies_selector' => null, // selector for the accept-cookies element
        'locale' => 'en-US',     // locale
        'extra_headers' => [],   // extra HTTP headers
        'actions' => []          // actions to perform
    ],
    [] // plugin options
);
```
Stop a crawl request
```php
$client->stopCrawlRequest('request-uuid');
```
Download a crawl request result
```php
// Download the crawl request results
$results = $client->downloadCrawlRequest('request-uuid');
```
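If you want the aggregated results on disk, the return value can be serialized directly. A short sketch, assuming the payload is a JSON-serializable array:

```php
$results = $client->downloadCrawlRequest('request-uuid');

// Assumption: $results is an array that json_encode can handle.
file_put_contents('crawl-results.json', json_encode($results, JSON_PRETTY_PRINT));
```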
Monitor a crawl request
```php
// Monitor with automatic result download (default)
foreach ($client->monitorCrawlRequest('request-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Crawl state: {$event['data']['status']}\n";
    } elseif ($event['type'] === 'result') {
        echo "Received result for: {$event['data']['url']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorCrawlRequest('request-uuid', false) as $event) {
    echo "Event type: {$event['type']}\n";
}
```
Get crawl request results
```php
// Get the first page of results
$results = $client->getCrawlRequestResults('request-uuid');

// Specify page number and size
$results = $client->getCrawlRequestResults('request-uuid', 2, 20);
```
Quick URL scraping
```php
// Synchronous scraping (default)
$result = $client->scrapeUrl('https://example.com');

// With page options
$result = $client->scrapeUrl(
    'https://example.com',
    [
        'wait_time' => 1000,
        'only_main_content' => true
    ]
);

// Asynchronous scraping
$request = $client->scrapeUrl('https://example.com', [], [], false);
// Later check for results with getCrawlRequest
```
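One way to follow up on an asynchronous scrape is to poll `getCrawlRequest` until the request reaches a terminal state. The `status` field and the `finished` / `failed` values below are assumptions inferred from the monitoring examples, not documented constants; verify them against the API:

```php
// Kick off an asynchronous scrape.
$request = $client->scrapeUrl('https://example.com', [], [], false);

// Poll until done. 'finished' and 'failed' are assumed status values.
do {
    sleep(2);
    $request = $client->getCrawlRequest($request['uuid']);
} while (!in_array($request['status'], ['finished', 'failed'], true));

if ($request['status'] === 'finished') {
    $results = $client->downloadCrawlRequest($request['uuid']);
    print_r($results);
}
```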
Sitemap Operations
Download a sitemap
```php
// Download using a crawl request object
$crawlRequest = $client->getCrawlRequest('request-uuid');
$sitemap = $client->downloadSitemap($crawlRequest);

// Or download using just the UUID
$sitemap = $client->downloadSitemap('request-uuid');

// Process sitemap entries
foreach ($sitemap as $entry) {
    echo "URL: {$entry['url']}, Title: {$entry['title']}\n";
}
```
Download sitemap as graph data
```php
// Provide a crawl request UUID or a crawl request object
$graphData = $client->downloadSitemapGraph('request-uuid');
```
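The graph data is handy as input for a visualization tool. A one-line sketch that persists it, assuming the method returns a JSON-serializable structure:

```php
// Assumption: $graphData is an array/JSON-serializable value.
file_put_contents('sitemap-graph.json', json_encode($graphData, JSON_PRETTY_PRINT));
```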
Download sitemap as markdown
```php
// Provide a crawl request UUID or a crawl request object
$markdown = $client->downloadSitemapMarkdown('request-uuid');

// Save to a file
file_put_contents('sitemap.md', $markdown);
```
Search Operations
Get search requests list
```php
// Get the first page of search requests
$searchRequests = $client->getSearchRequestsList();

// Specify page number and size
$searchRequests = $client->getSearchRequestsList(2, 20);
```
Create a search request
```php
// Simple search with synchronous results
$results = $client->createSearchRequest('php programming');

// Search with options and limited results
$results = $client->createSearchRequest(
    'php tutorial',
    [
        'language' => null,     // language code, e.g. "en" or "fr"
        'country' => null,      // country code, e.g. "us" or "fr"
        'time_range' => 'any',  // "any", "hour", "day", "week", "month", "year"
        'search_type' => 'web', // search type, e.g. "web"
        'depth' => 'basic'      // "basic", "advanced", "ultimate"
    ],
    5,    // limit the number of results
    true, // wait for results
    true  // download results
);

// Asynchronous search
$searchRequest = $client->createSearchRequest(
    'machine learning',
    [],    // search options
    5,     // limit the number of results
    false, // don't wait for results
    false  // don't download results
);
```
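To pick up the asynchronous search later, you can poll `getSearchRequest` until it completes. As above, the `status` field and the `finished` value are assumptions rather than documented constants:

```php
// Follow up on the asynchronous search started above.
do {
    sleep(2);
    $searchRequest = $client->getSearchRequest($searchRequest['uuid']);
} while ($searchRequest['status'] !== 'finished'); // assumed terminal value

print_r($searchRequest);
```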
Monitor a search request
```php
// Monitor a search request
foreach ($client->monitorSearchRequest('search-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Search state: {$event['data']['status']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorSearchRequest('search-uuid', false) as $event) {
    echo "Event: " . json_encode($event) . "\n";
}
```
Get a search request
```php
$searchRequest = $client->getSearchRequest('search-uuid');
```
Stop a search request
```php
$client->stopSearchRequest('search-uuid');
```
Features
- Simple and intuitive API
- Real-time crawl monitoring
- Configurable scraping options
- Automatic response handling
- Support for sitemaps and search operations
- PHP 7.4+ compatibility
- Proper UTF-8 support
Testing
```bash
composer test
```
Compatibility
- WaterCrawl API >= 0.7.1
Changelog
Please see CHANGELOG.md for more information on what has changed recently.
Contributing
Please see CONTRIBUTING.md for details.
Security
If you discover any security-related issues, please email security@watercrawl.dev instead of using the issue tracker.
License
The MIT License (MIT). Please see the LICENSE file for more information.