watercrawl/php
PHP SDK for WaterCrawl REST APIs
v1.1.0 (2025-04-30 20:04 UTC)
Requires
- php: ^7.4 || ^8.0 || ^8.1 || ^8.2 || ^8.3 || ^8.4
- ext-json: *
- ext-mbstring: *
- guzzlehttp/guzzle: ^7.5
Requires (Dev)
- phpunit/phpunit: ^9.6
- squizlabs/php_codesniffer: ^3.7
README
PHP Client for WaterCrawl REST APIs. This package provides a simple and elegant way to interact with WaterCrawl's web scraping and crawling services.
Installation
You can install the package via Composer:
```bash
composer require watercrawl/php
```
Requirements
- PHP 7.4 or higher
- ext-mbstring
- ext-json
Usage
```php
use WaterCrawl\APIClient;

// Initialize the client
$client = new APIClient('your-api-key');

// Scrape a single URL
$result = $client->scrapeUrl('https://example.com');

// Create a crawl request
$result = $client->createCrawlRequest(
    'https://example.com',
    ['allowed_domains' => ['example.com']],
    ['wait_time' => 1000]
);

// Monitor crawl progress
foreach ($client->monitorCrawlRequest($result['uuid']) as $update) {
    if ($update['type'] === 'result') {
        // Process the result
        print_r($update['data']);
    }
}
```
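All HTTP communication goes through Guzzle, so network failures and error responses surface as exceptions. A minimal defensive sketch, assuming Guzzle's standard `GuzzleException` hierarchy reaches your code unchanged (the SDK may wrap errors in its own exception types; check the source):

```php
<?php

require 'vendor/autoload.php';

use WaterCrawl\APIClient;
use GuzzleHttp\Exception\GuzzleException;

$client = new APIClient('your-api-key');

try {
    $result = $client->scrapeUrl('https://example.com');
    print_r($result);
} catch (GuzzleException $e) {
    // Connection errors and non-2xx responses from Guzzle land here.
    error_log('Scrape failed: ' . $e->getMessage());
}
```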
API Examples
Crawling Operations
List all crawl requests
```php
// Get the first page of requests (default page size: 10)
$requests = $client->getCrawlRequestsList();

// Specify page number and size
$requests = $client->getCrawlRequestsList(2, 20);
```
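To walk through every page, keep requesting pages until one comes back short. This sketch assumes the paginated response exposes its items under a `results` key; that key is an assumption about the payload shape, not a documented contract, so adjust it to match the real response:

```php
$page = 1;
$pageSize = 20;

do {
    $response = $client->getCrawlRequestsList($page, $pageSize);
    // Assumption: items are returned under 'results'.
    $items = $response['results'] ?? [];
    foreach ($items as $request) {
        echo $request['uuid'] . PHP_EOL;
    }
    $page++;
} while (count($items) === $pageSize);
```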
Get a specific crawl request
```php
$request = $client->getCrawlRequest('request-uuid');
```
Create a crawl request
```php
// Simple request with just a URL
$request = $client->createCrawlRequest('https://example.com');

// Advanced request with options
$request = $client->createCrawlRequest(
    'https://example.com',
    [
        'max_depth' => 1,        // maximum depth to crawl
        'page_limit' => 1,       // maximum number of pages to crawl
        'allowed_domains' => [], // domains allowed to be crawled
        'exclude_paths' => [],   // paths to exclude
        'include_paths' => []    // paths to include
    ],
    [
        'exclude_tags' => [],    // tags to exclude from the page
        'include_tags' => [],    // tags to include from the page
        'wait_time' => 1000,     // wait time in milliseconds after page load
        'include_html' => false, // include raw HTML in the result
        'only_main_content' => true, // only the main content of the page
        'include_links' => false,    // include links in the result
        'timeout' => 15000,      // timeout in milliseconds
        'accept_cookies_selector' => null, // selector for the accept-cookies element
        'locale' => 'en-US',     // locale
        'extra_headers' => [],   // extra HTTP headers
        'actions' => []          // actions to perform
    ],
    [] // plugin options
);
```
Stop a crawl request
```php
$client->stopCrawlRequest('request-uuid');
```
Download a crawl request result
```php
// Download the crawl request results
$results = $client->downloadCrawlRequest('request-uuid');
```
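If you want the aggregated results on disk, the return value can be serialized directly. A short sketch, assuming the payload is a JSON-serializable array:

```php
$results = $client->downloadCrawlRequest('request-uuid');

// Assumption: $results is an array that json_encode can handle.
file_put_contents('crawl-results.json', json_encode($results, JSON_PRETTY_PRINT));
```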
Monitor a crawl request
```php
// Monitor with automatic result download (default)
foreach ($client->monitorCrawlRequest('request-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Crawl state: {$event['data']['status']}\n";
    } elseif ($event['type'] === 'result') {
        echo "Received result for: {$event['data']['url']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorCrawlRequest('request-uuid', false) as $event) {
    echo "Event type: {$event['type']}\n";
}
```
Get crawl request results
```php
// Get the first page of results
$results = $client->getCrawlRequestResults('request-uuid');

// Specify page number and size
$results = $client->getCrawlRequestResults('request-uuid', 2, 20);
```
Quick URL scraping
```php
// Synchronous scraping (default)
$result = $client->scrapeUrl('https://example.com');

// With page options
$result = $client->scrapeUrl(
    'https://example.com',
    [
        'wait_time' => 1000,
        'only_main_content' => true
    ]
);

// Asynchronous scraping
$request = $client->scrapeUrl('https://example.com', [], [], false);
// Later check for results with getCrawlRequest
```
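One way to follow up on an asynchronous scrape is to poll `getCrawlRequest` until the request reaches a terminal state. The `status` field and the `finished` / `failed` values below are assumptions inferred from the monitoring examples, not documented constants; verify them against the API:

```php
// Kick off an asynchronous scrape.
$request = $client->scrapeUrl('https://example.com', [], [], false);

// Poll until done. 'finished' and 'failed' are assumed status values.
do {
    sleep(2);
    $request = $client->getCrawlRequest($request['uuid']);
} while (!in_array($request['status'], ['finished', 'failed'], true));

if ($request['status'] === 'finished') {
    $results = $client->downloadCrawlRequest($request['uuid']);
    print_r($results);
}
```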
Sitemap Operations
Download a sitemap
```php
// Download using a crawl request object
$crawlRequest = $client->getCrawlRequest('request-uuid');
$sitemap = $client->downloadSitemap($crawlRequest);

// Or download using just the UUID
$sitemap = $client->downloadSitemap('request-uuid');

// Process sitemap entries
foreach ($sitemap as $entry) {
    echo "URL: {$entry['url']}, Title: {$entry['title']}\n";
}
```
Download sitemap as graph data
```php
// Provide a crawl request UUID or a crawl request object
$graphData = $client->downloadSitemapGraph('request-uuid');
```
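The graph data is handy as input for a visualization tool. A one-line sketch that persists it, assuming the method returns a JSON-serializable structure:

```php
// Assumption: $graphData is an array/JSON-serializable value.
file_put_contents('sitemap-graph.json', json_encode($graphData, JSON_PRETTY_PRINT));
```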
Download sitemap as markdown
```php
// Provide a crawl request UUID or a crawl request object
$markdown = $client->downloadSitemapMarkdown('request-uuid');

// Save to a file
file_put_contents('sitemap.md', $markdown);
```
Search Operations
Get search requests list
```php
// Get the first page of search requests
$searchRequests = $client->getSearchRequestsList();

// Specify page number and size
$searchRequests = $client->getSearchRequestsList(2, 20);
```
Create a search request
```php
// Simple search with synchronous results
$results = $client->createSearchRequest('php programming');

// Search with options and limited results
$results = $client->createSearchRequest(
    'php tutorial',
    [
        'language' => null,     // language code, e.g. "en" or "fr"
        'country' => null,      // country code, e.g. "us" or "fr"
        'time_range' => 'any',  // "any", "hour", "day", "week", "month", "year"
        'search_type' => 'web', // search type, e.g. "web"
        'depth' => 'basic'      // "basic", "advanced", "ultimate"
    ],
    5,    // limit the number of results
    true, // wait for results
    true  // download results
);

// Asynchronous search
$searchRequest = $client->createSearchRequest(
    'machine learning',
    [],    // search options
    5,     // limit the number of results
    false, // don't wait for results
    false  // don't download results
);
```
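To pick up the asynchronous search later, you can poll `getSearchRequest` until it completes. As above, the `status` field and the `finished` value are assumptions rather than documented constants:

```php
// Follow up on the asynchronous search started above.
do {
    sleep(2);
    $searchRequest = $client->getSearchRequest($searchRequest['uuid']);
} while ($searchRequest['status'] !== 'finished'); // assumed terminal value

print_r($searchRequest);
```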
Monitor a search request
```php
// Monitor a search request
foreach ($client->monitorSearchRequest('search-uuid') as $event) {
    if ($event['type'] === 'state') {
        echo "Search state: {$event['data']['status']}\n";
    }
}

// Monitor without downloading results
foreach ($client->monitorSearchRequest('search-uuid', false) as $event) {
    echo "Event: " . json_encode($event) . "\n";
}
```
Get a search request
```php
$searchRequest = $client->getSearchRequest('search-uuid');
```
Stop a search request
```php
$client->stopSearchRequest('search-uuid');
```
Features
- Simple and intuitive API
- Real-time crawl monitoring
- Configurable scraping options
- Automatic response handling
- Support for sitemaps and search operations
- PHP 7.4+ compatibility
- Proper UTF-8 support
Testing
```bash
composer test
```
Compatibility
- WaterCrawl API >= 0.7.1
Changelog
Please see CHANGELOG.md for more information on what has changed recently.
Contributing
Please see CONTRIBUTING.md for details.
Security
If you discover any security-related issues, please email security@watercrawl.dev instead of using the issue tracker.
License
The MIT License (MIT). Please see the LICENSE file for more information.