duzun / hquery
An extremely fast web scraper that parses megabytes of HTML in a blink of an eye. No dependencies. PHP5+
Requires
- php: >=5.3
Requires (Dev)
Suggests
- php-http/discovery: Might be required by hQuery::sendRequest()
- php-http/message: Might be required by hQuery::fromHTML($message) or hQuery::fromURL()
- php-http/socket-client: Could be used to make HTTP requests before calling hQuery::fromHTML($message)
This package is auto-updated.
Last update: 2025-01-12 11:24:22 UTC
README
An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.
You can use the familiar jQuery/CSS selector syntax to easily find the data you need.
In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my own modest tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses half the RAM.
See tests/README.md.
💡 Features
- Very fast parsing and lookup
- Parses broken HTML
- jQuery-like style of DOM traversal
- Low memory usage
- Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
- Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
- Caches response for multiple processing tasks
- PSR-7 friendly (see hQuery::fromHTML($message))
- PHP 5.3+
- No dependencies
🛠 Install
Just add this folder to your project and include_once 'hquery.php'; and you are ready to hQuery.
Alternatively, composer require duzun/hquery, or use npm install hquery.php followed by require_once 'node_modules/hquery.php/hquery.php';
⚙ Usage
Basic setup:
// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour
I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
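Since the cache path must be a writable folder, it can help to verify that before assigning it. The following is a minimal sketch, not part of hQuery: the helper name `ensureCacheDir()` and the cache location under the system temp directory are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a cache folder
// exists and is writable before handing it to hQuery::$cache_path.
function ensureCacheDir($path)
{
    // Create the folder (and missing parents) if it does not exist yet
    if (!is_dir($path) && !@mkdir($path, 0775, true) && !is_dir($path)) {
        return false;
    }
    return is_writable($path) ? $path : false;
}

// Assumed location; pick any folder your PHP process may write to
$cacheDir = ensureCacheDir(sys_get_temp_dir() . '/hquery-cache');

// Only touch hQuery if it is available (via composer or include_once)
if (class_exists('duzun\\hQuery')) {
    if ($cacheDir !== false) {
        \duzun\hQuery::$cache_path    = $cacheDir;
        \duzun\hQuery::$cache_expires = 3600; // seconds
    } else {
        // Per the setup comments above, a value of 0 disables the cache
        \duzun\hQuery::$cache_expires = 0;
    }
}
```

Falling back to a disabled cache is only one possible choice; you may prefer to fail loudly if the folder cannot be created.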
Load HTML from a file
hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )

// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);
Where $context is created with stream_context_create().
For an example of using $context to make an HTTP request through a proxy, see #26.
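As a rough illustration of the `$context` parameter, the sketch below builds a stream context that routes the request through an HTTP proxy. The proxy address `tcp://127.0.0.1:8080`, the timeout, and the target URL are placeholders, and this is not necessarily the approach discussed in #26.

```php
<?php
// Stream context options for PHP's http wrapper (standard PHP, no hQuery needed).
$options = array(
    'http' => array(
        'method'          => 'GET',
        'proxy'           => 'tcp://127.0.0.1:8080', // placeholder proxy address
        'request_fulluri' => true, // many proxies expect the full URI in the request line
        'timeout'         => 7,    // seconds
        'header'          => "Accept: text/html\r\n",
    ),
);
$context = stream_context_create($options);

// Pass the context as the third argument, as in the signature above.
// Guarded so the sketch still runs when hQuery is not loaded.
if (class_exists('duzun\\hQuery')) {
    $doc = \duzun\hQuery::fromFile('https://example.com/', false, $context);
}
```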
Load HTML from a string
hQuery::fromHTML( string $html, string $url = NULL )

$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';
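To illustrate how the document URL affects relative links, here is a small sketch that loads an HTML string and collects its link targets. It assumes hQuery was installed with composer (hence `vendor/autoload.php`); the helper name `collectLinks()`, the sample markup, and the URL are made up for the example.

```php
<?php
use duzun\hQuery;

// Assumes a composer install; use include_once 'hquery.php' otherwise
if (is_file(__DIR__ . '/vendor/autoload.php')) {
    require_once __DIR__ . '/vendor/autoload.php';
}

// Hypothetical helper: returns an array of link text => href for every <a href>
function collectLinks($html, $url)
{
    // The second argument tells hQuery where the document came from
    $doc   = hQuery::fromHTML($html, $url);
    $links = array();
    $found = $doc->find('a[href]');
    if ($found) { // the result is falsy when nothing matches
        foreach ($found as $a) {
            // ->href is the href attribute resolved to an absolute URL
            $links[trim($a->text())] = $a->href;
        }
    }
    return $links;
}

// Guarded so the file still loads when hQuery is not installed
if (class_exists('duzun\\hQuery')) {
    $html = '<html><body><a href="/about">About</a> <a href="news/today.html">News</a></body></html>';
    print_r(collectLinks($html, 'http://desired-host.net/path/index.html'));
}
```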
Load a remote HTML document
hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )

use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results.
See Guzzle, for example.
PSR-7 example:
composer require php-http/message php-http/discovery php-http/curl-client
If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.
use duzun\hQuery;
use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
    'GET',
    'http://example.com/someDoc.html',
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());
Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).
Processing the results
hQuery::find( string $sel, array|string $attr = NULL, hQuery\Node $ctx = NULL )

// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        // $a->href property is the resolved $a->attr('href') relative to the
        // document's <base href=...>, if present, or $doc->baseURL.
        $links[$pos] = $a->href; // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style'), same as $a->attr('style', true)
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src', true)
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

// The URL at which the document was requested
$requestUri = $doc->href;

// <base href=...>, if present, or the origin + dir path part from $doc->href.
// The .href and .src props are resolved using this value.
$baseURL = $doc->baseURL;
Note: In case the charset meta attribute has a wrong value or the internal conversion fails for any other reason, hQuery would ignore the error and continue processing with the original HTML, but would register an error message on $doc->html_errors['convert_encoding'].
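Because the conversion error is recorded rather than thrown, it may be worth checking for it after loading a document. A minimal sketch: `encodingError()` is a hypothetical helper (not part of hQuery), and it only assumes that `$doc->html_errors` can be read with the `'convert_encoding'` key as described in the note above. A stand-in object is used so the snippet runs on its own.

```php
<?php
// Hypothetical helper: return the recorded charset-conversion error, or null.
function encodingError($doc)
{
    $errors = $doc->html_errors;
    return isset($errors['convert_encoding']) ? $errors['convert_encoding'] : null;
}

// Stand-in object so the sketch runs without hQuery; pass a real $doc in practice
$fake = new stdClass();
$fake->html_errors = array('convert_encoding' => 'sample error message');

$error = encodingError($fake);
if ($error !== null) {
    // The document is still usable, but its text may be in the original encoding
    error_log('hQuery charset conversion failed: ' . print_r($error, true));
}
```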
🖧 Live Demo
On DUzun.Me
A lot of people ask for sources of my Live Demo page. Here we go:
view-source:https://duzun.me/playground/hquery
🏃 Run the playground
You can easily run any of the examples/ on your local machine.
All you need is PHP installed on your system.
After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options to start a web server.
Option 1:
cd hQuery.php/examples
php -S localhost:8000
# open browser http://localhost:8000/
Option 2 (browser-sync):
This option starts a live-reload server and is good for playing with the code.
npm install
gulp
# open browser http://localhost:8080/
Option 3 (VSCode):
If you are using VSCode, simply open the project and run the debugger (F5).
🔧 TODO
- Unit test everything
- Document everything
- Cookie support (implemented in mem for redirects)
- Improve selectors to be able to select by attributes
- Add more selectors
- Use HTTPlug internally
💖 Support my projects
I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).
If you like what I'm doing and this project saves you development time, please consider:
- ★ Star and Share the projects you like (and use)
- ☕ Give me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
- ₿ Send me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa (or using the QR below)