josephbergevin/scraper-more-faster

There is no license information available for the latest version (dev-master) of this package.

A PHP Web Scraper - Designed for Speed

dev-master 2013-12-16 16:50 UTC

This package is not auto-updated.

Last update: 2024-04-22 13:11:38 UTC


README

ScraperMoreFaster is a PHP class built to scrape the content of a webpage faster than SimpleHTMLDOM (by SourceForge). It came about when I needed a faster scraper solution for a web crawler. SimpleHTMLDOM is a wonderful parser, and very robust in it's feature set. But unfortunately too slow for crawler purposes, where every millisecond counts.

Setup

The ScraperMoreFaster.php file found in the lib folder is the only file needed to use this class.

Usage

Defining the HTML to be Parsed

$scraper_more_faster = new ScraperMoreFaster;

To define the HTML file from a URL:

$scraper_more_faster->file_get_html($url);

This will define the HTML by using the file_get_contents php command to pull in the HTML from the given URL.

To define the HTML file from a string:

$scraper_more_faster->str_get_html($html_str);

This will define the HTML simply from the string passed in the $html_str var.

Scrape PlainText from page

My biggest purpose for creating this class was for the PlainText functionality. In speed tests, I found the plaintext functionality to be dozens of times faster than SimpleHTMLDOM's plaintext functionality. And in all comparison tests, the plaintext from each tool was 99% - 100% similar.

To run this command (after defining the HTML as desribed above):

$scraper_more_faster->plaintext();

Examples

See smf_tester.php for example usage.