wykleph/html-scraper

An API for taking JSON sitemaps generated by the webscraper.io extension, and emulating webscraper.io's scraping behavior.

v0.1.0 2016-02-04 20:25 UTC

This package is not auto-updated.

Last update: 2024-04-13 16:29:15 UTC


README

An API for taking JSON sitemaps generated by the webscraper.io extension, and emulating webscraper.io's scraping behavior in PHP.

This is great for creating scraping templates in no time at all.

I have no affiliation with webscraper.io, so please refer to their documentation and their forums for anything you might need in regards to webscraper.io.

Installation: composer require wykleph/html-scraper

Note: Child selectors are not supported yet, but they are on the docket!

To use it, require this project with Composer, then install the webscraper.io extension for Chrome. The extension is what we will use to generate the sitemap for crawling the HTML.

Once you have the extension installed, you will probably want to learn how to use it; the webscraper.io documentation covers this.

Once you have some selectors set up for your sitemap, click on Sitemap (sitemap-name)->Export Sitemap. The JSON output is what we will use to instantiate a SiteMap object:

$SiteMap = new SiteMap($json);
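
Before handing the export to SiteMap, it can help to confirm that the JSON is valid and see which selector names it defines. The following is a minimal sketch using only PHP built-ins, not part of this package. The sitemap contents, URL, and CSS selectors are made up for illustration, and the field names ("_id", "startUrl", "selectors", "id") are assumed from the shape webscraper.io exports usually have, so check them against your own export.

```php
<?php
// Hypothetical sitemap export. The field names are an assumption based on
// typical webscraper.io exports; your own export is the source of truth.
$json = <<<'JSON'
{
  "_id": "example-sitemap",
  "startUrl": "http://example.com/profile",
  "selectors": [
    {"id": "username-field-name", "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.username", "multiple": false},
    {"id": "phone", "type": "SelectorText", "parentSelectors": ["_root"], "selector": "span.phone", "multiple": false}
  ]
}
JSON;

// Decode to an associative array purely for inspection.
$decoded = json_decode($json, true);
if (json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('Invalid sitemap JSON: ' . json_last_error_msg());
}

// Collect the selector ids defined in the sitemap.
$selectorIds = array_column($decoded['selectors'], 'id');
print_r($selectorIds);
```

If decoding succeeds, the original $json string is what gets passed to new SiteMap($json) as shown above.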

The next step is to instantiate an HtmlScraper object to consume the SiteMap and the HTML you would like to crawl:

$scraper = new HtmlScraper($SiteMap, $html);
$selections = $scraper->getSelections();

or:

$selections = (new HtmlScraper($SiteMap, $html))->getSelections();

The $selections array now contains all of the selections that the sitemap produced for the given HTML.

The $selections array should also use the name of each selector that you set up with webscraper.io as its key, so accessing your selections is as easy as grabbing something like $selections['username-field-name'] or $selections['phone'].
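
Since a selector may not match anything on a given page, it can be safer to check for a key before reading it. Here is a minimal sketch of that pattern. The $selections array below is a hand-written stand-in with made-up values, and selectionOrDefault() is a hypothetical helper, not a function provided by this package. The actual shape of each value returned by getSelections() is not documented here, so treat the strings as placeholders.

```php
<?php
// Stand-in for the array returned by getSelections(); values are hypothetical.
$selections = [
    'username-field-name' => 'jdoe',
    'phone' => '555-0100',
];

// Hypothetical helper: return the selection stored under $name,
// or $default when the sitemap produced no entry with that key.
function selectionOrDefault(array $selections, $name, $default = null)
{
    return array_key_exists($name, $selections) ? $selections[$name] : $default;
}

// 'phone' exists in the stand-in array, 'email' does not.
echo selectionOrDefault($selections, 'phone'), PHP_EOL;
echo selectionOrDefault($selections, 'email', 'n/a'), PHP_EOL;
```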