juanparati / phpscraper
A PHP scraper
Requires
- ext-dom: *
- ext-json: *
- amphp/amp: ^2.3
- amphp/artax: ^3.0
- league/climate: ^3.5
- symfony/dom-crawler: ^4.3
- symfony/yaml: ^4.3
This package is auto-updated.
Last update: 2024-11-18 07:34:29 UTC
README
1. What is it?
A command line tool used for extract and format content from webpages. It's suitable for extract like:
- product catalogs
- reviews
- lists
- etc
The output is formatted as JSON lines.
2. How it works?
-
Create a scraper receipt (see recipes)
-
Type:
phpscraper config url
The following example will extract all the reviews with the user name, comment and rating from Amazon:
phpscraper recipes/amazon.yml https://www.amazon.de/product-reviews/B000J34HN4/ref=acr_dpx_hist_3?ie=UTF8
For see additional options just type:
phpscraper --help
3. Recipes
Recipes are YML files that describe in a structure way how to extract the content from the pages. The recipes uses XPath routes in order to instruct which elements are extracted.
Example of recipe that extract comments from Amazon reviews:
project: "Amazon reviews extractor"
pagination:
next_xpath: "//li[@class='a-last']/a/@href"
extraction:
product:
xpath: "//h1/a[@class='a-link-normal']"
extract_as: "product"
in_memory: true
comments:
xpath: "//div[@class='a-section celwidget']"
subelements:
product:
from_memory: "product"
extract_as: "product"
name:
xpath: "//span[@class='a-profile-name']"
extract_as: "name"
rate:
xpath: "//i[contains(@class, 'a-icon-star')]/span"
extract_as: "rating"
extract_regex: "/^.{0,3}/"
cast_as: float
comment:
xpath: "//span[@class='a-size-base review-text review-text-content']"
extract_as: "comment"
verified:
xpath: "//span[@class='a-size-mini a-color-state a-text-bold']"
extract_as: "verified"
cast_as: boolean
3.1 The pagination section
It defines where the "next page" button is located. In case that this element is not found then scraper then it will finish the process until the current page is extracted.
3.2 The extraction section
It defines which elements are going to be extracted. It uses a cascade structure so its possible to define the parent and child elements.
The possible instructions for the extraction section are:
- xpath: It defines the element that is going to be extracted using a XPath expression. The expression will become automatically relative to the parent element if exists.
- subelements: It defines the child elements.
- extract_as: This is always required and it sets the field name of the element to extract.
- extract_regex: Regular expression used for extract the extracted content.
- cast_as: Cast the extracted data, so the JSON output will reflect the right data type. Possible casts are "boolean", "int", "float" and "string"
- in_memory: It used when we want to save temporally in the memory so we can use for example inside of another extraction thread.
5. Installation
PHPscraper can be installed in different ways:
A) Download the last build from Github or B) Just type "composer global require juanparati/phpscraper"
5. How to build my own package:
-
Download Caveman (The Mamuph Helper Tool)
-
Clone this project
-
Inside the project directory type:
caveman build . -x -r