scribilicious / tinylistscraper
TinyListScraper is an easy to use page list scraper. Its main goal is to scrape news and search results.
Requires
- php: >=5.6
This package is auto-updated.
Last update: 2025-06-06 11:51:22 UTC
README
TinyListScraper is an easy to use page list scraper. Its main goal is to scrape news and search results. Some features are:
- Caching of results.
- Easy to maintain and expand scrapers.
- Results contain a set of fields containing: title, link, image, text, intro text, author, source, date and tags.
Requirements
TinyListScraper 1.0 requires PHP ^5.6 || ~7.0.0 || ~7.1.0 || ~7.2.0
. PHP DOMDocument
and DOMXPath
extensions have to be loaded.
Usage
The simplest usage (since version 7.0) of the library would be as follows:
<?php require_once __DIR__ . '/src/TinyListScraper.php'; // Load all results from all the scrapers $oScraper = new TinyListScraper\TinyListScraper(); $aResults = $oScraper->get('search some text'); print_r($aResults); // Load just the results from verge $aResults = $oScraper->get('search some text', array('verge')); print_r($aResults);
Creating a new Scraper
All the scraper setting files are located in src/scrapers/
. To create a new one copy an existing scraper or create a new file. The format has to be [name].inc
.
A scraper setting file looks as followed:
<?php $aSettings = array( 'url' => 'https://techcrunch.com/search/{$SEARCH}', 'source' => 'techcrunch.com', 'list' => array( 'container' => array( 'selector' => '//div[contains(@class,"post-block post-block")]' ), 'link' => array( 'selector' => './/a[contains(@class,"post-block__title__link")]', 'attribute' => 'href' ) ), 'content' => array( 'title' => array( 'selector' => './/meta[@name="sailthru.title"]', 'attribute' => 'content' ), 'author' => array( 'selector' => './/meta[@name="sailthru.author"]', 'attribute' => 'content' ), 'date' => array( 'selector' => './/meta[@name="sailthru.date"]', 'attribute' => 'content' ), 'description' => array( 'selector' => './/meta[@property="og:description"]', 'attribute' => 'content' ), 'tags' => array( 'selector' => './/meta[@name="sailthru.tags"]', 'attribute' => 'content' ), 'image' => array( 'selector' => './/meta[@name="sailthru.image.full"]', 'attribute' => 'content' ), 'thumb' => array( 'selector' => './/meta[@name="sailthru.image.thumb"]', 'attribute' => 'content' ), 'text' => array( 'selector' => './/div[contains(@class,"article-content")]/p' ) ) );
Some important notes:
{$SEARCH} should be in the place of the URL where the search string occures.
The DOMXPath
format is being used for finding elements. For more information check http://php.net/manual/de/class.domxpath.php
Current Scrapers
- Mashable
- Lifehacker
- Verge
- Gizmodo
- Techcrunch