nws / ultra-parser
Laravel package for easy scraping of web pages
This package's canonical repository appears to be gone and the package has been frozen as a result.
Requires
- php: >=7.1
- electrolinux/phpquery: 0.9.*
- guzzlehttp/guzzle: 6.*
- laravel/framework: ~5.2|~5.2.32|~5.3.0|~5.4.0|~5.5.0|~5.6.0|~5.7.0
This package is auto-updated.
Last update: 2020-06-10 17:41:51 UTC
README
This version of the readme is deprecated. The new version is not publicly available.
This is a Laravel package that makes parsing websites very easy. Parse anything from anywhere by creating only a config file.
Composer:
composer require nws/ultra-parser
Usage
1. Publish configs
php artisan vendor:publish --tag="ultra-parser"
After this command, ultra-parser's configs will be published in your /config folder.
Check them and make your changes before running the installation.
2. Installation
php artisan ultra-parser:install
This command will generate models from your config file and run migrations.
3. Run link parser
Available arguments:
| Argument | Type | Required | Description |
|---|---|---|---|
| --key | string | + | Site key from the config file |
| --from | integer | + | Page number to start parsing from |
| --to | integer | + | Page number to parse up to |
| --config | string | - | Config file name |
| --force | empty | - | Re-parse pages that were already parsed |
| --threads | integer | - | Number of threads |
| --timeout | integer | - | Timeout between page parses |
| --debug | empty | - | Print the first parsed result |
Example
php artisan ultra-parser:links --key=ivi --from=0 --to=10 --threads=5 --force
4. Run data parser
Available arguments:
| Argument | Type | Required | Description |
|---|---|---|---|
| --key | string | + | Site key from the config file |
| --count | integer | - | Number of links to parse data from, per thread (1000) |
| --config | string | - | Config file name |
| --force | empty | - | Re-parse pages that were already parsed |
| --threads | integer | - | Number of threads |
| --timeout | integer | - | Timeout between page parses |
| --debug | empty | - | Print the first parsed result |
Example
php artisan ultra-parser:data --key=ivi --threads=5 --count=100 --force
Explanation
>Runs the data parser for the site config with key ivi.
>Runs with 5 threads and selects links where status is in (0, 2), with a limit of 100 per thread (500 links will be parsed in total).
Configuration
Configuration file example
<?php
return [
    'tables' => [
        'links' => 'parsed_links',
        'data' => 'parsed_data',
    ],
    'models' => [
        'links' => 'App\Models\ParsedLink',
        'data' => 'App\Models\ParsedData'
    ],
    'sites' => [
        'ivi' => [ // Site key
            'main_url' => 'https://www.ivi.ru/', // Site URL
            'links' => [ // Link parsing rules
                'type' => 'page', // Type (can be 'page' or 'from_to')
                'rules' => [ // Rules
                    'url' => 'https://www.ivi.ru/movies/', // Main link
                    'page_param' => 'page{$page}', // Page param pattern
                    'parse_rules' => [ // Parsing rules
                        'type' => 'html',
                        'selector' => '.poster-badge > a', // Selector of the element
                        'content' => 'href' // Attribute to read the link from
                    ]
                ]
            ],
            'data' => [ // Data parsing rules
                'image' => [ // Key
                    'type' => 'html',
                    'selector' => 'video-info', // Selector of the element
                    'content' => 'data-poster' // Attribute to read the data from
                ],
                'title' => [
                    'type' => 'html',
                    'selector' => 'video-info',
                    'content' => 'data-title'
                ],
                'description' => [ // Key
                    'type' => 'html',
                    'selector' => '.description[itemprop="description"] p', // Selector of the element
                    'content' => '#text',
                    'extra' => 'implode' // Concatenate the results into one string
                ]
            ]
        ]
    ],
    'timeout' => 5, // Sleep after parsing 1 page
    'proxy_servers' => [] // Proxy IPs
];
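The readme does not say how `url` and `page_param` are combined, so the following is only a minimal sketch of one plausible reading: the `{$page}` placeholder is replaced by the page number and the result is appended to the base URL. The helper name `buildPageUrl` is hypothetical and is not part of ultra-parser.

```php
<?php
// Hypothetical helper, not part of ultra-parser: expands a page_param
// pattern such as 'page{$page}' and appends it to the base URL.
function buildPageUrl(string $url, string $pageParam, int $page): string
{
    // Single quotes keep '{$page}' literal in the config, so the
    // placeholder has to be substituted by hand.
    $suffix = str_replace('{$page}', (string) $page, $pageParam);

    // Assumption of this sketch: the pattern is appended as a path segment.
    return rtrim($url, '/') . '/' . $suffix;
}

// Pages 1 to 3, roughly what --from=1 --to=3 might request.
foreach (range(1, 3) as $page) {
    echo buildPageUrl('https://www.ivi.ru/movies/', 'page{$page}', $page), PHP_EOL;
}
```

Note that the pattern must stay in single quotes in the config file; in double quotes PHP would try to interpolate a `$page` variable.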
There are 4 global rule types: html, regexp, json and collector.
For type html, use selector.
For type regexp, use pattern.
For type json, use key_chain and main_key.
The content rule can be a simple attribute name or text node, but more advanced configs are available too.
For example, suppose you want to parse the text nodes of the <p> and <h3> elements as data.
<div class="main_container">
<span>I don't need this</span>
<span>I don't need this</span>
<span>I don't need this</span>
<p>I NEED THIS</p>
<span>I don't need this</span>
<h3>I NEED THIS TOO</h3>
</div>
Your config must look like this
'type' => 'collector',
'from' => [
    'type' => 'html',
    'selector' => '.main_container'
],
'collect' => [
    [
        'type' => 'html',
        'selector' => 'p',
        'content' => '#text'
    ],
    [
        'type' => 'html',
        'selector' => 'h3',
        'content' => '#text'
    ]
],
'extra' => 'implode'
Or, if your HTML looks like this:
<div class="main_container">
<span>I don't need this</span>
<span>I don't need this</span>
<span>I don't need this</span>
<p>I NEED THIS</p>
<span>I don't need this</span>
<h3 data-content="I NEED THIS TOO"></h3>
</div>
Your config must look like this
'type' => 'collector',
'from' => [
    'type' => 'html',
    'selector' => '.main_container'
],
'collect' => [
    [
        'type' => 'html',
        'selector' => 'p',
        'content' => '#text' // textContent
    ],
    [
        'type' => 'html',
        'selector' => 'h3',
        'content' => 'data-content' // getAttribute('data-content')
    ]
]
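To make the collector idea concrete, here is a rough sketch of what such a rule is expected to yield. It does not use ultra-parser or phpQuery; it mimics the behaviour with PHP's built-in DOM extension. The `tag` key is a simplification invented for this sketch, standing in for the CSS `selector` used in the real config.

```php
<?php
// Sketch only: mimics a collector rule with PHP's DOM extension.
$html = <<<'HTML'
<div class="main_container">
<span>I don't need this</span>
<p>I NEED THIS</p>
<span>I don't need this</span>
<h3 data-content="I NEED THIS TOO"></h3>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser notices out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// 'from': narrow the search to the container element.
$container = $xpath->query('//div[@class="main_container"]')->item(0);

// 'collect': each entry names a child tag and where its value lives.
// 'tag' is a hypothetical stand-in for the CSS 'selector' key.
$collect = [
    ['tag' => 'p',  'content' => '#text'],
    ['tag' => 'h3', 'content' => 'data-content'],
];

$values = [];
foreach ($collect as $rule) {
    foreach ($xpath->query('.//' . $rule['tag'], $container) as $node) {
        // '#text' reads the text node; anything else is an attribute name.
        $values[] = $rule['content'] === '#text'
            ? $node->textContent
            : $node->getAttribute($rule['content']);
    }
}

print_r($values);
```

In this sketch the values come out in the order of the `collect` entries; whether the package preserves rule order or document order is not stated in this readme.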
Config example for parsing with regexp
Config
'data' => [
    'image' => [
        'type' => 'regexp',
        'pattern' => '/<img\s+class=\"image\"\s+src=\"(.*?)\"/',
        'content' => 1 // Key from preg_match_all results (capture group 1)
    ]
]
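The comment in the config indicates that `content` is a key into the `preg_match_all` results. The sketch below shows that lookup on made-up markup. It uses a lazy group `(.*?)`, which stops at the first closing quote; a greedy `.*` would run on to the last quote on the line.

```php
<?php
// Made-up markup with two images, so preg_match_all returns two matches.
$html = '<img class="image" src="/posters/1.jpg">'
      . '<img class="image" src="/posters/2.jpg">';

// (.*?) is lazy: it stops at the first closing quote after src=".
$pattern = '/<img\s+class="image"\s+src="(.*?)"/';

preg_match_all($pattern, $html, $matches);

// 'content' => 1 selects capture group 1; index 0 holds the full matches.
$content = 1;
print_r($matches[$content]);
```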
Config example for parsing links from json
Config
'links' => [ // Link parsing rules
    'type' => 'from_to',
    'rules' => [
        'url' => 'https://api.ivi.ru/mobileapi/catalogue/v5/?sort=pop&genre_operator=and&category=14&fields=id&app_version=2268&session=cc59e9105793659549597370_1567009192-0Y-qrstvPnybiFr6esx6glw',
        'from' => 'from={$from}',
        'to' => 'to={$to}',
        'parse_rules' => [
            'type' => 'json',
            'key_chain' => 'result',
            'main_key' => 'id',
            'value_prefix' => '/watch/'
        ]
    ]
]
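The readme does not spell out how `key_chain`, `main_key` and `value_prefix` interact, so the following sketch shows one plausible interpretation: `key_chain` locates the list inside the decoded JSON, `main_key` picks a field from each item, and `value_prefix` is prepended to form the link. The payload is invented, and treating `key_chain` as dot-separated is an assumption of this sketch.

```php
<?php
// Made-up payload shaped like the response this sketch assumes.
$json = '{"result": [{"id": 101}, {"id": 102}]}';

$rule = [
    'key_chain'    => 'result',
    'main_key'     => 'id',
    'value_prefix' => '/watch/',
];

$items = json_decode($json, true);

// Walk the key chain down to the list of items.
// Assumption: nested keys would be separated by dots.
foreach (explode('.', $rule['key_chain']) as $key) {
    $items = $items[$key];
}

// Build one link per item from the prefix and the item's main key.
$links = [];
foreach ($items as $item) {
    $links[] = $rule['value_prefix'] . $item[$rule['main_key']];
}

print_r($links);
```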
How to test without changing the config file every time?
You can pass your test config file name as the --config argument to the parser command.
Create the file app/parser_configs/ivi_parser.php with the following content:
<?php
return [
    'ivi' => [
        'main_url' => 'https://www.ivi.ru/', // Site URL
        'links' => [ // Link parsing rules
            'type' => 'page', // Type (can be 'page' or 'from_to')
            'rules' => [ // Rules
                'url' => 'https://www.ivi.ru/movies/', // Main link
                'page_param' => 'page{$page}', // Page param pattern
                'parse_rules' => [ // Parsing rules
                    'type' => 'html',
                    'selector' => '.poster-badge > a', // Selector of the element
                    'content' => 'href' // Attribute to read the link from
                ]
            ]
        ],
        'data' => [ // Data parsing rules
            'image' => [ // Key
                'type' => 'html',
                'selector' => 'video-info', // Selector of the element
                'content' => 'data-poster' // Attribute to read the data from
            ],
            'title' => [ // Key
                'type' => 'html',
                'selector' => 'video-info', // Selector of the element
                'content' => 'data-title' // Attribute to read the data from
            ],
            'description' => [ // Key
                'type' => 'html',
                'selector' => '.description[itemprop="description"] p', // Selector of the element
                'content' => '#text',
                'extra' => 'implode'
            ]
        ]
    ]
];
Then run the command like this:
php artisan ultra-parser:data --config=app/parser_configs/ivi_parser.php --key=ivi
Valid values for the extra field
- trim: trims whitespace
- strip_tags: strips HTML tags
- implode: concatenates the results into one string
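As a rough illustration of what these three values do, the sketch below maps each one to the PHP built-in of the same name. The function `applyExtra` is hypothetical, and the single-space separator used for `implode` is an assumption; the readme does not state which separator the package uses.

```php
<?php
// Hypothetical helper, not part of ultra-parser: applies one 'extra'
// post-processing step to a parsed value (a string or a list of strings).
function applyExtra(string $extra, $value)
{
    switch ($extra) {
        case 'trim':
            return is_array($value) ? array_map('trim', $value) : trim($value);
        case 'strip_tags':
            return is_array($value) ? array_map('strip_tags', $value) : strip_tags($value);
        case 'implode':
            // Assumption: results are joined with a single space.
            return is_array($value) ? implode(' ', $value) : $value;
        default:
            throw new InvalidArgumentException("Unknown extra: {$extra}");
    }
}

var_dump(applyExtra('trim', "  Some title \n"));
var_dump(applyExtra('strip_tags', '<b>Bold</b> text'));
var_dump(applyExtra('implode', ['First paragraph.', 'Second paragraph.']));
```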
Important!
After changing the main config file, don't forget to run:
php artisan config:clear