nws/ultra-parser

Laravel package for easy scraping of web pages

2.2.3 2019-05-06 10:06 UTC

README

This version of the README is deprecated. The new version is not publicly available.

This is a Laravel package that makes parsing websites very easy. Parse anything from anywhere by creating only a config file.

Composer:

composer require nws/ultra-parser

Usage

1. Publish configs

php artisan vendor:publish --tag="ultra-parser"

After this command, ultra-parser's configs will be published in your /config folder.

Review them and make your changes before running the installation.

2. Installation

php artisan ultra-parser:install

This command creates models from your config file and runs the migrations.

3. Run link parser

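Available arguments:

| Argument  | Type    | Required | Description                             |
|-----------|---------|----------|-----------------------------------------|
| --key     | string  | yes      | Site key from the config file           |
| --from    | integer | yes      | Page number to start parsing from       |
| --to      | integer | yes      | Page number to parse up to              |
| --config  | string  | no       | Config file name                        |
| --force   | empty   | no       | Re-parse pages that were already parsed |
| --threads | integer | no       | Number of threads                       |
| --timeout | integer | no       | Timeout between parsing pages           |
| --debug   | empty   | no       | Print the first parsed result           |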

Example

php artisan ultra-parser:links --key=ivi --from=0 --to=10 --threads=5 --force

4. Run data parser

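Available arguments:

| Argument  | Type    | Required | Description                                            |
|-----------|---------|----------|--------------------------------------------------------|
| --key     | string  | yes      | Site key from the config file                          |
| --count   | integer | no       | Number of links to parse data from, per thread (1000)  |
| --config  | string  | no       | Config file name                                       |
| --force   | empty   | no       | Re-parse pages that were already parsed                |
| --threads | integer | no       | Number of threads                                      |
| --timeout | integer | no       | Timeout between parsing pages                          |
| --debug   | empty   | no       | Print the first parsed result                          |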

Example

php artisan ultra-parser:data --key=ivi --threads=5 --count=100 --force

Explanation

Runs the data parser for the site config with the key ivi.

Runs with 5 threads and selects links where status is in (0, 2), with a limit of 100 per thread (so up to 500 links will be parsed in total).
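
The arithmetic in the explanation above can be sketched in plain PHP. This only illustrates how 5 threads with a per-thread limit of 100 add up to at most 500 links. The variable names, the sample IDs, and the array_chunk split are assumptions, not the package's actual implementation.

```php
<?php

// Hypothetical illustration: values taken from the example command above.
$threads = 5;          // --threads=5
$countPerThread = 100; // --count=100

// Pretend these are the IDs of links that still need parsing.
$pendingLinkIds = range(1, 1200);

// At most threads * count links are taken in one run.
$selected = array_slice($pendingLinkIds, 0, $threads * $countPerThread);

// Split the selection into one batch per thread.
$batches = array_chunk($selected, $countPerThread);

echo count($batches), " batches, ", count($selected), " links in total\n";
// prints "5 batches, 500 links in total"
```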

Configuration

Configuration file example

<?php

return [
    'tables' => [
        'links' => 'parsed_links',
        'data' => 'parsed_data',
    ],

    'models' => [
        'links' => 'App\Models\ParsedLink',
        'data' => 'App\Models\ParsedData',
    ],
    
    'sites' => [
        'ivi' => [ // Key
            'main_url' => 'https://www.ivi.ru/', // Site url
            
            'links' => [ // Links parsing rules
                'type' => 'page', // Type (can be 'page' or 'from_to')
                'rules' => [ // Rules
                    'url' => 'https://www.ivi.ru/movies/', // main link
                    'page_param' => 'page{$page}', // page param pattern
                    'parse_rules' => [ // parsing rules
                        'type' => 'html',
                        'selector' => '.poster-badge > a', // Selector of element
                        'content' => 'href' // Concrete attribute from where to get link
                    ]
                ]
            ],
            
            'data' => [ // Data parsing rules
                'image' => [ // Key
                    'type' => 'html',
                    'selector' => 'video-info', // Selector of element
                    'content' => 'data-poster' // Concrete attribute from where to get data
                ],
                'title' => [
                    'type' => 'html',
                    'selector' => 'video-info',
                    'content' => 'data-title'
                ],
                'description' => [ // Key
                    'type' => 'html',
                    'selector' => '.description[itemprop="description"] p', // Selector of element
                    'content' => '#text',
                    'extra' => 'implode' // Concatenate the results into one string
                ]
            ]
        ]
    ],
    
    'timeout' => 5, // Sleep after parsing 1 page
    
    'proxy_servers' => [] // Proxy ips
];
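
To make the page_param pattern more concrete, here is a minimal sketch of how a page URL could be assembled from the url and page_param rules. The helper buildPageUrl is hypothetical, and the assumption that the package simply appends the filled-in pattern to url is not confirmed by this README.

```php
<?php

// Hypothetical helper: how the package joins 'url' and 'page_param' is an assumption.
function buildPageUrl(string $url, string $pageParam, int $page): string
{
    // '{$page}' is a literal placeholder, because the config uses single quotes.
    return $url . str_replace('{$page}', (string) $page, $pageParam);
}

$rules = [
    'url' => 'https://www.ivi.ru/movies/',
    'page_param' => 'page{$page}',
];

// Print the URLs for pages 1 to 3.
foreach (range(1, 3) as $page) {
    echo buildPageUrl($rules['url'], $rules['page_param'], $page), "\n";
}
```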

There are 4 global rule types: html, regexp, json and collector.

For type html use selector.

For type regexp use pattern.

For type json use key_chain and main_key.

For type collector use from and collect.

The content rule can simply be an attribute name or #text for the text node, but more advanced configs are available too.

For example, suppose you want to parse the following HTML and get the text nodes of the <p> and <h3> elements as data.

<div class="main_container">
    <span>I don't need this</span>
    <span>I don't need this</span>
    <span>I don't need this</span>
    <p>I NEED THIS</p>
    <span>I don't need this</span>
    <h3>I NEED THIS TOO</h3>
</div>

Your config must look like this

'type' => 'collector',
'from' => [
    'type' => 'html',
    'selector' => '.main_container'
],
'collect' => [
    [
        'type' => 'html',
        'selector' => 'p',
        'content' => '#text'
    ],
    [
        'type' => 'html',
        'selector' => 'h3',
        'content' => '#text'
    ]
],
'extra' => 'implode'

Or if your HTML looks like

<div class="main_container">
    <span>I don't need this</span>
    <span>I don't need this</span>
    <span>I don't need this</span>
    <p>I NEED THIS</p>
    <span>I don't need this</span>
    <h3 data-content="I NEED THIS TOO"></h3>
</div>

Your config must look like this

'type' => 'collector',
'from' => [
    'type' => 'html',
    'selector' => '.main_container'
],
'collect' => [
    [
        'type' => 'html',
        'selector' => 'p',
        'content' => '#text' // textContent
    ],
    [
        'type' => 'html',
        'selector' => 'h3',
        'content' => 'data-content' //getAttribute('data-content')
    ]
]
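
What a collector rule does can be approximated with PHP's built-in DOM extension. This is a rough sketch, not the package's code: it assumes that from picks a container, that each collect entry is evaluated inside it, and that '#text' means the element's text content while any other value is read as an attribute. The XPath expressions stand in for the CSS selectors from the config.

```php
<?php

$html = <<<'HTML'
<div class="main_container">
    <span>I don't need this</span>
    <p>I NEED THIS</p>
    <h3 data-content="I NEED THIS TOO"></h3>
</div>
HTML;

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// 'from': the container (XPath equivalent of the '.main_container' selector).
$container = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " main_container ")]'
)->item(0);

// 'collect': each rule names an element and where to read its value from.
$collect = [
    ['selector' => 'p', 'content' => '#text'],
    ['selector' => 'h3', 'content' => 'data-content'],
];

$results = [];
foreach ($collect as $rule) {
    // Search only inside the container.
    foreach ($xpath->query('.//' . $rule['selector'], $container) as $node) {
        $results[] = $rule['content'] === '#text'
            ? trim($node->textContent)               // text node
            : $node->getAttribute($rule['content']); // named attribute
    }
}

print_r($results);
```

With 'extra' => 'implode', the collected values would then be joined into a single string.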

Config example for parsing with regexp

Config

'data' => [
    'image' => [
        'type' => 'regexp',
        'pattern' => '/<img\s+class="image"\s+src="(.*?)"/', // Non-greedy capture of the src value
        'content' => 1 // Capture group index in the preg_match_all results
    ]
]
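
The comment on content refers to preg_match_all, so the behaviour of a regexp rule can be illustrated with that function directly. The sample HTML below is made up, and the pattern uses a non-greedy capture group so that each match stops at the closing quote of src. How the package post-processes the matches is not shown here.

```php
<?php

// Made-up sample markup with two matching images.
$html = '<div><img class="image" src="/posters/1.jpg">'
      . '<img class="image" src="/posters/2.jpg"></div>';

// Non-greedy capture so each match ends at the closing quote of src.
$pattern = '/<img\s+class="image"\s+src="(.*?)"/';
$content = 1; // index of the capture group to keep

preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
print_r($matches[$content]);
```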

Config example for parsing links from json

Config

'links' => [ // Links parsing rules
    'type' => 'from_to',
    'rules' => [
        'url' => 'https://api.ivi.ru/mobileapi/catalogue/v5/?sort=pop&genre_operator=and&category=14&fields=id&app_version=2268&session=cc59e9105793659549597370_1567009192-0Y-qrstvPnybiFr6esx6glw',
        'from' => 'from={$from}',
        'to' => 'to={$to}',
        'parse_rules' => [
            'type' => 'json',
            'key_chain' => 'result',
            'main_key' => 'id',
            'value_prefix' => '/watch/'
        ]
    ]
]
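
One plausible reading of the json rule is sketched below: key_chain selects the list inside the decoded response, main_key names the field taken from each item, and value_prefix is prepended to form the link. Both this interpretation and the sample response body are assumptions, since the README does not show the real API response.

```php
<?php

// Made-up response body; the real API response shape is an assumption.
$json = '{"result": [{"id": 101}, {"id": 102}, {"id": 103}]}';

$rules = [
    'key_chain' => 'result',
    'main_key' => 'id',
    'value_prefix' => '/watch/',
];

$decoded = json_decode($json, true);

// Assumed reading: key_chain selects the list of items.
$items = $decoded[$rules['key_chain']] ?? [];

// main_key picks the field of each item, value_prefix is prepended to it.
$links = array_map(function (array $item) use ($rules): string {
    return $rules['value_prefix'] . $item[$rules['main_key']];
}, $items);

print_r($links);
```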

How to test without changing the config file every time?

You can pass the name of a test config file to the parser command with the --config argument.

Create the file app/parser_configs/ivi_parser.php with the following content

<?php
 
return [
    'ivi' => [
        'main_url' => 'https://www.ivi.ru/', // Site url
        'links' => [ // Links parsing rules
            'type' => 'page', // Type (can be 'page' or 'from_to')
            'rules' => [ // Rules
                'url' => 'https://www.ivi.ru/movies/', // main link
                'page_param' => 'page{$page}', // page param pattern
                'parse_rules' => [ // parsing rules
                    'type' => 'html',
                    'selector' => '.poster-badge > a', // Selector of element
                    'content' => 'href' // Concrete attribute from where to get link
                ]
            ]
        ],
        'data' => [ // Data parsing rules
            'image' => [ // Key
                'type' => 'html',
                'selector' => 'video-info', // Selector of element
                'content' => 'data-poster' // Concrete attribute from where to get data
            ],
            'title' => [ // Key
                'type' => 'html',
                'selector' => 'video-info', // Selector of element
                'content' => 'data-title' // Concrete attribute from where to get data
            ],
            'description' => [ // Key
                'type' => 'html',
                'selector' => '.description[itemprop="description"] p', // Selector of element
                'content' => '#text',
                'extra' => 'implode'
            ]
        ]
    ]
];

Then run the command like this

php artisan ultra-parser:data --config=app/parser_configs/ivi_parser.php --key=ivi

Valid values for the extra field

  • trim: trim whitespace
  • strip_tags: strip HTML tags
  • implode: concatenate the results into one string
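
The three values share their names with PHP built-in functions, so their effect can be illustrated with those functions directly. This is a sketch only: the sample strings are made up, and the single-space separator passed to implode is an assumption, since the README does not say which separator the package uses.

```php
<?php

// Made-up raw results, as they might come out of a parsing rule.
$raw = ["  <b>First</b> paragraph  ", "Second paragraph"];

// trim: remove leading and trailing whitespace from each result.
$trimmed = array_map('trim', $raw);

// strip_tags: remove HTML tags from each result.
$stripped = array_map('strip_tags', $trimmed);

// implode: join all results into one string (separator is an assumption).
$joined = implode(' ', $stripped);

echo $joined, "\n";
// prints "First paragraph Second paragraph"
```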

Important!

After changing the main config file, don't forget to run php artisan config:clear