dachcom-digital/dynamic-search-data-provider-crawler

v3.0.1 2023-12-14 09:14 UTC

Last update: 2024-12-19 12:54:30 UTC



A spider crawler extension for Pimcore Dynamic Search.

Release Plan

Installation

"require" : {
    "dachcom-digital/dynamic-search" : "~3.0.0",
    "dachcom-digital/dynamic-search-data-provider-crawler" : "~3.0.0"
}

Dynamic Search Bundle

You need to install and enable the Dynamic Search Bundle first; see the Dynamic Search Bundle documentation for details. After that, proceed as follows:

Add the bundle to bundles.php:

<?php

return [
    \DsWebCrawlerBundle\DsWebCrawlerBundle::class => ['all' => true],
];

Basic Setup

dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        own_host_only: true
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        user_invalid_links:
                            - '@^http://your-domain.test\/members.*@i'
                    single_dispatch:
                        host: 'http://your-domain.test'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
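
The valid_links and user_invalid_links entries above are PCRE patterns that use @ as the delimiter and the i (case-insensitive) flag. The following is a minimal sketch of how such patterns behave against a few sample URLs. It assumes that a URL is followed only when it matches at least one valid pattern and no invalid pattern; that decision rule, the helper name wouldBeCrawled, and the sample URLs are illustrative assumptions, not part of the bundle's API.

```php
<?php

// Patterns copied verbatim from the configuration above.
$validLinks = ['@^http://your-domain.test.*@i'];
$userInvalidLinks = ['@^http://your-domain.test\/members.*@i'];

/**
 * Hypothetical helper (not provided by the bundle): returns true when the URL
 * matches at least one "valid" pattern and none of the "invalid" patterns.
 * This mirrors the assumed filtering rule, not the crawler's actual code.
 */
function wouldBeCrawled(string $url, array $valid, array $invalid): bool
{
    $matchesAny = static function (array $patterns) use ($url): bool {
        foreach ($patterns as $pattern) {
            // preg_match returns 1 on a match, 0 on no match, false on error.
            if (preg_match($pattern, $url) === 1) {
                return true;
            }
        }

        return false;
    };

    return $matchesAny($valid) && !$matchesAny($invalid);
}

// Sample URLs (made up for illustration).
var_dump(wouldBeCrawled('http://your-domain.test/en/news', $validLinks, $userInvalidLinks));
var_dump(wouldBeCrawled('http://your-domain.test/members/profile', $validLinks, $userInvalidLinks));
var_dump(wouldBeCrawled('http://other-domain.test/', $validLinks, $userInvalidLinks));
```

Under the assumed rule, only the first URL passes: the second is excluded by the members pattern and the third never matches a valid pattern. Note that the dots in your-domain.test are unescaped in these patterns, so each one matches any single character; write them as \. if a literal dot is required.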

Provider Options

always

full_dispatch

single_dispatch

Resource Normalizer

DefaultResourceNormalizer

Identifier: web_crawler_default_resource_normalizer
Normalize simple documents.
Options: none

LocalizedResourceNormalizer

Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents.

Options:

Transformer

Scaffolder

HttpResponseHtmlDataScaffolder

Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.

HttpResponsePdfDataScaffolder

Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.

PimcoreElementScaffolder

Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.

Field Transformer

UriExtractor

Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

LanguageExtractor

Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

MetaExtractor

Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder

Return Type: string|null
Options:

HtmlTagExtractor

Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder

Return Type: string|null
Options: none

TextExtractor

Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null

TitleExtractor

Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder

Return Type: string|null
Options: none

Copyright and License

Copyright: DACHCOM.DIGITAL
For licensing details please visit LICENSE.md

Upgrade Info

Before updating, please check our upgrade notes!