dachcom-digital / dynamic-search-data-provider-crawler
Type:dynamic-search-provider-bundle
Requires
- dachcom-digital/dynamic-search: ^3.0 || ^4.0
- pimcore/pimcore: ^11.0
- vdb/php-spider: ^0.7
Requires (Dev)
- codeception/codeception: ^5.0
- codeception/module-symfony: ^3.1
- phpstan/phpstan: ^1.0
- phpstan/phpstan-symfony: ^1.0
- symplify/easy-coding-standard: ^9.0
README
A spider crawler extension for Pimcore Dynamic Search.
Caution
This connector has reached its end of life and only receives compatibility updates. It will not be developed further. Use the Trinity Data Provider instead!
Release Plan
Release | Supported Pimcore Versions | Supported Symfony Versions | Release Date | Maintained | Branch |
---|---|---|---|---|---|
3.x | 11.0 | ^6.4 | 28.09.2023 | Feature Branch | master |
2.x | 10.0 - 10.6 | ^5.4 | 19.12.2021 | No | 2.x |
1.x | 6.6 - 6.9 | ^4.4 | 18.04.2021 | No | 1.x |
Installation

    "require" : {
        "dachcom-digital/dynamic-search" : "~3.0.0",
        "dachcom-digital/dynamic-search-data-provider-crawler" : "~3.0.0"
    }
Dynamic Search Bundle
You need to install and enable the Dynamic Search Bundle first. Read more about it here. After that, proceed as follows:

Add the bundle to bundles.php:

    <?php

    return [
        \DsWebCrawlerBundle\DsWebCrawlerBundle::class => ['all' => true],
    ];
Basic Setup
    dynamic_search:
        context:
            default:
                data_provider:
                    service: 'web_crawler'
                    options:
                        always:
                            own_host_only: true
                        full_dispatch:
                            seed: 'http://your-domain.test'
                            valid_links:
                                - '@^http://your-domain.test.*@i'
                            user_invalid_links:
                                - '@^http://your-domain.test\/members.*@i'
                        single_dispatch:
                            host: 'http://your-domain.test'
                    normalizer:
                        service: 'web_crawler_localized_resource_normalizer'
Provider Options
always
Name | Default Value | Description |
---|---|---|
own_host_only | false | |
allow_subdomains | false | |
allow_query_in_url | false | |
allow_hash_in_url | false | |
allowed_mime_types | ['text/html', 'application/pdf'] | |
allowed_schemes | ['http'] | |
content_max_size | 0 | |
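
As a hedged sketch of how the `always` options might be overridden, the fragment below reuses the nesting from the Basic Setup and widens the crawl to `https` links and subdomains. The table leaves the descriptions empty, so the comments describe what the option names suggest and are assumptions, not documented behavior.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    always:
                        # assumption: keeps the crawl on the seed host
                        own_host_only: true
                        # assumption: lets subdomains of that host pass the host check
                        allow_subdomains: true
                        # the documented default is ['http'], so 'https' is listed explicitly here
                        allowed_schemes: ['http', 'https']
                        # same values as the documented default, repeated for visibility
                        allowed_mime_types: ['text/html', 'application/pdf']
```

If a site is served over `https` only, the default `allowed_schemes` value is presumably the first setting worth checking.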
full_dispatch
Name | Default Value | Description |
---|---|---|
seed | null | |
valid_links | [] | |
user_invalid_links | [] | |
max_link_depth | 15 | |
max_crawl_limit | 0 | |
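
A minimal sketch of a bounded full crawl, assuming `max_link_depth` and `max_crawl_limit` cap the link depth and the number of crawled resources as their names suggest. The values `5` and `500` are purely illustrative, and the source does not say how the default `max_crawl_limit` of `0` is interpreted.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                options:
                    full_dispatch:
                        seed: 'http://your-domain.test'
                        valid_links:
                            - '@^http://your-domain.test.*@i'
                        # illustrative value; the documented default is 15
                        max_link_depth: 5
                        # illustrative value; the documented default is 0
                        max_crawl_limit: 500
```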
single_dispatch
Name | Default Value | Description |
---|---|---|
host | null | |
Resource Normalizer
DefaultResourceNormalizer
Identifier: web_crawler_default_resource_normalizer
Normalize simple documents
Options: none
LocalizedResourceNormalizer
Identifier: web_crawler_localized_resource_normalizer
Scaffold localized documents
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
locales | all pimcore enabled languages | array | |
skip_not_localized_documents | true | bool | If false, an exception is thrown when a document/object has no valid locale |
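
The normalizer options above could be set roughly as follows. This is a sketch under assumptions: the `options` key below `normalizer` is modeled on the `data_provider` block and is not confirmed by this README, and the locales `en` and `de` are placeholders.

```yaml
dynamic_search:
    context:
        default:
            data_provider:
                service: 'web_crawler'
                normalizer:
                    service: 'web_crawler_localized_resource_normalizer'
                    # assumption: normalizer options live under an `options` key
                    options:
                        # placeholder locales; the default is all enabled Pimcore languages
                        locales: ['en', 'de']
                        # false makes resources without a valid locale raise an exception
                        skip_not_localized_documents: false
```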
Transformer
Scaffolder
HttpResponseHtmlDataScaffolder
Identifier: http_response_html_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type text/html.
HttpResponsePdfDataScaffolder
Identifier: http_response_pdf_scaffolder
Simple object scaffolder.
Supported types: VDB\Spider\Resource with content-type application/pdf.
PimcoreElementScaffolder
Identifier: pimcore_element_scaffolder
Simple object scaffolder.
Supported types: Asset, Document, DataObject\Concrete.
Field Transformer
UriExtractor
Identifier: resource_uri_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
LanguageExtractor
Identifier: resource_language_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
MetaExtractor
Identifier: resource_meta_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
name | null | string | The name of the meta tag to fetch the value from |
HtmlTagExtractor
Identifier: resource_html_tag_content_extractor
Supported Scaffolder: http_response_html_scaffolder
Return Type: string|null
Options: none
TextExtractor
Identifier: resource_text_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options:
Name | Default Value | Allowed Type | Description |
---|---|---|---|
content_start_indicator | `<!-- main-content -->` | string | Marks the beginning of the indexable page content |
content_end_indicator | `<!-- /main-content -->` | string | Marks the end of the indexable page content |
content_exclude_start_indicator | null | `null\|string` | Marks the beginning of the text to be excluded from indexing |
content_exclude_end_indicator | null | `null\|string` | Marks the end of the text to be excluded from indexing |
TitleExtractor
Identifier: resource_title_extractor
Supported Scaffolder: http_response_html_scaffolder, http_response_pdf_scaffolder
Return Type: string|null
Options: none
Copyright and License
Copyright: DACHCOM.DIGITAL
For licensing details please visit LICENSE.md
Upgrade Info
Before updating, please check our upgrade notes!