ttree / contentinsight
Tools to extract basic content inventory information from an existing website
Type: typo3-flow-package
This package is auto-updated.
Last update: 2024-11-26 05:01:37 UTC
![Build Status](http://gitlab.ttree.ch:8080/buildStatus/icon?job=OSS ContentInsight Master Commit)
This TYPO3 Flow package provides a CLI tool to extract a content inventory CSV from an existing website.
This package is under development and considered beta. It requires Flow 2.3.
Features
- Extract website structure and basic metadata
- Support crawling presets
- Flexible report building (includes a CSV report builder, but you can register your own report builder)
- Skip URIs that match a regular expression
- Sort inventory based on document tree structure
Todos
- Generate human-readable page IDs (such as 1, 1.1, 1.2, 2, 2.1, 2.2, ...)
- Update report / multiple index support
- Get analytics data from Google Analytics
Configuration
Check Configuration/Settings.yaml for the detailed configuration.
By default, this package caches all raw HTTP requests for one day. You can change this setting in your own Settings.yaml and Caches.yaml.
Base Preset
The base preset is automatically merged with all presets. You can enable or disable any property with the setting presets.[preset_name].properties.[property_name].enabled.
    Ttree:
      ContentInsight:
        presets:
          '*':
            properties:
              'pageTitle':
                enabled: TRUE
              'navigationTitle':
                enabled: TRUE
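The example above only switches properties on. As a minimal sketch of the opposite case, assuming a preset named custom (an illustrative name) and assuming that a preset-level value overrides the one merged in from the base preset, a property could be turned off for that preset alone:

```yaml
Ttree:
  ContentInsight:
    presets:
      # 'custom' is a hypothetical preset name used for illustration
      'custom':
        properties:
          # Assumed to override the base preset ('*') for this preset only
          'navigationTitle':
            enabled: FALSE
```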
Custom Preset
You can define custom presets to crawl different kinds of information. With the class setting you can use your own processor implementation to get information from the current URI. Your processor must implement Ttree\ContentInsight\CrawlerProcessor\ProcessorInterface:
    Ttree:
      ContentInsight:
        presets:
          'custom':
            properties:
              'pageTitle':
                class: 'Your\Package\CrawlerProcessor\PageTitleProcessor'
              'metaDescription':
                enabled: TRUE
              'metaKeywords':
                enabled: TRUE
              'firstLevelHeader':
                enabled: TRUE
How to build a report?
The package supports CSV reporting, but you can register your own report builder. Check the Settings.yaml:
    Ttree:
      ContentInsight:
        presets:
          'custom':
            reportConfigurations:
              'csv':
                enabled: TRUE
                renderType: 'Csv'
                renderTypeOptions:
                  displayColumnHeaders: TRUE
                reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
                reportPrefix: 'content-inventory-report'
                properties:
                  'id':
                    label: 'ID'
                  'pageTitle':
                    label: 'Page Title'
                  'navigationTitle':
                    label: 'Navigation Title'
                  'externalLink':
                    label: 'External Link'
                    postProcessor: 'Boolean'
                  'currentUri':
                    label: 'URL'
                  'metaDescription':
                    label: 'Meta Description'
                  'metaKeywords':
                    label: 'Meta Keywords'
                  'firstLevelHeaderCount':
                    label: 'Main Header Count (H1)'
                  'firstLevelHeaderContent':
                    label: 'Main Header Content (H1)'
                  'remark':
                    label: 'Crawling Remark'
The keys in the properties section must match the keys produced by the CrawlerProcessor object.
The position of each column can be specified with the following syntax: position: '<position-string>'
The <position-string> supports one of the following forms:
start (<weight>)
end (<weight>)
before <key> (<weight>)
after <key> (<weight>)
<numerical-order>
Example
    Ttree:
      ContentInsight:
        presets:
          'custom':
            reportConfigurations:
              'csv':
                enabled: TRUE
                renderType: 'Csv'
                renderTypeOptions:
                  displayColumnHeaders: TRUE
                reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
                reportPrefix: 'content-inventory-report'
                properties:
                  'id':
                    label: 'ID'
                    position: '<position-string>'
                  'pageTitle':
                    label: 'Page Title'
                    position: '<position-string>'
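To make the placeholder concrete, here is a hedged sketch that fills in three of the position forms listed above. The property keys are taken from the CSV report example; the resulting column order (ID first, Page Title right after it, Crawling Remark last) is the intent of this configuration and has not been checked against the package.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        reportConfigurations:
          'csv':
            properties:
              'id':
                label: 'ID'
                # Intended to be the first column
                position: 'start'
              'pageTitle':
                label: 'Page Title'
                # Intended to follow the 'id' column directly
                position: 'after id'
              'remark':
                label: 'Crawling Remark'
                # Intended to be the last column
                position: 'end'
```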
For a single crawling preset you can register multiple reports if required. For each property you can register a post processor if you need to manipulate the property in the report; see BooleanPostProcessor for a basic example.
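As an illustrative sketch of registering two reports for one preset, the configuration below reuses the built-in Csv render type twice. The second report key (csvLinks) and its reportPrefix are hypothetical names chosen for this example; the Boolean post processor on externalLink is the one shown in the report configuration above.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        reportConfigurations:
          # Full inventory report
          'csv':
            enabled: TRUE
            renderType: 'Csv'
            reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
            reportPrefix: 'content-inventory-report'
            properties:
              'id':
                label: 'ID'
              'pageTitle':
                label: 'Page Title'
              'currentUri':
                label: 'URL'
          # Hypothetical second report restricted to link information
          'csvLinks':
            enabled: TRUE
            renderType: 'Csv'
            reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
            reportPrefix: 'content-inventory-links'
            properties:
              'currentUri':
                label: 'URL'
              'externalLink':
                label: 'External Link'
                # Post processor applied to the value before it is written
                postProcessor: 'Boolean'
```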
How to skip specific URIs?
You can define invalid URI patterns in your crawling presets:
    Ttree:
      ContentInsight:
        presets:
          'custom':
            invalidUriPatterns:
              'javascript':
                pattern: '@^javascript\:void\(0\)$@'
              'mailto':
                pattern: '@^mailto\:.*@'
              'anchor':
                pattern: '@^#.*@'
                message: 'Link to anchor'
If the pattern has a message, all URLs matching the pattern will be logged. By default the crawler skips those URLs silently.
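A further sketch of the two behaviours side by side, using two hypothetical patterns that are not part of the package defaults: tel links are skipped silently, while links ending in .pdf are skipped and logged because a message is set. The patterns follow the same PCRE style with @ delimiters as the patterns in the configuration above.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        invalidUriPatterns:
          # Hypothetical pattern: no message, so matches are skipped silently
          'tel':
            pattern: '@^tel\:.*@'
          # Hypothetical pattern: the message causes matching URLs to be logged
          'pdf':
            pattern: '@\.pdf$@'
            message: 'Link to PDF document'
```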
Usage
To get the complete website inventory:
# flow contentinventor:extract --base-url http://www.domain.com
Or, to limit the crawler to a part of the website:
# flow contentinventor:extract --base-url http://www.domain.com/products
You can select a crawling preset:
# flow contentinventor:extract --base-url http://www.domain.com/products --preset default