talan-hdf / semantic-suggestion
TYPO3 extension for suggesting semantically related pages
Installs: 2 055
Dependents: 0
Suggesters: 0
Security: 0
Stars: 10
Watchers: 4
Forks: 5
Open Issues: 0
Type:typo3-cms-extension
pkg:composer/talan-hdf/semantic-suggestion
Requires
- cywolf/nlp-tools: ^1.0||dev-main
- typo3/cms-core: ^12.0 || ^13.0
- typo3/cms-scheduler: ^12.0 || ^13.0
- dev-master
- 3.2.0
- 3.1.1
- 3.1.0
- 2.3.1
- 2.3.0
- 2.2.0
- 2.1.1
- 2.1.0
- 2.0.0
- 1.5.2.x-dev
- 1.5.1.x-dev
- 1.5.0.x-dev
- 1.4.0
- 1.3.2.x-dev
- 1.3.2
- 1.3.1.x-dev
- 1.3.1
- 1.3.0.x-dev
- 1.3.0
- 1.2.0
- 1.1.0
- 1.1.0-beta
- 1.0.9
- 1.0.8
- 1.0.7
- 1.0.6
- 1.0.5
- 1.0.4
- 1.0.3
- 1.0.1
- dev-test_debug
- dev-v12compatibility
- dev-bugfix/backend-module
- dev-nlp_1_5_0
- dev-feature/architecture-refactor
- dev-affinageCalculs
- dev-news
- dev-branch110
- dev-cleaning
- dev-phpnlp
- dev-nlp
- dev-nlpGoogle
- dev-recencyWeight
- dev-UnitTests
- dev-module
- dev-talan
This package is auto-updated.
Last update: 2025-10-09 07:49:31 UTC
README
Elevate your TYPO3 website with intelligent, AI-powered content recommendations using advanced NLP technology.
Introduction
The Semantic Suggestion extension revolutionizes the way related content is presented on TYPO3 websites. Moving beyond traditional category-based functionalities, this extension employs advanced NLP (Natural Language Processing) and TF-IDF (Term Frequency-Inverse Document Frequency) analysis to create genuinely relevant content connections.
Since version 2.0.0, similarity scores are stored in a dedicated database table (tx_semanticsuggestion_similarities) instead of the TYPO3 cache. A Scheduler task handles calculation and storage, ensuring persistence and performance.
๐ NEW in v3.0.0: Enhanced with nlp_tools integration for professional-grade text analysis, featuring intelligent language detection for TYPO3 12/13 multilingual sites and optimized German language support.
Key Benefits:
- ๐ฏ Highly Relevant Links: TF-IDF vectorization with advanced NLP creates genuinely relevant content connections
- ๐ Multilingual Intelligence: Smart language detection for TYPO3 12/13 sites with automatic content analysis
- ๐ฉ๐ช German Language Optimized: Professional stemming and handling of compound words (Automobilindustrie โ Automobil)
- ๐ Increased User Engagement: Keep visitors on your site longer by offering truly related content
- ๐ Semantic Cocoon: Contributes to a high-quality semantic network, enhancing SEO and navigation
- โก Intelligent Automation: Reduces manual linking work while improving internal link quality by 30-50%
Performance Considerations
- The similarity calculation process performed by the Scheduler task can take time, especially on sites with a large number of pages (>500 pages might require 30s or more depending on the server).
- Displaying suggestions and statistics (reading from the database) is optimized.
- Use the backend module to assess the performance and relevance of suggestions for your specific setup.
๐ What's New in Version 3.0.0
๐ง Advanced NLP Integration
- TF-IDF Vectorization: Professional-grade similarity calculation replacing basic cosine similarity
- Intelligent Language Detection: Hybrid approach using TYPO3 configuration + content analysis
- Advanced Text Processing: Stemming, stop word removal, and text normalization via nlp_tools
๐ Enhanced Multilingual Support
- TYPO3 12/13 Compatibility: Uses modern Context API and Site Configuration
- Smart Language Detection:
- Priority 1: Uses TYPO3 site language configuration (
de_DE.UTF-8โde) - Priority 2: Falls back to intelligent content analysis when TYPO3 config is uncertain
- Priority 3: Confidence scoring prevents misdetection of mixed-language content
- Priority 1: Uses TYPO3 site language configuration (
- Multi-site Ready: Supports complex multilingual TYPO3 setups
๐ฉ๐ช German Language Excellence
- Professional Stemming: Uses Wamania\Snowball for German morphology
- Compound Word Support: Recognizes relationships between "Automobil" and "Automobilindustrie"
- Umlaut Handling: Perfect support for รค, รถ, รผ, ร characters
- German Stop Words: Optimized list including der, die, das, ein, eine, mit, von, etc.
๐ Enhanced Algorithm
- TF-IDF Vectors: More accurate similarity scoring than traditional word counting
- Confidence Thresholds: Prevents false positives in language detection
- Fallback Mechanisms: Graceful degradation if NLP processing fails
- Performance Caching: TF-IDF vectors and stemming results are cached
Table of Contents
- Introduction
- Features
- Requirements
- Installation
- Language Detection System
- Configuration
- Usage (Frontend)
- Backend Module
- Scheduler Task
- Similarity Logic (TF-IDF Enhanced)
- Multilingual Sites Setup
- Display Customization
- Debugging & Performance
- Migration from v2.x
- Contributing
- License
- Support
Features
๐ง Core Functionality
- Advanced TF-IDF Analysis: Professional semantic similarity calculation using vectorization
- Database Storage: Persistent similarity scores in
tx_semanticsuggestion_similaritiestable - Automated Processing: Scheduler task for background calculation and updates
- Frontend Integration: Clean API for displaying suggestions (title, media, excerpt)
- Backend Analytics: Comprehensive statistics and performance metrics
๐ Multilingual & NLP Features
- Intelligent Language Detection: Hybrid TYPO3 + content analysis approach
- Professional Text Processing: Stemming, stop word removal, text normalization
- German Language Excellence: Optimized for compound words and umlauts
- Multi-site Support: Handles complex TYPO3 multilingual configurations
- Confidence Scoring: Prevents false language detection in mixed content
โ๏ธ Configuration & Performance
- Highly Configurable: TypoScript settings for display and Scheduler for analysis scope
- Performance Optimized: TF-IDF vector caching and intelligent fallbacks
- Flexible Exclusions: Page-level exclusions for analysis and/or display
- Debug Mode: Comprehensive logging for development and troubleshooting
Requirements
- TYPO3: 12.0.0 - 13.9.99
- PHP: 8.0 or higher
- Dependencies:
cywolf/nlp-tools(automatically installed via Composer)wamania/php-stemmer(for advanced stemming support)
Installation
Composer Installation (recommended)
- Install the extension with dependencies:
composer require talan-hdf/semantic-suggestion composer require cywolf/nlp-tools
- Activate both extensions in the TYPO3 Extension Manager.
- Clear TYPO3 cache:
./vendor/bin/typo3 cache:flush - (Optional) Run unit tests to verify installation:
./vendor/bin/phpunit --configuration phpunit.xml.dist --testsuite unit
Manual Installation
- Download the extension from the TER or GitHub.
- Upload the archive to
typo3conf/ext/. - Activate the extension in the Extension Manager.
Language Detection System
๐ง How Language Detection Works
The extension uses a hybrid approach that combines TYPO3's multilingual configuration with intelligent content analysis:
Detection Priority (Cascade System):
-
๐ฏ Priority 1: TYPO3 Site Configuration (for TYPO3 12/13)
// Uses TYPO3's Context API and Site Configuration $site = $request->getAttribute('site'); $siteLanguage = $site->getLanguageById($languageId); $locale = $siteLanguage->getLocale(); // e.g., "de_DE.UTF-8" return strtolower(substr($locale->getLanguageCode(), 0, 2)); // โ "de"
-
๐ Priority 2: Content Analysis (when TYPO3 config is uncertain)
- Extracts character trigrams from text content
- Compares against language profiles built from stop words
- Uses confidence scoring to prevent false positives
-
โ๏ธ Priority 3: Confidence Verification
// If confidence difference < 30%, trust TYPO3 context if (($firstScore - $secondScore) / $firstScore < 0.3) { return $this->getTypo3LanguageContext(); }
๐ Multilingual Site Examples
Example 1: Well-configured TYPO3 Site
# site/config.yaml languages: - languageId: 0 locale: 'en_US.UTF-8' # โ nlp_tools detects "en" - languageId: 1 locale: 'de_DE.UTF-8' # โ nlp_tools detects "de" - languageId: 2 locale: 'fr_FR.UTF-8' # โ nlp_tools detects "fr"
Result: nlp_tools uses TYPO3 configuration directly via Site API
Example 2: Mixed Content Detection
Page with:
- sys_language_uid = 0 (English)
- But content: "Die deutsche Automobilindustrie ist sehr wichtig"
Detection results:
- Content analysis: 85% German confidence
- TYPO3 context: English
- Final decision: German (high confidence overrides TYPO3)
Example 3: Uncertain Content
Detection scores: French: 45%, German: 42%
Confidence difference: (45-42)/45 = 6.7% < 30%
โ Falls back to TYPO3 language context
๐ nlp_tools Integration
Yes, the extension fully integrates with nlp_tools:
// In PageAnalysisService.php protected function getCurrentLanguage(): string { // Uses nlp_tools LanguageDetectionService $detectedLanguage = $this->languageDetector->detectLanguage(''); return $detectedLanguage; // Smart hybrid detection }
The language detection leverages:
- LanguageDetectionService: Trigram analysis and confidence scoring
- TextAnalysisService: Advanced text processing and stemming
- TextVectorizerService: TF-IDF vectorization for similarity calculation
๐ Language Support Matrix
All languages benefit from nlp_tools integration, but with different optimization levels:
๐ฅ EXPERT Level Support (Maximum Optimization)
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| ๐ฉ๐ช German | de |
โ Advanced (Wamania\Snowball) | โ Specialized | โ Full | +40-50% |
German Features:
- Professional compound word handling (
AutomobilindustrieโAutomobil) - Perfect umlaut support (
รค, รถ, รผ, ร) - Specialized stop words (
der, die, das, ein, eine, mit, von, zu...)
๐ฅ ADVANCED Level Support (Professional Optimization)
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| ๐ซ๐ท French | fr |
โ Wamania\Snowball | โ Specialized | โ Full | +30-40% |
| ๐ฌ๐ง English | en |
โ Wamania\Snowball | โ Specialized | โ Full | +30-40% |
| ๐ช๐ธ Spanish | es |
โ Wamania\Snowball | โ Specialized | โ Full | +30-40% |
๐ฅ STANDARD Level Support (TF-IDF Optimization)
| Language | Code | Stemming | Stop Words | TF-IDF | Improvement |
|---|---|---|---|---|---|
| ๐ฎ๐น Italian | it |
โ Basic tokenization | โ ๏ธ Generic | โ Full | +20-30% |
| ๐ต๐น Portuguese | pt |
โ Basic tokenization | โ ๏ธ Generic | โ Full | +20-30% |
| ๐ณ๐ฑ Dutch | nl |
โ Basic tokenization | โ ๏ธ Generic | โ Full | +20-30% |
| All Others | * |
โ Basic tokenization | โ ๏ธ Generic | โ Full | +20-25% |
๐ What Every Language Gets
โ All languages benefit from:
- TF-IDF Vectorization: Professional semantic similarity calculation
- Intelligent Language Detection: Hybrid TYPO3 + content analysis
- Advanced Text Cleaning: Unicode normalization, accent removal
- Performance Caching: TF-IDF vectors and processing results cached
// ALL languages get TF-IDF processing $tfidfResult = $textVectorizer->createTfIdfVectors([$text1, $text2], $language); $similarity = $textVectorizer->cosineSimilarity($vector1, $vector2); // Only de/fr/en/es get advanced stemming if (in_array($language, ['de', 'fr', 'en', 'es'])) { $stemmedWords = $textAnalyzer->stem($text, $language); } else { $stemmedWords = $textAnalyzer->tokenize($text); // Basic tokenization }
๐ฏ Language-Specific Configuration Examples
๐ฉ๐ช German Sites (Maximum Performance)
# Scheduler Configuration: minimumSimilarity = 0.15
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 1 # CRITICAL for compound words
proximityThreshold = 0.25 # Lower threshold (compound matching)
minTextLength = 50 # German compound words in short text
analyzedFields {
title = 2.0 # German titles contain key compounds
keywords = 2.5 # German keywords very specific
content = 1.0 # Standard weight
description = 1.2 # Meta descriptions useful
abstract = 1.3 # German abstracts well-structured
}
}
๐ซ๐ท๐ฌ๐ง๐ช๐ธ French/English/Spanish Sites
# Scheduler Configuration: minimumSimilarity = 0.2
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 1 # Advanced stemming available
proximityThreshold = 0.3 # Standard TF-IDF threshold
minTextLength = 50 # Standard minimum length
analyzedFields {
title = 1.5 # Titles important but less compound
keywords = 2.0 # Keywords still highly relevant
content = 1.0 # Main content
description = 1.0 # Meta descriptions
abstract = 1.1 # Abstracts helpful
}
}
๐ฎ๐น๐ต๐น๐ณ๐ฑ Other Languages (Italian, Portuguese, Dutch, etc.)
# Scheduler Configuration: minimumSimilarity = 0.25
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 0 # No advanced stemming, but TF-IDF still helps
proximityThreshold = 0.35 # Higher threshold (less precise without stemming)
minTextLength = 100 # Longer text needed for TF-IDF accuracy
analyzedFields {
title = 1.8 # Rely more on titles without stemming
keywords = 2.2 # Keywords become more important
content = 0.8 # Content less reliable without stemming
description = 1.0 # Standard weight
abstract = 1.0 # Standard weight
}
}
Note: Even without stemming, languages like Italian and Portuguese still get significant improvements from TF-IDF vectorization compared to basic word counting.
Configuration
๐ฏ NEW in v3.1: Storage vs Display Quality Configuration for clarity! Separate qualityLevel parameters for storage (Scheduler) and display (TypoScript) with clear explanations.
๐ Storage vs Display Configuration (v3.1+)
The configuration separates what gets stored (Scheduler) from what gets displayed (TypoScript) for maximum flexibility:
๐ฏ CONFIGURATION FLOW:
โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
โ Storage QualityLevelโโโโโถโ Storage: direct โโโโโถโ Display QualityLevelโโโโโถโ Display Filter โ
โ (Scheduler Task) โ โ (exact match) โ โ (TypoScript) โ โ (quality) โ
โ 0.3 โ stores โฅ0.3 โ โ โ โ 0.4 โ shows โฅ0.4 โ โ โ
โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
CONTROLS DATABASE SAVES SIMILARITIES CONTROLS FRONTEND USER SEES RESULTS
STORAGE EFFICIENCY FOR PRECISION DISPLAY QUALITY FILTERED SUGGESTIONS
Benefits:
- โ Clear separation of storage vs display logic
- โ Flexible filtering (display can be stricter than storage)
- โ Performance optimization (store broad, display selective)
- โ Backward compatibility with legacy configurations
- โ Self-explanatory values (higher = more selective)
Legacy Configuration Hierarchy (v3.0 and earlier)
โ ๏ธ DEPRECATED: This section documents the old system for reference. Use unified
qualityLevelinstead!
Click to view legacy configuration details
Understanding the configuration hierarchy is critical for proper setup. The extension uses a two-tier system where Scheduler settings always take precedence over TypoScript settings:
๐ CONFIGURATION FLOW:
โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
โ Scheduler โโโโโถโ Database โโโโโถโ TypoScript โโโโโถโ Frontend โ
โ Settings โ โ Storage โ โ Settings โ โ Display โ
โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ
LEVEL 1 LEVEL 2 LEVEL 3 LEVEL 4
(Analysis & (Persistent (Display & (User Sees
Storage) Data) Filtering) Results)
๐ฏ Level 1: Scheduler Configuration (Master Control)
- Controls: What gets analyzed and stored in database
- Authority: Absolute - cannot be overridden by TypoScript
- Key Settings:
startPageId: Defines analysis scopeexcludePages: Pages never analyzed (permanent exclusion)minimumSimilarity: Minimum score to store in databaserecursiveExclusion: How exclusions are applied
๐ Level 2: Database Storage (Persistent Data)
- Contains: Only similarities โฅ
minimumSimilarityfrom Scheduler - Source: Generated by Scheduler task execution
- Limitation: TypoScript cannot access data that was never stored
๐จ Level 3: TypoScript Configuration (Display Filter)
- Controls: What gets displayed from existing database data
- Limitation: Can only filter/limit existing data, cannot create new data
- Key Settings:
proximityThreshold: Minimum score to display (must be โฅ SchedulerminimumSimilarity)maxSuggestions: Limit displayed resultsexcludePages: Pages excluded from display only (still analyzed/stored)
๐๏ธ Level 4: Frontend Display (Final Output)
- Shows: Final filtered and limited results
- Source: Database data filtered by TypoScript settings
โ๏ธ Priority Rules
-
Scheduler ALWAYS wins:
Scheduler excludePages = "42,56" TypoScript excludePages = "" (empty) โ Pages 42,56 will NEVER appear (not analyzed at all) -
TypoScript cannot override Scheduler thresholds:
Scheduler minimumSimilarity = 0.5 TypoScript proximityThreshold = 0.3 โ Impossible! No similarity < 0.5 exists in database -
TypoScript can only be MORE restrictive:
Scheduler minimumSimilarity = 0.3 โ TypoScript proximityThreshold = 0.5 โ โ Works: Shows only similarities โฅ 0.5 from stored data โฅ 0.3
Scheduler Task Configuration
๐ง CRITICAL: This is the primary configuration that controls what gets analyzed and stored in the database. TypoScript cannot override these settings!
Create a "Semantic Suggestion: Generate Similarities" task in the TYPO3 Scheduler module with these settings:
โ ๏ธ WARNING: Choose your
minimumSimilaritycarefully - it cannot be lowered retroactively without re-running the entire analysis!
-
startPageId(required): The UID of the root page from which the analysis will begin. This defines the scope of the analysis for this task run. Each task execution is linked to a Start Page ID (stored asroot_page_idin the DB).- Example:
1(for site root page)
- Example:
-
excludePages(optional): Comma-separated list of page UIDs that will not be analyzed, and their similarities will not be stored.- Example:
42,56,78
- Example:
-
minimumSimilarity(required): Threshold (0.0 to 1.0) below which a pair of similar pages will not be saved to the database. This controls storage efficiency.- Example:
0.3(saves only pairs with similarity โฅ 30%)
- Example:
Recommended scheduling:
- Frequency: Daily or weekly
- Execution time: During off-peak hours (e.g., 2:00 AM)
TypoScript Settings
๐จ DISPLAY FILTER: These settings control the frontend display and the analysis algorithm details. They can only filter/limit what was already stored by the Scheduler.
โ CONSTRAINT:
proximityThresholdMUST be โฅ SchedulerminimumSimilarity(otherwise no suggestions will display)
Define them in your TypoScript Setup file under plugin.tx_semanticsuggestion_suggestions.settings.
Constants (constants.typoscript)
plugin.tx_semanticsuggestion_suggestions.settings {
# --- Frontend Display Settings ---
# โ ๏ธ IMPORTANT: Must be โฅ Scheduler minimumSimilarity
proximityThreshold = 0.25 # TF-IDF optimized threshold (was 0.5 in v2.x)
maxSuggestions = 3 # Maximum number of suggestions to display
excerptLength = 100 # Max length of the text excerpt
excludePages = # Pages to exclude from DISPLAY only (comma-separated UIDs)
# --- Analysis Algorithm Settings (Used by Scheduler task) ---
recencyWeight = 0.2 # Weight of recency in the final score (0.0 to 1.0)
# Fields analyzed and their weights (TF-IDF enhanced)
analyzedFields {
title = 1.5 # Page titles are highly relevant
description = 1.0 # Meta descriptions
keywords = 2.0 # Explicit keywords have high weight
abstract = 1.2 # Page abstracts/summaries
content = 1.0 # Main page content (can be noisy)
}
# --- NLP & Language Settings (v3.0+) ---
enableStemming = 1 # Enable advanced stemming (especially for German)
defaultLanguage = en # Fallback language
minTextLength = 50 # Minimum text length for analysis
confidenceThreshold = 0.3 # Language detection confidence threshold
# --- Debugging ---
debugMode = 0 # Enable debug logs (0 or 1)
}
Setup (setup.typoscript)
# Main plugin configuration
plugin.tx_semanticsuggestion_suggestions {
settings {
proximityThreshold = {$plugin.tx_semanticsuggestion_suggestions.settings.proximityThreshold}
maxSuggestions = {$plugin.tx_semanticsuggestion_suggestions.settings.maxSuggestions}
excludePages = {$plugin.tx_semanticsuggestion_suggestions.settings.excludePages}
excerptLength = {$plugin.tx_semanticsuggestion_suggestions.settings.excerptLength}
recencyWeight = {$plugin.tx_semanticsuggestion_suggestions.settings.recencyWeight}
debugMode = {$plugin.tx_semanticsuggestion_suggestions.settings.debugMode}
analyzedFields {
title = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.title}
description = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.description}
keywords = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.keywords}
abstract = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.abstract}
content = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.content}
}
}
view {
# Paths to your Fluid templates if you wish to customize them
templateRootPaths.10 = EXT:your_extension/Resources/Private/Templates/
partialRootPaths.10 = EXT:your_extension/Resources/Private/Partials/
layoutRootPaths.10 = EXT:your_extension/Resources/Private/Layouts/
}
}
# Reusable TypoScript object
lib.semantic_suggestion = USER
lib.semantic_suggestion {
userFunc = TYPO3\CMS\Extbase\Core\Bootstrap->run
extensionName = SemanticSuggestion
pluginName = Suggestions
vendorName = TalanHdf
view =< plugin.tx_semanticsuggestion_suggestions.view
persistence =< plugin.tx_semanticsuggestion_suggestions.persistence
settings =< plugin.tx_semanticsuggestion_suggestions.settings
}
# Content element integration
tt_content.list.20.semanticsuggestion_suggestions =< plugin.tx_semanticsuggestion_suggestions
NLP Configuration
Advanced Language & Text Processing Settings
plugin.tx_semanticsuggestion_suggestions {
settings {
# ๐ง NLP Features (NEW in v3.0)
enableStemming = 1 # Enable advanced stemming (German compound words)
defaultLanguage = en # Fallback language
minTextLength = 50 # Minimum text length for analysis
confidenceThreshold = 0.3 # Language detection confidence threshold
# ๐ Language Mapping (TYPO3 12/13)
languageMapping {
0 = en # Language UID 0 โ English
1 = fr # Language UID 1 โ French
2 = de # Language UID 2 โ German
3 = es # Language UID 3 โ Spanish
4 = it # Language UID 4 โ Italian
5 = pt # Language UID 5 โ Portuguese
}
# ๐ฉ๐ช German Language Optimization
proximityThreshold = 0.25 # Lower threshold for TF-IDF (more sensitive)
# ๐ TF-IDF Algorithm Settings
analyzedFields {
title = 1.5 # German titles often contain key compound words
keywords = 2.0 # German keywords are highly indicative
content = 1.0 # Base content weight
description = 1.0
abstract = 1.2
}
}
}
Multilingual Site Example Configuration
For a German/English bilingual site:
# constants.typoscript
plugin.tx_semanticsuggestion_suggestions.settings {
# ๐ฉ๐ช German language optimization
enableStemming = 1 # Critical for compound words
proximityThreshold = 0.25 # Lower threshold for German compound matching
confidenceThreshold = 0.3
# Language detection mapping
languageMapping {
0 = en # TYPO3 language UID 0 = English
1 = de # TYPO3 language UID 1 = German
}
# German-optimized field weights
analyzedFields {
title = 2.0 # German titles contain key compounds
keywords = 2.5 # German keywords are very specific
content = 1.0 # Standard content weight
}
}
# IMPORTANT: Create separate Scheduler tasks:
# Task 1: English content (startPageId = 1, minimumSimilarity = 0.3)
# Task 2: German content (startPageId = 10, minimumSimilarity = 0.25)
Configuration Interaction
- Analysis Scope: Defined by the Scheduler task's
startPageId. - DB Storage: Controlled by the Scheduler task's
minimumSimilarityandexcludePages. - Similarity Calculation: Performed by the
PageAnalysisService(called by the Scheduler task), which uses the TypoScript settingsanalyzedFieldsandrecencyWeight. - Frontend Display: Reads from the DB and filters/limits based on the TypoScript settings
proximityThreshold,maxSuggestions,excludePages. - Backend Display: Reads from the DB (based on the selected
root_page_id) and filters based on the TypoScriptproximityThreshold.
Key Points:
- The
proximityThreshold(TypoScript) cannot display suggestions with a score lower than theminimumSimilarity(Scheduler) because they were not saved. For the TypoScript setting to be effective, it must be โฅ the Scheduler threshold. - A page excluded in the Scheduler will never be analyzed/stored. A page excluded only in TypoScript will be analyzed/stored (if not excluded in Scheduler) but not displayed. It's often simpler to keep the
excludePageslists synchronized. - You can create multiple Scheduler tasks with different
startPageIdvalues to analyze different sections of the site.
Unified Configuration Setup
1. Scheduler Task Configuration (Simplified)
Create a "Semantic Suggestion: Generate Similarities" task with these settings:
startPageId(required): Root page UID for analysis scopequalityLevel(required): Quality threshold (0.1-1.0) for suggestions- Storage: Automatically set to
qualityLevel - 0.1(broad data collection) - Display: Uses
qualityLeveldirectly (quality suggestions to users)
- Storage: Automatically set to
excludePages(optional): Pages to exclude from both analysis and displayrecursiveExclusion(optional): Apply exclusions recursively
Recommended Quality Levels:
๐ฉ๐ช German sites: 0.25 (compound word optimization)
๐ Standard sites: 0.30 (balanced quality/quantity)
๐ Content sites: 0.35 (higher quality threshold)
๐ Premium sites: 0.40 (very selective suggestions)
2. TypoScript Configuration (Simplified)
plugin.tx_semanticsuggestion_suggestions {
settings {
# ๐ฏ UNIFIED QUALITY CONTROL
qualityLevel = 0.3 # Single parameter for all quality control
# Display Settings
maxSuggestions = 3 # Number of suggestions to show
excludePages = # Additional pages to exclude from display
excerptLength = 100 # Text excerpt length
# NLP Settings (v3.0+)
enableStemming = 1 # Enable advanced text processing
defaultLanguage = en # Fallback language
debugMode = 0 # Debug logging
}
}
Configuration Examples
Basic Setup (Most Common)
# Single quality level controls everything
qualityLevel = 0.3
# Result:
# - Storage threshold: 0.2 (collects broad range)
# - Display threshold: 0.3 (shows quality suggestions)
# - No configuration conflicts possible
German Site Optimization
# Lower threshold for German compound words
qualityLevel = 0.25
enableStemming = 1
# Result optimized for compound words like:
# "Automobil" โ "Automobilindustrie"
High-Quality Site
# Higher threshold for premium content
qualityLevel = 0.4
# Result:
# - Storage threshold: 0.3 (good range)
# - Display threshold: 0.4 (only excellent suggestions)
Migration from Legacy Configuration
Automatic Migration
The extension automatically migrates old configurations:
Legacy v3.0:
Scheduler: minimumSimilarity = 0.4
TypoScript: proximityThreshold = 0.5
Auto-migrated to v3.1:
qualityLevel = 0.5 (migrated from proximityThreshold)
Internal storage = 0.4 (preserved)
Manual Migration Guide
- Identify your old
proximityThresholdfrom TypoScript - Set
qualityLevelto that value - Remove old parameters:
- Delete
proximityThresholdfrom TypoScript - Delete
minimumSimilarityfrom Scheduler (auto-computed) - Merge duplicate
excludePageslists
- Delete
Example Migration:
# OLD v3.0 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
proximityThreshold = 0.35 # OLD: Display threshold
excludePages = 42,56,78 # OLD: Display exclusions only
}
# NEW v3.1 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
qualityLevel = 0.35 # NEW: Unified quality control
excludePages = 42,56,78 # NEW: Unified exclusions (analysis + display)
}
# Scheduler task OLD: minimumSimilarity = 0.25, excludePages = ""
# Scheduler task NEW: qualityLevel = 0.35 (automatically computes storage = 0.25)
Legacy Configuration Constraints & Common Pitfalls
Understanding these technical constraints will save you hours of debugging:
๐ซ Critical Constraints
1. Threshold Hierarchy Rule
โ WRONG Configuration:
Scheduler: minimumSimilarity = 0.7
TypoScript: proximityThreshold = 0.3
โ Result: NO suggestions displayed (none stored below 0.7)
โ
CORRECT Configuration:
Scheduler: minimumSimilarity = 0.3
TypoScript: proximityThreshold = 0.7
โ Result: Shows high-quality suggestions from broader stored data
Rule: proximityThreshold โฅ minimumSimilarity (or equal)
2. Exclusion Scope Difference
โ PROBLEMATIC:
Scheduler: excludePages = "" (empty)
TypoScript: excludePages = "42,56,78"
โ Result: Pages 42,56,78 are analyzed/stored but never displayed (wasted processing)
โ
EFFICIENT:
Scheduler: excludePages = "42,56,78"
TypoScript: excludePages = "" (empty or same list)
โ Result: Pages 42,56,78 are never processed (faster, cleaner)
3. TF-IDF Score Ranges (NEW in v3.0)
โ ๏ธ TF-IDF produces LOWER scores than v2.x cosine similarity:
v2.x typical range: 0.3-0.9
v3.0 TF-IDF range: 0.05-0.4
โ Legacy Configuration:
minimumSimilarity = 0.8 โ NO results with TF-IDF
โ
TF-IDF Optimized:
minimumSimilarity = 0.15 โ Good range for TF-IDF
proximityThreshold = 0.25 โ Quality suggestions
๐ Language-Specific Constraints
German Language (Compound Words)
๐ฉ๐ช German sites need LOWER thresholds due to compound word stemming:
โ Standard Configuration:
proximityThreshold = 0.5 โ Misses compound relationships
โ
German Optimized:
minimumSimilarity = 0.15
proximityThreshold = 0.25
enableStemming = 1
โ Captures "Automobil" โ "Automobilindustrie" relationships
Multi-language Sites
โ ๏ธ CONSTRAINT: Each language needs separate analysis
โ Single Task Configuration:
Task 1: startPageId = 1 (analyzing both English + German pages)
โ Result: Language mixing, poor similarity quality
โ
Multi-Task Configuration:
Task 1: startPageId = 1 (English root) - minimumSimilarity = 0.3
Task 2: startPageId = 10 (German root) - minimumSimilarity = 0.25
โ Result: Optimized per-language analysis
โก Performance Constraints
Large Site Thresholds
Sites with >500 pages:
โ Permissive Configuration:
minimumSimilarity = 0.05 โ Database bloat (millions of records)
maxSuggestions = 10 โ Slow frontend queries
โ
Performance Optimized:
minimumSimilarity = 0.25 โ Quality storage
proximityThreshold = 0.4 โ Fast display
maxSuggestions = 3 โ Quick queries
๐ง Validation Rules Checklist
Before deploying, verify these constraints:
- Threshold Check:
proximityThresholdโฅminimumSimilarity - Exclusion Sync: Scheduler
excludePagesincludes all TypoScript exclusions - TF-IDF Adjustment: Thresholds lowered from v2.x values (โค 0.4 typically)
- Language Separation: Each language has its own Scheduler task
- Performance Test: Task execution time acceptable for your server
- Storage Monitoring: Database table
tx_semanticsuggestion_similaritiessize reasonable
๐จ Warning Signs of Misconfiguration
Watch for these symptoms:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Zero suggestions displayed | proximityThreshold > all stored similarities |
Lower proximityThreshold or minimumSimilarity |
| Poor suggestion quality | Threshold too low | Raise proximityThreshold |
| Missing expected pages | Pages excluded in Scheduler | Check excludePages settings |
| Scheduler timeouts | Large site with low threshold | Raise minimumSimilarity |
| Mixed language results | Single task for multilingual site | Create per-language tasks |
Usage (Frontend)
Integrate the plugin into your Fluid templates to display suggestions:
<f:cObject typoscriptObjectPath='lib.semantic_suggestion' />
Or include it directly in TypoScript:
# Include on a page
page.10 =< lib.semantic_suggestion
# Or in a content element
lib.myContent = COA
lib.myContent {
10 =< lib.semantic_suggestion
}
The plugin will read relevant suggestions for the current page from the database, applying filters defined in the TypoScript settings (proximityThreshold, maxSuggestions, excludePages).
Backend Module
A backend module ("Semantic Suggestion" under "Web") allows visualizing the results of the analyses stored in the database.
Features
- Analysis Selection: Choose which analysis to view (based on the
startPageId/root_page_idof executed Scheduler tasks). - Detailed Statistics: Most similar pairs, score distribution, pages with the most links, language statistics.
- Configuration Overview: Reminder of the main parameters used (display threshold, etc.).
- Performance Metrics: Module load time, number of stored pairs for the selected analysis.
Scheduler Task
The "Semantic Suggestion: Generate Similarities" task is essential for the extension's operation.
- Role: Calculates similarities between pages (using
PageAnalysisService) and saves relevant results (above theminimumSimilaritythreshold) to thetx_semanticsuggestion_similaritiestable. - Configuration: Set the
startPageId,excludePages, andminimumSimilarityvia the Scheduler interface. - Frequency: Schedule its execution regularly (e.g., daily, weekly) during off-peak hours to keep suggestions up-to-date without impacting site performance.
Similarity Logic (TF-IDF Enhanced)
๐ Processing Pipeline
-
๐ฏ Language Detection (NEW in v3.0)
Text Input โ TYPO3 Site Context โ Content Analysis โ Language: "de" -
๐ Advanced Text Processing (via nlp_tools)
Raw Text โ Stop Words Removal โ Stemming (German) โ Clean Tokens Example: "Die deutschen Automobilhersteller" โ "deutsch automobilherstell" (stemmed) -
๐ข TF-IDF Vectorization (NEW in v3.0)
[Text1, Text2] โ TF-IDF Vectors โ Cosine Similarity Score Traditional: Simple word counting TF-IDF: Professional semantic analysis -
๐พ Smart Storage & Display
Similarity Score + Recency Boost โ Database โ Frontend Filter Scheduler threshold: 0.3 โ Display threshold: 0.5
๐ Algorithm Improvements
Before (v2.x): Basic Cosine Similarity
// Simple word frequency comparison $similarity = $dotProduct / ($magnitude1 * $magnitude2);
After (v3.0): TF-IDF + NLP
// Professional semantic analysis $tfidfResult = $this->textVectorizer->createTfIdfVectors([$text1, $text2], $language); $similarity = $this->textVectorizer->cosineSimilarity($vector1, $vector2); // + German stemming, stop word removal, confidence scoring
Performance Impact: 30-50% better accuracy, especially for German compound words.
Multilingual Sites Setup
๐ TYPO3 12/13 Site Configuration
The extension automatically detects languages from your TYPO3 site configuration:
# config/sites/main/config.yaml languages: - languageId: 0 title: 'English' hreflang: 'en-US' locale: 'en_US.UTF-8' # โ Auto-detected as "en" iso-639-1: 'en' - languageId: 1 title: 'Deutsch' hreflang: 'de-DE' locale: 'de_DE.UTF-8' # โ Auto-detected as "de" iso-639-1: 'de' - languageId: 2 title: 'Franรงais' hreflang: 'fr-FR' locale: 'fr_FR.UTF-8' # โ Auto-detected as "fr" iso-639-1: 'fr'
๐ฏ Per-Language Scheduler Tasks
For optimal performance, create separate Scheduler tasks for each language:
Task 1: "Generate Similarities - English"
- startPageId: 1 (English root)
- minimumSimilarity: 0.3
Task 2: "Generate Similarities - German"
- startPageId: 2 (German root)
- minimumSimilarity: 0.25 # Lower for German (compound words)
Task 3: "Generate Similarities - French"
- startPageId: 3 (French root)
- minimumSimilarity: 0.3
๐ง Language-Specific Tuning
German Language Optimization
plugin.tx_semanticsuggestion_suggestions.settings {
# German compound words need lower thresholds
proximityThreshold = 0.25
# Enable stemming for better compound word detection
enableStemming = 1
# German titles are often highly descriptive
analyzedFields {
title = 2.0 # Higher weight for German titles
keywords = 2.5 # German keywords are very specific
}
}
๐จ Troubleshooting Multilingual Sites
Problem: Language always detected as "en"
Solution: Check your TypoScript language mapping:
plugin.tx_semanticsuggestion_suggestions.settings {
languageMapping {
0 = en
1 = de # Make sure this matches your site language UID
2 = fr
}
}
Problem: Mixed language content not detected properly
Enable debug mode to see detection process:
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Check logs for:
Language detected via nlp_tools: de
TF-IDF similarity calculated: language=de, vocabularySize=156
Confidence verification: firstScore=0.85, secondScore=0.45, using content analysis
Display Customization
Modify the appearance of suggestions by overriding the plugin's Fluid template (List.html). Configure the paths to your custom templates in TypoScript (see Configuration section).
Debugging & Performance
๐ Debug Mode
Enable comprehensive logging to troubleshoot language detection and similarity calculation:
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Debug logs show:
[INFO] Language detected via nlp_tools: de
[DEBUG] TF-IDF similarity calculated: page1=123, page2=456, similarity=0.75, vocabularySize=245
[DEBUG] German stemming applied: "Automobilindustrie" โ "automobilindustr"
[DEBUG] Confidence verification: using TYPO3 context due to low confidence difference
[ERROR] Failed to create TF-IDF vectors: text too short, using fallback calculation
โก Performance Monitoring
Backend Module shows performance metrics:
- TF-IDF processing time per page pair
- Language detection accuracy
- Vocabulary size per language
- Cache hit rates for stemming/vectorization
Scheduler Task execution time:
- Small sites (<100 pages): ~10-30 seconds
- Medium sites (100-500 pages): ~1-5 minutes
- Large sites (>500 pages): ~5-30 minutes
๐ Performance Optimization
Recommended Settings for Large Sites (>500 pages)
# Scheduler Configuration: minimumSimilarity = 0.3 (storage optimization)
plugin.tx_semanticsuggestion_suggestions.settings {
# Performance optimizations
minTextLength = 100 # Skip short content (faster processing)
proximityThreshold = 0.4 # Higher display threshold (faster queries)
maxSuggestions = 3 # Limit results (faster rendering)
# Selective field analysis (reduce processing time)
analyzedFields {
title = 2.0 # Titles are fast to process
keywords = 2.0 # Keywords are lightweight
content = 0.5 # Reduce content weight (can be slow)
description = 1.0 # Meta descriptions are fast
abstract = 0 # Disable if not commonly used
}
# Conservative language settings
confidenceThreshold = 0.4 # Higher confidence reduces processing
enableStemming = 1 # Keep enabled (cached results)
}
# SCHEDULER OPTIMIZATION for large sites:
# Split into multiple tasks:
# Task 1: Pages 1-100 (daily)
# Task 2: Pages 101-200 (every 2 days)
# Task 3: Pages 201+ (weekly)
Cache Configuration
Ensure TYPO3 cache is properly configured for optimal performance:
# Clear cache after configuration changes ./vendor/bin/typo3 cache:flush # Monitor cache effectiveness ./vendor/bin/typo3 cache:listGroups
Configuration Troubleshooting
๐ Step-by-Step Diagnosis
When suggestions aren't working as expected, follow this systematic approach:
Step 1: Verify Scheduler Task Execution
# Check if task has run successfully ./vendor/bin/typo3 scheduler:run # Check TYPO3 logs for task errors tail -f var/log/typo3_*.log | grep -i semantic
Expected Output:
[INFO] Starting similarity generation task, startPageId: 1, minimumSimilarity: 0.3
[INFO] Similarity generation task completed successfully
Step 2: Verify Database Content
-- Check if similarities are stored SELECT COUNT(*) as total_similarities FROM tx_semanticsuggestion_similarities; -- Check score distribution SELECT ROUND(similarity_score, 1) as score_range, COUNT(*) as count, sys_language_uid FROM tx_semanticsuggestion_similarities GROUP BY ROUND(similarity_score, 1), sys_language_uid ORDER BY score_range DESC;
Healthy Output Example:
score_range | count | sys_language_uid
0.4 | 15 | 0
0.3 | 42 | 0
0.2 | 128 | 0
0.1 | 203 | 0
Step 3: Test Configuration Hierarchy
# Enable debug mode temporarily
plugin.tx_semanticsuggestion_suggestions.settings {
debugMode = 1
}
Check Debug Logs:
tail -f typo3temp/logs/semantic_suggestion.log
๐จ Common Problems & Solutions
Problem 1: "No suggestions displayed anywhere"
Symptoms:
- Frontend shows empty suggestions list
- Backend module shows 0 similar pairs
Diagnosis Checklist:
โ Scheduler task executed successfully?
โ Database contains similarities?
โ proximityThreshold โค stored similarities?
โ Current page has stored similarities?
Solutions:
-
Threshold Too High
โ Current: proximityThreshold = 0.8 โ Fix: proximityThreshold = 0.3 (or lower) # Or check what's actually in database: SELECT MAX(similarity_score) FROM tx_semanticsuggestion_similarities; -
Scheduler Never Ran
# Manual execution ./vendor/bin/typo3 scheduler:run <task_id> # Check task configuration SELECT * FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';
-
Wrong Root Page
โ Current: startPageId = 1, viewing page = 42 โ Fix: startPageId = 1, ensure page 42 is child of page 1 # Verify page tree relationship SELECT pid, title FROM pages WHERE uid = 42;
Problem 2: "Suggestions displayed but poor quality"
Symptoms:
- Suggestions shown but irrelevant
- Mixed languages in suggestions
- Very low similarity scores
Solutions:
-
TF-IDF Score Adjustment (v3.0+)
โ Legacy: proximityThreshold = 0.7 โ TF-IDF: proximityThreshold = 0.25 # TF-IDF scores are naturally lower -
Language Separation Required
โ Single task: Pages 1-100 (mixed EN/DE content) โ Multi-task: - Task 1: English pages (1-50) - Task 2: German pages (51-100) -
Field Weight Optimization
# Increase weight of reliable fields analyzedFields { title = 2.0 # Titles are usually accurate keywords = 2.5 # Keywords are intentional content = 0.5 # Content can be noisy }
Problem 3: "Missing expected pages in suggestions"
Symptoms:
- Page A should suggest Page B (they're clearly related)
- Page B exists in database but not suggested to Page A
Diagnosis:
-- Check if relationship exists in database SELECT similarity_score FROM tx_semanticsuggestion_similarities WHERE page_id = A AND similar_page_id = B; -- Check if excluded somewhere SELECT exclude_pages FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';
Solutions:
-
Page Excluded in Scheduler
โ Scheduler excludePages = "42,56,78" (contains Page B) โ Remove Page B from exclusions, re-run Scheduler -
Similarity Below Threshold
-- Find actual similarity score SELECT similarity_score FROM tx_semanticsuggestion_similarities WHERE page_id = A AND similar_page_id = B; -- If score = 0.22 but proximityThreshold = 0.3 -- Lower the threshold or improve content similarity
-
Text Content Insufficient
Check if Page A or B has minimal text content: - Minimum 50 characters required - Pure image pages won't generate similarities - Check 'minTextLength' setting
Problem 4: "Scheduler task timeouts or fails"
Symptoms:
- Task shows "Failed" status
- PHP timeout errors in logs
- Task takes >5 minutes
Solutions:
-
Increase Processing Limits
# In Scheduler task or php.ini ini_set('max_execution_time', 300); // 5 minutes ini_set('memory_limit', '512M');
-
Reduce Analysis Scope
โ Current: startPageId = 1 (1000+ pages) โ Split: - Task 1: startPageId = 1 (pages 1-100) - Task 2: startPageId = 101 (pages 101-200) -
Increase Threshold
โ Current: minimumSimilarity = 0.05 (stores everything) โ Optimized: minimumSimilarity = 0.25 (quality only)
Problem 5: "Mixed language suggestions"
Symptoms:
- English page suggests German pages
- Suggestions ignore language boundaries
Solutions:
-
Enable Language Mapping
plugin.tx_semanticsuggestion_suggestions.settings { languageMapping { 0 = en 1 = de 2 = fr } } -
Check Site Configuration
# site/config.yaml should have proper locales languages: - languageId: 0 locale: 'en_US.UTF-8' # โ Must be properly formatted - languageId: 1 locale: 'de_DE.UTF-8' # โ Must be properly formatted
-
Separate Scheduler Tasks
Instead of: One task analyzing mixed language tree Use: One task per language branch
๐ง Quick Fix Commands
# Clear all caches ./vendor/bin/typo3 cache:flush # Re-run all scheduler tasks ./vendor/bin/typo3 scheduler:run # Check database table size echo "SELECT COUNT(*) FROM tx_semanticsuggestion_similarities;" | mysql your_db # Reset extension configuration (emergency) ./vendor/bin/typo3 configuration:remove --path="EXTENSIONS/semantic_suggestion" ./vendor/bin/typo3 extension:deactivate semantic_suggestion ./vendor/bin/typo3 extension:activate semantic_suggestion
๐ Pre-Deployment Checklist
Before going live, verify:
- Scheduler Tasks: All tasks run successfully without timeouts
- Database Check:
tx_semanticsuggestion_similaritiescontains expected data - Threshold Validation:
proximityThresholdโฅminimumSimilarity - Language Testing: Each language shows appropriate suggestions
- Performance Test: Frontend loads suggestions in <200ms
- Content Quality: Manual review of suggestion relevance
- Exclusion Review: All intentionally excluded pages work correctly
๐ก๏ธ Final Configuration Validation Checklist
Use this comprehensive checklist to ensure your configuration is optimal:
โ Scheduler Configuration Validation
- startPageId exists and is accessible:
SELECT title FROM pages WHERE uid = [startPageId]; - minimumSimilarity appropriate for TF-IDF: Between 0.1 and 0.4 (not v2.x values like 0.8)
- excludePages list verified: All UIDs exist and are intentionally excluded
- Task execution successful: Check task history and logs for errors
- Multilingual separation: Each language has its own task (recommended)
โ TypoScript Configuration Validation
- Threshold hierarchy respected:
proximityThreshold โฅ minimumSimilarity - TF-IDF thresholds updated: Not using v2.x legacy values (>0.5)
- Language settings match site:
languageMappingcorresponds to TYPO3 language UIDs - Field weights optimized: Higher weights for title/keywords, lower for content
- Performance settings:
maxSuggestionsandminTextLengthappropriate for site size
โ Database & Storage Validation
-- Verify data exists and has reasonable distribution SELECT sys_language_uid, MIN(similarity_score) as min_score, MAX(similarity_score) as max_score, AVG(similarity_score) as avg_score, COUNT(*) as total_pairs FROM tx_semanticsuggestion_similarities GROUP BY sys_language_uid; -- Check for unexpected language mixing SELECT DISTINCT root_page_id, sys_language_uid, COUNT(*) as pairs FROM tx_semanticsuggestion_similarities GROUP BY root_page_id, sys_language_uid;
โ Frontend Integration Validation
- Plugin displays correctly:
<f:cObject typoscriptObjectPath='lib.semantic_suggestion' />works - Suggestions appear on pages: Test on multiple pages with different content
- Language separation working: German pages don't show English suggestions
- Exclusions effective: Excluded pages don't appear in suggestions
- Performance acceptable: Page load time impact <100ms
โ NLP & Language Detection Validation
- nlp_tools dependency installed:
composer show cywolf/nlp-tools - Stemming working (German sites): Debug logs show stemmed words
- Language detection accurate: Pages analyzed in correct language
- Site configuration proper: Locales formatted as
de_DE.UTF-8(not justde)
๐จ Red Flags to Watch For
| Warning Sign | Quick Test | Solution |
|---|---|---|
| Zero suggestions anywhere | SELECT COUNT(*) FROM tx_semanticsuggestion_similarities; |
Lower thresholds or check task |
| Very low similarity scores (all <0.1) | Check debug logs for TF-IDF failures | Verify text length and language detection |
| Mixed language results | Test DE page shows EN suggestions | Separate scheduler tasks |
| Poor suggestion quality | Manual review of top suggestions | Adjust field weights or thresholds |
| Slow performance | Frontend timing >500ms | Increase thresholds or reduce maxSuggestions |
๐ฏ Configuration Quality Score
Rate your setup (aim for 80%+ before going live):
Basic Setup (50 points)
- Scheduler task runs (20 pts)
- Database contains data (15 pts)
- Frontend shows suggestions (15 pts)
Optimization (30 points)
- TF-IDF thresholds optimized (10 pts)
- Language separation implemented (10 pts)
- Performance <200ms (10 pts)
Advanced Features (20 points)
- German stemming active (5 pts)
- Debug logging configured (5 pts)
- Exclusions properly managed (5 pts)
- Field weights customized (5 pts)
Score: ___ / 100
Target: 80+ for production deployment Minimum: 60+ for staging/testing
Migration from v2.x
๐ Automatic Migration
The extension includes automatic fallback mechanisms:
- TF-IDF Processing: Falls back to old cosine similarity if nlp_tools fails
- Language Detection: Falls back to TypoScript mapping if site detection fails
- Text Processing: Falls back to basic stop word removal if stemming fails
๐ Migration Checklist
-
Install nlp_tools dependency:
composer require cywolf/nlp-tools
-
Update TypoScript configuration (add new NLP settings):
plugin.tx_semanticsuggestion_suggestions.settings { enableStemming = 1 languageMapping { 0 = en 1 = de # ... your language mapping } } -
Clear TYPO3 cache:
./vendor/bin/typo3 cache:flush
-
Regenerate similarities (run Scheduler task):
- All existing similarities will be recalculated with TF-IDF
- German content should see immediate improvement
- Check debug logs to verify language detection
-
Monitor performance:
- Check backend module for accuracy improvements
- Compare similarity scores before/after migration
- Verify multilingual sites work correctly
๐ Rollback Plan
If you experience issues, you can disable advanced features:
plugin.tx_semanticsuggestion_suggestions.settings {
enableStemming = 0 # Disable stemming
debugMode = 1 # Enable debug logging
minTextLength = 200 # Increase minimum text length
}
The extension will automatically fall back to v2.x behavior while maintaining database compatibility.
Contributing
Contributions are welcome! Fork the repository, create a branch, make your changes, and submit a Pull Request.
License
This project is licensed under the GNU General Public License v2.0 or later. See the LICENSE file.
Support
Contact: Wolfangel Cyril (cyril.wolfangel@gmail.com) Bugs & Features: GitHub Issues Documentation & Updates: GitHub Repository
