loupe / matcher
Tokenize, decompose, highlight and crop around text and search terms
Requires
- php: ^8.1
- ext-intl: *
- ext-mbstring: *
- ext-zlib: *
- toflar/fast-set: ^1.0
Requires (Dev)
- phpbench/phpbench: ^1.2
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^10.5 || ^11.5.46
- symfony/console: ^6.4 || ^7.4 || ^8.0
- symfony/filesystem: ^6.4 || ^7.4 || ^8.0
- symfony/finder: ^6.4 || ^7.4 || ^8.0
- symfony/http-client: ^6.4 || ^7.4 || ^8.0
- symfony/var-dumper: ^6.4 || ^7.4 || ^8.0
- symplify/easy-coding-standard: ^12.5
This package is auto-updated.
Last update: 2026-06-05 13:31:36 UTC
README
Loupe Matcher turns plain search queries and arbitrary text into precise, user-friendly matches: tokenize phrases and negations, normalize language-specific spelling variants, decompose compound words, calculate match spans, and format the result with highlighting, cropping, and truncation.
Installation
composer require loupe/matcher
Quick Start
Here's a simple example of how to use Loupe Matcher to highlight search terms in a text document and crop around the highlights:
use Loupe\Matcher\Tokenizer\LocaleConfiguration\English;
use Loupe\Matcher\Tokenizer\Tokenizer;
use Loupe\Matcher\Matcher;
use Loupe\Matcher\Formatter;
use Loupe\Matcher\FormatterOptions;

$tokenizer = new Tokenizer(new English());
$matcher = new Matcher($tokenizer);
$formatter = new Formatter($matcher);

$options = (new FormatterOptions())
    ->withEnableHighlight()
    ->withEnableCrop()
    ->withCropLength(20);

$result = $formatter->format(
    'I always take my toothbrush with me for holidays',
    'brush',
    $options
);

// "…take my <em>toothbrush</em> with…"
Core Components
Tokenizer
Purpose: Breaks text into searchable tokens (words, phrases, terms) for accurate matching.
The Tokenizer converts strings into TokenCollection objects, handling:
- Word boundaries using ext-intl rules
- Phrase groups (quoted terms like "exact phrase")
- Negated terms (prefixed with -)
- Locale-specific tokenization
- Locale-specific term decomposition
$localeConfiguration = null; // Must implement the `LocaleConfigurationInterface`.
$tokenizer = new Tokenizer($localeConfiguration); // Optional locale configuration

$tokens = $tokenizer->tokenize('search for "exact phrase" -exclude');

$tokens->all(); // All tokens
$tokens->phraseGroups(); // Quoted phrases only
$tokens->allNegated(); // Terms to exclude
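To make the query syntax above concrete, here is a minimal standalone sketch of how a query could be split into plain terms, quoted phrases, and negated terms. It does not use the library: `parseQuery()` and its return shape are hypothetical names chosen for illustration, and the real Tokenizer relies on ext-intl word boundaries rather than this regular expression.

```php
<?php
// Hypothetical helper for illustration; not part of Loupe Matcher's API.
// Splits a query into plain terms, quoted phrases, and negated terms.
function parseQuery(string $query): array
{
    $result = ['terms' => [], 'phrases' => [], 'negated' => []];

    // An optional "-" prefix, followed by a quoted phrase or a bare word.
    preg_match_all('/(-?)(?:"([^"]+)"|([^\s"]+))/u', $query, $matches, PREG_SET_ORDER);

    foreach ($matches as $match) {
        $phrase = $match[2] ?? '';
        $value = $phrase !== '' ? $phrase : ($match[3] ?? '');

        if ($match[1] === '-') {
            $result['negated'][] = $value;
        } elseif ($phrase !== '') {
            $result['phrases'][] = $value;
        } else {
            $result['terms'][] = $value;
        }
    }

    return $result;
}

$parsed = parseQuery('search for "exact phrase" -exclude');
// $parsed['terms']   is ['search', 'for']
// $parsed['phrases'] is ['exact phrase']
// $parsed['negated'] is ['exclude']
```

The sketch only separates the three kinds of input; it performs no normalization or decomposition.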
If you want to configure the way the Tokenizer handles locale specifics (such as decomposition or normalization), you
can provide your own implementation of the LocaleConfigurationInterface or use any of the pre-built configurations shipped
with this library. There are currently the following:
- English: Handles decomposition (toothbrush -> tooth, brush)
- German: Handles normalization of German umlauts as well as ß and also decomposition (Zeitungspapier -> zeitung, papier)
Check out the separate docs on decomposition if you want to improve the existing locale configurations or add support for a new one!
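As a rough illustration of what decomposition means, the following standalone sketch splits a compound into two dictionary words, optionally skipping a linking element such as the German "s". The function name `decompose()`, the tiny dictionaries, and the greedy longest-head strategy are all assumptions made for this example; the library's own algorithm is described in its decomposition docs and may work differently.

```php
<?php
// Hypothetical sketch for illustration; not Loupe Matcher's algorithm.
// Splits a compound into two known words, allowing an optional linking element.
function decompose(string $term, array $dictionary, array $linkers = ['']): array
{
    $term = strtolower($term); // ASCII-only lowercasing, enough for this sketch
    $known = array_flip($dictionary); // Constant-time lookups

    // Try the longest possible head first.
    for ($split = strlen($term) - 1; $split >= 1; $split--) {
        $head = substr($term, 0, $split);
        if (!isset($known[$head])) {
            continue;
        }

        $rest = substr($term, $split);
        foreach ($linkers as $linker) {
            if (!str_starts_with($rest, $linker)) {
                continue;
            }
            $tail = substr($rest, strlen($linker));
            if (isset($known[$tail])) {
                return [$head, $tail];
            }
        }
    }

    return [$term]; // No known split: keep the term as-is
}

$english = decompose('toothbrush', ['tooth', 'brush']);
// ['tooth', 'brush']

$german = decompose('Zeitungspapier', ['zeitung', 'papier'], ['', 's']);
// ['zeitung', 'papier']
```

A term without a known split is returned unchanged, so it can still be matched as a whole word.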
Matcher
Purpose: Finds which tokens in your text match the search query.
The Matcher compares tokenized text against search terms, with support for:
- Stop word filtering (ignore common words like "the", "and")
- Match span calculation (start/end positions)
- Flexible matching between token collections
$matcher = new Matcher($tokenizer, ['the', 'and', 'or']); // Stop words

$text = 'Text to search';
$query = 'search query';

$matches = $matcher->calculateMatches($text, $query);

// Get position information for highlighting
$spans = $matcher->calculateMatchSpans($text, $query, $matches);

foreach ($spans as $span) {
    echo "Match at position {$span->getStartPosition()}-{$span->getEndPosition()}" . PHP_EOL;
}
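The two ideas above, stop word filtering and match spans, can be sketched without the library as follows. `findMatchSpans()` is a hypothetical function, the spans are plain arrays rather than the library's span objects, and offsets are byte offsets with an exclusive end; none of this is guaranteed to mirror the Matcher's behaviour.

```php
<?php
// Hypothetical sketch for illustration; not Loupe Matcher's implementation.
// Returns [start, end) byte offsets of query terms in $text, ignoring stop words.
function findMatchSpans(string $text, string $query, array $stopWords = []): array
{
    // Query terms, lowercased (ASCII only), minus the stop words.
    $terms = array_diff(
        preg_split('/[^\p{L}\p{N}]+/u', strtolower($query), -1, PREG_SPLIT_NO_EMPTY),
        array_map('strtolower', $stopWords)
    );
    $lookup = array_flip($terms);

    // Every word of the text together with its byte offset.
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $words, PREG_OFFSET_CAPTURE);

    $spans = [];
    foreach ($words[0] as [$word, $offset]) {
        if (isset($lookup[strtolower($word)])) {
            $spans[] = ['start' => $offset, 'end' => $offset + strlen($word)];
        }
    }

    return $spans;
}

$spans = findMatchSpans('The brush and the paint', 'the brush', ['the', 'and']);
// [['start' => 4, 'end' => 9]]
```

Without the stop word list, both occurrences of "the" would also be reported as matches.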
Formatter
Purpose: Combines matching and highlighting to create formatted output with context.
The Formatter orchestrates the entire process:
- Highlights matched terms with HTML tags
- Crops text to show relevant context around matches
- Truncates long text while preserving word boundaries and highlights
- Configurable through FormatterOptions
$formatter = new Formatter($matcher);

$options = (new FormatterOptions())
    ->withEnableHighlight()
    ->withHighlightStartTag('<mark>')
    ->withHighlightEndTag('</mark>')
    ->withEnableCrop()
    ->withCropLength(150)
    ->withCropMarker(' ... ')
    ->withEnableTruncation()
    ->withTruncationLength(200)
    ->withTruncationMarker('...')
    ->withEnableMatchPrioritization();

$result = $formatter->format($text, $query, $options);

echo $result->getFormattedText();
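To show what the highlighting step amounts to, here is a small standalone sketch that wraps match spans in tags. `highlightSpans()` and the array-based span format (`start` and `end` as byte offsets, end exclusive) are assumptions for this example, and the sketch expects spans that do not overlap; the Formatter itself handles this for you.

```php
<?php
// Hypothetical sketch for illustration; not Loupe Matcher's implementation.
// Wraps each non-overlapping [start, end) byte span in highlight tags.
function highlightSpans(
    string $text,
    array $spans,
    string $startTag = '<em>',
    string $endTag = '</em>'
): string {
    // Work from the last span to the first so earlier offsets stay valid.
    usort($spans, static fn (array $a, array $b): int => $b['start'] <=> $a['start']);

    foreach ($spans as $span) {
        $length = $span['end'] - $span['start'];
        $matched = substr($text, $span['start'], $length);
        $text = substr_replace($text, $startTag . $matched . $endTag, $span['start'], $length);
    }

    return $text;
}

echo highlightSpans('take my toothbrush with me', [['start' => 8, 'end' => 18]]);
// take my <em>toothbrush</em> with me
```

Replacing from the end of the text backwards avoids having to shift the remaining offsets after each inserted tag.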
Match Prioritization
By default, cropping emits snippets around every match cluster and truncation cuts from the start. Enabling withEnableMatchPrioritization() will attempt to choose the most relevant window(s) for display. Windows are scored by distinct query terms hit, then total matches, then density.
- Cropping finds the best windows around matches, limits each window to crop_length, and shows up to crop_max_fragments windows in document order.
- Truncation picks a single window centered on the best cluster of matches and falls back to truncating from the start if no matches are found in the text.
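The scoring order described above (distinct query terms, then total matches, then density) can be sketched as follows. This is only an illustration under stated assumptions: `bestWindow()` and the match array shape are hypothetical, matches are expected to be sorted by start offset and not to overlap, and the returned window is the tight span of its matches without any surrounding context.

```php
<?php
// Hypothetical sketch for illustration; not Loupe Matcher's implementation.
// Each match is ['term' => string, 'start' => int, 'end' => int], sorted by 'start'.
function bestWindow(array $matches, int $windowLength): ?array
{
    $matches = array_values($matches);
    $count = count($matches);
    $best = null;
    $bestScore = null;

    for ($i = 0; $i < $count; $i++) {
        $windowStart = $matches[$i]['start'];
        $inWindow = [];

        // Collect every match that still fits into a window starting here.
        for ($j = $i; $j < $count; $j++) {
            if ($matches[$j]['end'] - $windowStart > $windowLength) {
                break;
            }
            $inWindow[] = $matches[$j];
        }

        if ($inWindow === []) {
            continue; // This match alone is longer than the window.
        }

        $windowEnd = $inWindow[count($inWindow) - 1]['end'];
        $score = [
            count(array_unique(array_column($inWindow, 'term'))), // Distinct terms
            count($inWindow),                                      // Total matches
            count($inWindow) / max(1, $windowEnd - $windowStart),  // Density
        ];

        // Arrays of equal size compare element by element, so earlier criteria win.
        if ($bestScore === null || ($score <=> $bestScore) > 0) {
            $bestScore = $score;
            $best = ['start' => $windowStart, 'end' => $windowEnd];
        }
    }

    return $best;
}

$window = bestWindow([
    ['term' => 'brush', 'start' => 0, 'end' => 5],
    ['term' => 'brush', 'start' => 10, 'end' => 15],
    ['term' => 'tooth', 'start' => 100, 'end' => 105],
    ['term' => 'brush', 'start' => 110, 'end' => 115],
], 20);
// ['start' => 100, 'end' => 115]: two distinct terms beat two hits of the same term
```

In this example both candidate clusters contain two matches, so the number of distinct terms decides the outcome.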
Advanced Usage
Custom Tokenizer
Implement TokenizerInterface for specialized tokenization:
class CustomTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): TokenCollection
    {
        // Your custom tokenization logic
    }

    public function matches(Token $token, TokenCollection $tokens): bool
    {
        // Your custom logic for checking if a token is a match
    }
}
Pre-highlighted Text Cropping
When you already have highlighted text that needs cropping:
$cropper = new \Loupe\Matcher\Formatting\Cropper(
    cropLength: 50,
    cropMarker: '…',
    highlightStartTag: '<em>',
    highlightEndTag: '</em>'
);

// "...text with <em>highlighted</em> terms."
echo $cropper->cropHighlightedText('Long text with <em>highlighted</em> terms.');
Using Pre-calculated Matches
When you already have a TokenCollection of matches (e.g., from a previous search operation or external source), you can format text directly without re-calculating matches. This approach is useful when your search engine already provides match information or you want to cache match results for performance.
// Assume you already have matches from somewhere else
$existingMatches = new TokenCollection(/* ... */);

// Set up the tokenizer, matcher, and formatter as usual
$tokenizer = new Tokenizer();
$matcher = new Matcher($tokenizer);
$formatter = new Formatter($matcher);

$options = (new FormatterOptions())
    ->withEnableHighlight()
    ->withEnableCrop()
    ->withCropLength(100);

// Format using the existing matches - no duplicate processing
$result = $formatter->format($text, $query, $options, matches: $existingMatches);

echo $result->getFormattedText();