onoi / tesa
A simple library to sanitize text elements
Installs: 252 752
Dependents: 1
Suggesters: 0
Security: 0
Stars: 3
Watchers: 2
Forks: 2
Open Issues: 1
pkg:composer/onoi/tesa
Requires
- php: >=5.3.2
- ext-mbstring: *
- wikimedia/cdb: ~1.0
- wikimedia/textcat: ~1.1
This package is auto-updated.
Last update: 2025-10-29 01:51:00 UTC
README
The library contains a small collection of helper classes that support the sanitization of text or string elements of arbitrary length, with the aim of improving search match confidence during query execution. It is required by the Semantic MediaWiki project but is deployed independently.
Requirements
- PHP 5.3 / HHVM 3.5 or later
- Recommended to enable the ICU extension
Installation
The recommended way to install this library is by adding the following dependency to your composer.json:
{
	"require": {
		"onoi/tesa": "~0.1"
	}
}
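Alternatively, the same version constraint can be added from the command line, assuming Composer is available on your PATH (quoting the constraint so the shell does not interpret the tilde):

```shell
composer require "onoi/tesa:~0.1"
```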
Usage
use Onoi\Tesa\SanitizerFactory;
use Onoi\Tesa\Transliterator;
use Onoi\Tesa\Sanitizer;

$sanitizerFactory = new SanitizerFactory();

$sanitizer = $sanitizerFactory->newSanitizer( 'A string that contains ...' );
$sanitizer->reduceLengthTo( 200 );
$sanitizer->toLowercase();

$sanitizer->replace(
	array( "'", "http://", "https://", "mailto:", "tel:" ),
	array( '' )
);

$sanitizer->setOption( Sanitizer::MIN_LENGTH, 4 );
$sanitizer->setOption( Sanitizer::WHITELIST, array( 'that' ) );
$sanitizer->applyTransliteration( Transliterator::DIACRITICS | Transliterator::GREEK );

$text = $sanitizer->sanitizeWith(
	$sanitizerFactory->newGenericTokenizer(),
	$sanitizerFactory->newNullStopwordAnalyzer(),
	$sanitizerFactory->newNullSynonymizer()
);
- SanitizerFactory is expected to be the sole entry point for services and instances when used outside of this library
- IcuWordBoundaryTokenizer is the preferred tokenizer in case the ICU extension is available
- NGramTokenizer is provided to increase CJK match confidence in case the back-end does not provide an explicit ngram tokenizer
- StopwordAnalyzer together with a LanguageDetector is provided as a means to reduce ambiguity of frequent "noise" words in a possible search index
- Synonymizer currently only provides an interface
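As a sketch of the tokenizer preference described above, a caller might select the ICU-based tokenizer only when the extension is loaded. Note that the factory method name newIcuWordBoundaryTokenizer is an assumption here (only newGenericTokenizer appears in the usage example); check the SanitizerFactory source for the actual method names.

```php
<?php
use Onoi\Tesa\SanitizerFactory;

$factory = new SanitizerFactory();
$sanitizer = $factory->newSanitizer( 'Some text to be tokenized ...' );

// Hypothetical: prefer the ICU word-boundary tokenizer when the intl/ICU
// extension is available, otherwise fall back to the generic tokenizer.
$tokenizer = extension_loaded( 'intl' )
	? $factory->newIcuWordBoundaryTokenizer()
	: $factory->newGenericTokenizer();

$text = $sanitizer->sanitizeWith(
	$tokenizer,
	$factory->newNullStopwordAnalyzer(),
	$factory->newNullSynonymizer()
);
```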
Contribution and support
If you want to contribute to the project, please subscribe to the developers mailing list and have a look at the contribution guidelines. A list of people who have made contributions in the past can be found here.
Tests
The library provides unit tests that cover the core functionality and are normally run by the
continuous integration platform. Tests can also be executed manually using the
composer phpunit command from the root directory.
Release notes
- 0.1.0 Initial release (2016-08-07)
- Added SanitizerFactory with support for a Tokenizer, LanguageDetector, Synonymizer, and StopwordAnalyzer interface
Acknowledgments
- The Transliterator uses the same diacritics conversion table as http://jsperf.com/latinize (except for the German diaereses ä, ü, and ö)
- The stopwords used by the StopwordAnalyzer have been collected from different sources; each json file identifies its origin
- CdbStopwordAnalyzer relies on wikimedia/cdb to avoid using an external database or cache layer (with extra stopwords being available here)
- JaTinySegmenterTokenizer is based on the work of Taku Kudo and his tiny_segmenter.js
- TextCatLanguageDetector uses the wikimedia/textcat library to make predictions about a language