linguistic / ngramextractor
Extracts n-grams from a given text and performs linguistic preprocessing such as stopword removal.
Requires (Dev)
- phpunit/phpunit: ^6.4
Last update: 2025-08-31 09:55:31 UTC
README
Installation
Simple install via Composer:
composer require linguistic/ngramextractor
Usage
Coming soon.
Example
$tokenizer = new Tokenizer();
$tokenizer->addRemovalRule('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/') # Removes HTML tags
    ->addRemovalRule('/[^a-z0-9]+/', ' ') # Replaces every run of characters other than a-z and 0-9 with a space
    ->setSeperator('/\s+/'); # Splits the text into tokens on whitespace

$content = ""; # The text to tokenize
$stopwords = array(); # (Optional) array of stopwords
$extractor = new NGramExtractor($content, $tokenizer, $stopwords);
$unigrams = $extractor->getNGrams(1); # Gets all unigrams (n-grams with n = 1) in the text
$unigramsFiltered = NGramExtractor::limitByOccurance($extractor->getNGramCount(1, true), 3); # Gets unigrams with their occurrence counts, keeping only those that occur at least 3 times
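To make the steps in the example more concrete, here is a minimal plain-PHP sketch of the same idea: normalize the text, drop stopwords, build n-grams, and keep only the frequent ones. It does not use the package and is not its implementation. The function names `sketchTokenize`, `sketchNGrams`, and `sketchLimitByCount` are hypothetical, and lowercasing the input, replacing tags with a space, and removing stopwords before the n-grams are built are all assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of linguistic/ngramextractor.
 * Applies the three regular expressions from the example above and
 * removes stopwords from the resulting token list.
 */
function sketchTokenize(string $text, array $stopwords = []): array
{
    // Assumption: lowercase first so the a-z rule does not discard capitals.
    $text = strtolower($text);
    // Assumption: tags are replaced by a space so neighbouring words stay apart.
    $text = (string) preg_replace('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/', ' ', $text);
    $text = (string) preg_replace('/[^a-z0-9]+/', ' ', $text);
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    // Assumption: stopwords are dropped before n-grams are built.
    return array_values(array_diff($tokens ?: [], $stopwords));
}

/** Hypothetical helper: slides a window of $n tokens over the list. */
function sketchNGrams(array $tokens, int $n): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

/** Hypothetical helper: keeps entries whose count is at least $min. */
function sketchLimitByCount(array $counts, int $min): array
{
    return array_filter($counts, function ($count) use ($min) {
        return $count >= $min;
    });
}

$tokens   = sketchTokenize('<p>The cat sat. The cat ran!</p>', ['the']);
$bigrams  = sketchNGrams($tokens, 2);                      // ['cat sat', 'sat cat', 'cat ran']
$counts   = array_count_values(sketchNGrams($tokens, 1));  // ['cat' => 2, 'sat' => 1, 'ran' => 1]
$frequent = sketchLimitByCount($counts, 2);                // ['cat' => 2]
```

The sketch keeps n-grams as space-joined strings so that `array_count_values` can count them directly. How the package itself represents n-grams and counts may differ.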