linguistic/ngramextractor

Extracts n-grams from a given text and performs linguistic preprocessing such as stopword removal.

Package info

github.com/linguistic-dev/n-gram-extractor

pkg:composer/linguistic/ngramextractor

dev-master 2017-12-05 23:09 UTC

This package is not auto-updated.

Last update: 2026-03-29 12:31:12 UTC


README

Installation

Simple install via Composer:

composer require linguistic/ngramextractor

Usage

Coming soon.

Example

$tokenizer = new Tokenizer();
$tokenizer->addRemovalRule('/<\/?\w+[\s\w\=\"\/\#\-\:\.\_]*>/') # Removes HTML tags
    ->addRemovalRule('/[^a-z0-9]+/', ' ') # Replaces each run of characters other than a-z and 0-9 with a space
    ->setSeperator('/\s+/'); # Splits the text into tokens on whitespace
$content = ""; # The text to tokenize
$stopwords = array(); # (optional) array of stopwords

$extractor = new NGramExtractor($content, $tokenizer, $stopwords);
$unigrams = $extractor->getNGrams(1); # Gets all n-grams in the text for n = 1

# Gets the unigrams together with their counts, keeping only those that occur at least 3 times
$unigramsFiltered = NGramExtractor::limitByOccurance($extractor->getNGramCount(1, true), 3);
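To make the steps in the example concrete, here is a minimal sketch in plain PHP (7.4 or later) of the same pipeline: clean the text, split it into tokens, drop stopwords, build n-grams, count them, and filter by count. It does not use the package's classes and is not its implementation. The function names sketchTokenize, sketchNGrams and sketchLimitByCount are hypothetical, the HTML pattern is simplified, and lowercasing the input and removing stopwords before building the n-grams are assumptions made for illustration.

```php
<?php
// Illustrative sketch only: plain PHP, independent of the package's classes.
// All function names below are hypothetical.

/** Lowercases the text, strips HTML tags and non-alphanumerics, splits on whitespace. */
function sketchTokenize(string $text): array
{
    $text = strtolower($text);                          // assumption: case-insensitive matching
    $text = preg_replace('/<\/?\w+[^>]*>/', ' ', $text); // replace HTML tags with a space
    $text = preg_replace('/[^a-z0-9]+/', ' ', $text);    // everything except a-z and 0-9 becomes a space
    return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/** Returns every run of $n consecutive tokens (n >= 1), joined by a single space. */
function sketchNGrams(array $tokens, int $n): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

/** Keeps only the n-grams whose count is at least $min. */
function sketchLimitByCount(array $counts, int $min): array
{
    return array_filter($counts, fn (int $count): bool => $count >= $min);
}

$content   = '<p>The cat sat. The cat ran, and the dog sat.</p>';
$stopwords = ['the', 'and'];

// Assumption: stopwords are removed before the n-grams are built.
$tokens        = array_values(array_diff(sketchTokenize($content), $stopwords));
$unigramCounts = array_count_values(sketchNGrams($tokens, 1));
$frequent      = sketchLimitByCount($unigramCounts, 2);

print_r($frequent); // unigrams that occur at least twice
```

Passing 2 instead of 1 to sketchNGrams yields bigrams built from the same token list. Whether a library removes stopwords before or after forming n-grams changes which bigrams appear, so check the package source before relying on either behavior.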

Resources