byjg/text-classifier

There is no license information available for the latest version (6.0.0) of this package.

A PHP text classifier supporting binary spam filtering (Robinson-Fisher Bayesian) and multi-class Naive Bayes classification, with optional LLM-assisted active learning fallback.

github.com/byjg/php-text-classifier

6.0.0 2026-03-07 17:51 UTC


text-classifier — Bayesian Text Classifier

A PHP library for statistical text classification. It provides two independent engines:

  • BinaryClassifier — Binary Robinson-Fisher Bayesian filter. Classifies text as spam or ham. Designed for high-accuracy two-class filtering with word degeneration support.
  • NaiveBayes — Multi-class Naive Bayes classifier. Classifies text into any number of user-defined categories. Suitable for language detection, topic tagging, content routing, and similar tasks.

Both engines return a ClassificationResult with the winning category, confidence score, and all category scores. Both support optional LLM injection for automatic escalation when the statistical model is uncertain — the LLM decision is fed back as training data, improving the model over time (active learning).
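
The escalation flow described above can be sketched without reference to the library's real API. The helper below is hypothetical: `classifyWithFallback`, its callable parameters, the `source` key, and the `0.8` threshold are all assumptions made for illustration, not names from byjg/text-classifier. It only shows the shape of the loop: trust the statistical model when it is confident, otherwise ask the LLM and feed its answer back as training data.

```php
<?php
/**
 * Hypothetical helper (not part of byjg/text-classifier).
 *
 * $classify: fn(string $text): array{choice: string, score: float}
 * $askLlm:   fn(string $text): string   label chosen by the LLM
 * $learn:    fn(string $text, string $label): void
 */
function classifyWithFallback(
    string $text,
    callable $classify,
    callable $askLlm,
    callable $learn,
    float $threshold = 0.8 // assumed confidence cut-off
): array {
    $result = $classify($text);

    // Confident enough: keep the statistical decision.
    if ($result['score'] >= $threshold) {
        return ['choice' => $result['choice'], 'score' => $result['score'], 'source' => 'model'];
    }

    // Uncertain: escalate to the LLM and train on its answer (active learning).
    $label = $askLlm($text);
    $learn($text, $label);

    return ['choice' => $label, 'score' => $result['score'], 'source' => 'llm'];
}

// Toy stand-ins for the model, the LLM, and the training call.
$learned = [];
$outcome = classifyWithFallback(
    'limited offer, reply now',
    fn (string $t): array => ['choice' => 'ham', 'score' => 0.55],
    fn (string $t): string => 'spam',
    function (string $t, string $label) use (&$learned): void {
        $learned[] = [$t, $label];
    }
);
```

Because every escalated text is learned, similar inputs should need the LLM less often as the training set grows.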

Both engines share the same tokenisation pipeline (StandardLexer, StandardDegenerator) and support pluggable storage backends (in-memory, SQLite, MySQL, PostgreSQL, GDBM).
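
To make "degeneration" concrete: in b8-style filters, a token that has never been seen is looked up again under simplified variants (different casing, trailing punctuation stripped) so that `Pills!!!` can still benefit from what was learned about `pills`. The function below is an illustrative sketch only; `degenerate` is a hypothetical name, and the exact variants produced by `StandardDegenerator` may differ.

```php
<?php
/**
 * Illustrative sketch, not the library's StandardDegenerator.
 * Returns fallback variants of a token, excluding the token itself.
 * Uses ASCII-only case functions to avoid depending on ext-mbstring.
 *
 * @return string[]
 */
function degenerate(string $token): array
{
    // Also consider the token without trailing emphasis: "pills!!!" -> "pills".
    $stripped = rtrim($token, '!?.');
    $bases = ($stripped !== '' && $stripped !== $token) ? [$token, $stripped] : [$token];

    $variants = [];
    foreach ($bases as $base) {
        $variants[] = strtolower($base);          // all lower case
        $variants[] = strtoupper($base);          // all upper case
        $variants[] = ucfirst(strtolower($base)); // first letter capitalised
    }

    // Drop duplicates and the original token, then reindex.
    return array_values(array_diff(array_unique($variants), [$token]));
}

$variants = degenerate('Pills!!!');
```

A storage backend would try the exact token first and fall back to these variants only when the exact token is unknown.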

Installation

composer require byjg/text-classifier

Requires PHP >=8.3. The GDBM storage backend additionally requires ext-dba.

Quick Example

Spam filter:

use ByJG\TextClassifier\BinaryClassifier;
use ByJG\TextClassifier\ConfigBinaryClassifier;
use ByJG\TextClassifier\Lexer\StandardLexer;
use ByJG\TextClassifier\Lexer\ConfigLexer;
use ByJG\TextClassifier\Degenerator\StandardDegenerator;
use ByJG\TextClassifier\Degenerator\ConfigDegenerator;
use ByJG\TextClassifier\Storage\Rdbms;
use ByJG\Util\Uri;

$storage = new Rdbms(new Uri('sqlite:///tmp/spam.db'), new StandardDegenerator(new ConfigDegenerator()));
$storage->createDatabase();

$classifier = new BinaryClassifier(new ConfigBinaryClassifier(), $storage, new StandardLexer(new ConfigLexer()));

$classifier->learn('Buy cheap pills now!!!', BinaryClassifier::SPAM);
$classifier->learn('Meeting at 3pm in the conference room', BinaryClassifier::HAM);

$result = $classifier->classify('buy pills online cheap');
// $result->choice === 'spam'
// $result->score  is close to 1.0
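
For intuition about why the score lands close to 1.0, here is a sketch of the textbook Fisher chi-square combination of per-token spam probabilities, in the form popularised by Gary Robinson. This is an assumption about the general technique, not a copy of the library's code: `chi2Q`, `fisherCombine`, and the clamping bounds are illustrative, and BinaryClassifier may compute its score differently.

```php
<?php
/** Chi-square survival function P(X >= $x2) for an even number of degrees of freedom. */
function chi2Q(float $x2, int $df): float
{
    $m = $x2 / 2.0;
    $term = exp(-$m);
    $sum = $term;
    for ($i = 1; $i < intdiv($df, 2); $i++) {
        $term *= $m / $i;
        $sum += $term;
    }
    return min($sum, 1.0);
}

/**
 * Combine per-token spam probabilities into one score in [0, 1].
 * Near 1.0 suggests spam, near 0.0 suggests ham, around 0.5 is undecided.
 *
 * @param float[] $probs
 */
function fisherCombine(array $probs): float
{
    $n = count($probs);
    if ($n === 0) {
        return 0.5; // no evidence either way
    }

    $lnSpam = 0.0;
    $lnHam = 0.0;
    foreach ($probs as $p) {
        $p = min(max($p, 0.0001), 0.9999); // assumed clamp, avoids log(0)
        $lnSpam += log($p);
        $lnHam += log(1.0 - $p);
    }

    $s = 1.0 - chi2Q(-2.0 * $lnHam, 2 * $n);  // evidence for spam
    $h = 1.0 - chi2Q(-2.0 * $lnSpam, 2 * $n); // evidence for ham

    return (1.0 + $s - $h) / 2.0;
}

$spammy = fisherCombine([0.99, 0.99, 0.99]);
$neutral = fisherCombine([0.5, 0.5, 0.5]);
```

Tokens that agree strongly push the result toward 0 or 1, while conflicting or neutral tokens leave it near 0.5, which is the region where an escalation to a second opinion makes sense.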

Multi-class classifier:

use ByJG\TextClassifier\NaiveBayes\NaiveBayes;
use ByJG\TextClassifier\NaiveBayes\Storage\Memory;
use ByJG\TextClassifier\Lexer\StandardLexer;
use ByJG\TextClassifier\Lexer\ConfigLexer;

$nb = new NaiveBayes(new Memory(), new StandardLexer(new ConfigLexer()));

$nb->train('PHP is a programming language', 'tech');
$nb->train('The cat sat on the mat', 'animals');

$result = $nb->classify('programming language');
// $result->choice === 'tech'
// $result->score  is the winning category's confidence, e.g. 0.93 (illustrative value)
// $result->scores holds every category's score, e.g. ['tech' => 0.93, 'animals' => 0.07]
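
How per-category scores that sum to 1 can arise is easiest to see in a standalone sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. Everything here is an assumption for illustration: `naiveBayesScores` and its array layout are hypothetical, and the library's NaiveBayes engine may use different smoothing, priors, or normalisation.

```php
<?php
/**
 * Sketch of multinomial Naive Bayes scoring with add-one smoothing.
 *
 * @param array<string, array<string, int>> $counts category => token => occurrences
 * @param array<string, int>                $docs   category => number of training texts
 * @param string[]                          $tokens tokens of the text to classify
 * @return array<string, float> normalised scores, best category first
 */
function naiveBayesScores(array $counts, array $docs, array $tokens): array
{
    // Vocabulary size = number of distinct tokens across all categories.
    $vocab = [];
    foreach ($counts as $tokenCounts) {
        $vocab += $tokenCounts;
    }
    $v = count($vocab);
    $totalDocs = array_sum($docs);

    $logScores = [];
    foreach ($counts as $cat => $tokenCounts) {
        $total = array_sum($tokenCounts);
        $log = log($docs[$cat] / $totalDocs); // category prior
        foreach ($tokens as $t) {
            $log += log((($tokenCounts[$t] ?? 0) + 1) / ($total + $v));
        }
        $logScores[$cat] = $log;
    }

    // Convert log scores to probabilities; subtracting the max avoids underflow.
    $max = max($logScores);
    $exp = array_map(fn (float $l): float => exp($l - $max), $logScores);
    $sum = array_sum($exp);
    $scores = array_map(fn (float $e): float => $e / $sum, $exp);

    arsort($scores);
    return $scores;
}

// Token counts mirroring the two training sentences above (lower-cased by hand).
$counts = [
    'tech'    => ['php' => 1, 'is' => 1, 'a' => 1, 'programming' => 1, 'language' => 1],
    'animals' => ['the' => 2, 'cat' => 1, 'sat' => 1, 'on' => 1, 'mat' => 1],
];
$scores = naiveBayesScores($counts, ['tech' => 1, 'animals' => 1], ['programming', 'language']);
```

With these counts the sketch puts `tech` first at roughly 0.82. That differs from the 0.93 in the example comments because the smoothing and tokenisation here are simplified assumptions.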

Documentation

| Section | Description |
|---|---|
| Getting Started | Installation, requirements, first working example |
| Guides: Spam Filter | Training, classifying, choosing storage |
| Guides: Multi-class | Training categories, classifying, persistence |
| Guide: LLM-Assisted Classification | Automatic LLM fallback and active learning |
| Concepts | How the algorithms work, architecture overview |
| Reference | Full API, configuration parameters, error codes |

Acknowledgements

This library is inspired by the original b8 spam filter written by Tobias Leupold. The core algorithm, Robinson-Fisher probability model, token degeneration approach, and the tc* internal variable convention all originate from his work. This project modernises the codebase for PHP 8.3+, replaces the storage layer with byjg/micro-orm and byjg/migration, and adds a multi-class NaiveBayes engine built on the same tokenisation pipeline.

Dependencies

flowchart TD
    byjg/text-classifier --> byjg/micro-orm
    byjg/text-classifier --> byjg/migration
    byjg/text-classifier --> byjg/llm-api-objects
    byjg/text-classifier --> openai-php/client