coral-media/php-ir

Information Retrieval algorithms (vector space, similarity, clustering)

pkg:composer/coral-media/php-ir

v0.7.1 2026-01-07 12:51 UTC

This package is auto-updated.

Last update: 2026-01-07 12:52:32 UTC


PHP-IR is a modern, research-oriented Information Retrieval (IR) and Vector Space Modeling library for PHP, focused on correctness, transparency, and theoretical grounding.

It provides low-level, composable primitives for text representation, weighting, similarity, clustering, and evaluation, designed for engineers who need full control and explainability, not opaque ML abstractions.

Why PHP-IR exists

The PHP ecosystem has historically lacked serious IR tooling beyond thin wrappers around search engines. PHP-IR fills that gap by offering:

  • Explicit vector space modeling
  • Reproducible term weighting pipelines
  • Deterministic clustering algorithms
  • Quantitative cluster quality metrics
  • APIs aligned with Information Retrieval literature

The goal is scientifically correct, inspectable IR workflows rather than convenience-first APIs.

Core capabilities

Text processing

  • Tokenization (regex, whitespace)
  • Text normalization (lowercasing, accent folding, composition)
  • Stop-word filtering with language support (English, Spanish)
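
As a rough illustration of these three steps, here is a minimal sketch in plain PHP. It does not use PHP-IR's classes: the function names normalize, tokenize and removeStopWords, the tiny folding table and the stop-word list are all assumptions made for this example. It assumes PHP 7.4+ with the mbstring extension and a UTF-8 source file.

```php
<?php
declare(strict_types=1);

/** Lowercase the text and fold a few accented vowels (illustrative table only). */
function normalize(string $text): string
{
    $text = mb_strtolower($text, 'UTF-8');
    // Hand-written folding table; a real pipeline would cover far more characters.
    return strtr($text, ['á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ü' => 'u']);
}

/** Regex tokenizer: split on every run of characters that are not letters or digits. */
function tokenize(string $text): array
{
    // \p{L} matches any letter, \p{N} any number; the "u" flag enables UTF-8 mode.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

/** Drop every token that appears in the stop-word list. */
function removeStopWords(array $tokens, array $stopWords): array
{
    $lookup = array_flip($stopWords); // constant-time membership checks
    return array_values(array_filter($tokens, fn (string $t): bool => !isset($lookup[$t])));
}

// Hypothetical mixed English/Spanish stop-word list, for illustration only.
$stopWords = ['the', 'a', 'of', 'el', 'la', 'de'];
$tokens = removeStopWords(tokenize(normalize('La Información de la Búsqueda')), $stopWords);
print_r($tokens); // informacion, busqueda
```

Normalizing before tokenizing keeps the stop-word lookup a plain string comparison, which is one reason the order of these steps matters.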

Vocabulary & statistics

  • Vocabulary construction
  • Document frequency tracking
  • IDF computation (per-term and vectorized)
  • Corpus-level statistics via dedicated façades, kept separate from the core classes
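
To make document frequency and IDF concrete, the sketch below uses the classic formulation idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t. Whether PHP-IR uses this exact variant (or a smoothed one) is not stated here, and the function names are hypothetical, not the library's API.

```php
<?php
declare(strict_types=1);

/**
 * Count, for every term, the number of documents that contain it.
 *
 * @param array $documents list of tokenized documents (each a list of strings)
 * @return array term => document frequency
 */
function documentFrequencies(array $documents): array
{
    $df = [];
    foreach ($documents as $tokens) {
        // array_unique ensures a term counts once per document, however often it repeats.
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    return $df;
}

/** Classic unsmoothed IDF: log(N / df). Terms present in every document get 0. */
function inverseDocumentFrequencies(array $df, int $documentCount): array
{
    $idf = [];
    foreach ($df as $term => $count) {
        $idf[$term] = log($documentCount / $count);
    }
    return $idf;
}

$documents = [
    ['vector', 'space', 'model'],
    ['vector', 'similarity'],
    ['cluster', 'vector', 'space'],
];
$df  = documentFrequencies($documents);
$idf = inverseDocumentFrequencies($df, count($documents));

printf("df(vector)=%d idf(vector)=%.3f\n", $df['vector'], $idf['vector']); // 3 and 0.000
printf("df(space)=%d idf(space)=%.3f\n", $df['space'], $idf['space']);     // 2 and 0.405
```

The vocabulary is simply the key set of $df, so vocabulary construction and document frequency tracking fall out of the same pass over the corpus.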

Vectorization

  • Sparse and dense vector representations
  • Term Frequency (TF)
  • TF-IDF weighting
  • Spherical (L2-normalized) vector spaces
  • Explicit densification for algorithms that require fixed dimensions
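
The relationship between these representations can be sketched as follows: a sparse TF-IDF vector is a term => weight map, L2 normalization projects it onto the unit sphere, and densification expands it against a fixed vocabulary order. The function names and the IDF values below are made up for the example and are not PHP-IR's API.

```php
<?php
declare(strict_types=1);

/** Sparse TF-IDF vector (term => weight); zero weights are not stored. */
function tfIdf(array $tokens, array $idf): array
{
    $vector = [];
    foreach (array_count_values($tokens) as $term => $tf) {
        $weight = $tf * ($idf[$term] ?? 0.0); // raw term count times IDF
        if ($weight != 0) {
            $vector[$term] = $weight;
        }
    }
    return $vector;
}

/** Scale a sparse vector to unit Euclidean length; the zero vector is returned unchanged. */
function l2Normalize(array $vector): array
{
    $norm = sqrt(array_sum(array_map(fn ($w) => $w * $w, $vector)));
    if ($norm == 0) {
        return $vector;
    }
    return array_map(fn ($w) => $w / $norm, $vector); // string keys are preserved
}

/** Expand a sparse vector into a fixed-length list following the vocabulary order. */
function densify(array $sparse, array $vocabulary): array
{
    $dense = [];
    foreach ($vocabulary as $term) {
        $dense[] = $sparse[$term] ?? 0.0;
    }
    return $dense;
}

// Illustrative IDF values chosen so the arithmetic is easy to follow.
$idf        = ['vector' => 0.0, 'space' => 3.0, 'model' => 4.0];
$vocabulary = ['vector', 'space', 'model'];

$sparse = tfIdf(['vector', 'space', 'model', 'vector'], $idf); // space => 3, model => 4
$unit   = l2Normalize($sparse);                                // norm is 5
$dense  = densify($unit, $vocabulary);
print_r($dense); // approximately [0, 0.6, 0.8]
```

Keeping densification as a separate, explicit step means the sparse form can be used for similarity computations, while only algorithms that need fixed dimensions pay for the dense form.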

Similarity

  • Cosine similarity
  • Pluggable similarity interfaces
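
One way to picture a pluggable similarity is an interface with a single method, with cosine similarity as one implementation. The interface name Similarity, the method between and the helper rank are assumptions for this sketch; PHP-IR's real contracts may look different.

```php
<?php
declare(strict_types=1);

/** Hypothetical contract: any measure that scores two sparse vectors. */
interface Similarity
{
    public function between(array $a, array $b): float;
}

/** Cosine similarity: dot product divided by the product of the Euclidean norms. */
final class CosineSimilarity implements Similarity
{
    public function between(array $a, array $b): float
    {
        $dot = 0.0;
        foreach ($a as $term => $weight) {
            $dot += $weight * ($b[$term] ?? 0.0); // terms missing from $b contribute nothing
        }
        $norms = self::norm($a) * self::norm($b);
        return $norms > 0.0 ? $dot / $norms : 0.0; // define similarity with a zero vector as 0
    }

    private static function norm(array $vector): float
    {
        $sum = 0.0;
        foreach ($vector as $weight) {
            $sum += $weight * $weight;
        }
        return sqrt($sum);
    }
}

/** Score every document against the query and sort by descending similarity. */
function rank(Similarity $similarity, array $query, array $documents): array
{
    $scores = [];
    foreach ($documents as $id => $vector) {
        $scores[$id] = $similarity->between($query, $vector);
    }
    arsort($scores);
    return $scores;
}

$query     = ['vector' => 1.0, 'space' => 1.0];
$documents = [
    'd1' => ['vector' => 1.0, 'space' => 1.0], // same direction as the query
    'd2' => ['vector' => 1.0],                 // shares one of two terms
    'd3' => ['cluster' => 1.0],                // no shared terms
];
$scores = rank(new CosineSimilarity(), $query, $documents);
print_r(array_keys($scores)); // d1, d2, d3
```

Because rank depends only on the interface, swapping in another measure requires no change to the ranking code.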

Clustering

  • Spherical K-Means
  • Spherical K-Medians (robust to outliers)
  • Deterministic centroid update strategies
  • Explicit iteration control
  • Centroid initialization and update policies
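
For readers less familiar with the spherical variant, the following is a minimal sketch of Spherical K-Means on dense vectors: inputs are projected onto the unit sphere, each vector is assigned to the centroid with the highest cosine (a plain dot product for unit vectors), and each centroid is updated to the renormalized sum of its members. Seeding from the first k vectors, lowest-index tie-breaking and the early stop on unchanged assignments are choices made for this example, not a description of PHP-IR's policies. It assumes k does not exceed the number of vectors.

```php
<?php
declare(strict_types=1);

function dot(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

/** Scale a dense vector to unit length; the zero vector is returned unchanged. */
function unit(array $v): array
{
    $norm = sqrt(dot($v, $v));
    return $norm > 0.0 ? array_map(fn ($x) => $x / $norm, $v) : $v;
}

/** Returns ['assignments' => cluster index per vector, 'centroids' => unit centroids]. */
function sphericalKMeans(array $vectors, int $k, int $maxIterations = 20): array
{
    $vectors = array_map('unit', $vectors);
    $centroids = array_slice($vectors, 0, $k); // deterministic seeding: first k vectors
    $assignments = [];

    for ($iteration = 0; $iteration < $maxIterations; $iteration++) {
        // Assignment step: pick the centroid with the highest cosine similarity.
        $next = [];
        foreach ($vectors as $i => $vector) {
            $best = 0;
            $bestScore = -INF;
            foreach ($centroids as $c => $centroid) {
                $score = dot($vector, $centroid);
                if ($score > $bestScore) { // strict ">" so ties go to the lowest index
                    $bestScore = $score;
                    $best = $c;
                }
            }
            $next[$i] = $best;
        }
        if ($next === $assignments) {
            break; // nothing moved, so further iterations would change nothing
        }
        $assignments = $next;

        // Update step: renormalized sum of members; empty clusters keep their centroid.
        foreach ($centroids as $c => $centroid) {
            $sum = array_fill(0, count($centroid), 0.0);
            $members = 0;
            foreach ($vectors as $i => $vector) {
                if ($assignments[$i] !== $c) {
                    continue;
                }
                foreach ($vector as $d => $value) {
                    $sum[$d] += $value;
                }
                $members++;
            }
            if ($members > 0) {
                $centroids[$c] = unit($sum);
            }
        }
    }
    return ['assignments' => $assignments, 'centroids' => $centroids];
}

$result = sphericalKMeans([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]], 2);
print_r($result['assignments']); // 0, 1, 0, 1
```

A K-Medians variant would typically replace the sum in the update step with a per-coordinate median before renormalizing, which limits the pull of outlying members on the centroid.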

Cluster evaluation

  • Intra-cluster cohesion
  • Inter-cluster separation
  • Global quality score aligned with IR theory
  • Metrics designed for algorithm comparison, not just reporting
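
As an illustration only, cohesion and separation are often defined as below for unit-length vectors: cohesion as the mean cosine between each vector and its assigned centroid, separation as the mean cosine distance (1 - cosine) over all centroid pairs. These definitions and function names are assumptions for the sketch; PHP-IR's exact formulas and its global quality score are not reproduced here.

```php
<?php
declare(strict_types=1);

function dot(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

/** Mean cosine between each unit vector and its assigned unit centroid (higher is tighter). */
function cohesion(array $vectors, array $assignments, array $centroids): float
{
    if ($vectors === []) {
        return 0.0;
    }
    $total = 0.0;
    foreach ($vectors as $i => $vector) {
        $total += dot($vector, $centroids[$assignments[$i]]);
    }
    return $total / count($vectors);
}

/** Mean cosine distance over all centroid pairs (higher means clusters are further apart). */
function separation(array $centroids): float
{
    $centroids = array_values($centroids);
    $k = count($centroids);
    $total = 0.0;
    $pairs = 0;
    for ($a = 0; $a < $k; $a++) {
        for ($b = $a + 1; $b < $k; $b++) {
            $total += 1.0 - dot($centroids[$a], $centroids[$b]);
            $pairs++;
        }
    }
    return $pairs > 0 ? $total / $pairs : 0.0; // a single cluster has no pairs
}

// Hand-picked unit vectors and centroids, for illustration only.
$vectors     = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]];
$assignments = [0, 0, 1];
$centroids   = [[1.0, 0.0], [0.0, 1.0]];

$cohesionScore   = cohesion($vectors, $assignments, $centroids);
$separationScore = separation($centroids);
printf("cohesion=%.3f separation=%.3f\n", $cohesionScore, $separationScore); // 0.867 and 1.000
```

Computing both numbers from the same assignments and centroids makes it possible to compare two clustering runs on equal terms, for example K-Means against K-Medians on the same corpus.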

Design philosophy

PHP-IR is intentionally not:

  • A search engine
  • A machine learning framework
  • A black-box clustering toolkit

Instead, it provides clear, inspectable building blocks that let you:

  • Reason about every step of the IR pipeline
  • Swap strategies without side effects
  • Validate theoretical assumptions with executable code
  • Compare algorithms using quantitative invariants

If you are familiar with TF-IDF, cosine similarity, and clustering theory, PHP-IR should feel predictable and rigorous.

Theoretical foundation

The library is grounded in classical and modern IR research.

Current status

  • Actively developed
  • API stabilized through real-world usage
  • Strong test coverage with invariant-based tests
  • English and Spanish corpora used for validation
  • Designed to evolve without breaking theoretical guarantees

Detailed documentation, examples, and usage guides will be added incrementally.

Roadmap (high level)

  • Advanced convergence criteria beyond fixed iteration limits
  • Additional robustness heuristics for clustering
  • Optional serialization of evaluation artifacts
  • Extended language tooling and corpora support

License

MIT License.
Use it, extend it, and build on it responsibly.