token27/nexus-ai-tokenizer

Provider-agnostic LLM token counter for PHP 8.3+. Supports Tiktoken BPE (OpenAI, Claude approx.), HuggingFace JSON vocabularies (DeepSeek, LLaMA, Mistral), SentencePiece (Gemini), and extensible custom strategies.

Maintainers

Package info

github.com/token27/nexus-ai-tokenizer

pkg:composer/token27/nexus-ai-tokenizer

Statistics

Installs: 25

Dependents: 1

Suggesters: 3

Stars: 0

Open Issues: 0

v1.0.0 2026-05-20 06:06 UTC

This package is auto-updated.

Last update: 2026-05-21 06:34:05 UTC


README

CI PHPStan Level 8 Latest Version PHP 8.3+ License: MIT Tests

A universal, multi-provider PHP 8.3+ token counting and context estimation library. Manage and estimate context windows across fragmented AI models with an elegant and extensible API.

Why nexus-ai-tokenizer?

Different providers use completely different tokenization algorithms. OpenAI relies on Tiktoken (cl100k_base, o200k_base), DeepSeek uses HuggingFace BPE, Gemini uses SentencePiece, and Claude has a proprietary BPE.

nexus-ai-tokenizer solves this fragmentation by:

  • Zero-config OpenAI counting: Built-in, fast, exact token counting for all modern OpenAI models (gpt-4o, o1, gpt-3.5-turbo, etc.).
  • Consistent Interface: Handle any provider with one method (TokenizerEngine::for('model')).
  • Graceful Approximations: Seamlessly approximate limits for closed tokenizers (like Claude) using cl100k_base.
  • Exact Native Integration: Directly load .json HuggingFace and .model SentencePiece vocabularies when you need 100% exact math.
  • Multimodal Counting: Translates image resolutions and detail settings into exact token costs across providers.
  • Batching & Context Windows: Easy percentage checks, isWithinContextWindow(), and batch token processing.

Features

  • Built-in Catalog: Out-of-the-box support mapping 50+ mainstream models to the correct strategy.
  • Tiktoken Strategy: Rapid exact counts for o200k_base, cl100k_base, and more.
  • ChatML & Conversation Overhead: countChat() handles exact provider metadata framing.
  • Extensible Architecture: Define your own custom tokenizer strategies and dynamic providers.
  • Type Safety: PHPStan Level 8, immutable Value Objects.

Installation

composer require token27/nexus-ai-tokenizer

Requires: PHP 8.3+

Note: For exact SentencePiece tokenization (e.g. Gemini/Gemma), the textualization/sentencepiece extension is optionally required. For HuggingFace vocabularies, no extensions are needed.

Quick Start

1. Zero-config Token Counting

Count exact tokens for any OpenAI model right out of the box:

use Token27\Tokenizer\Engine\TokenizerEngine;

$text = 'Hello world! This is a test of the token counting library.';

// Automatically resolves 'gpt-4o' to Tiktoken (o200k_base)
$count = TokenizerEngine::for('gpt-4o')->count($text);

echo $count->count(); // 14
echo $count->strategy(); // tiktoken:o200k_base
echo $count->isApproximate() ? 'Yes' : 'No'; // No

2. Multi-turn Chat Counting

Accurately calculate token payloads including provider-specific overheads (role markers, ChatML syntax):

$messages = [
    ['role' => 'system', 'content' => 'You are a helpful assistant.'],
    ['role' => 'user', 'content' => 'What is tokenization?'],
];

$chat = TokenizerEngine::for('gpt-4-turbo')->countChat($messages);

echo $chat->contentTokens();  // 12
echo $chat->overheadTokens(); // 11
echo $chat->count();          // 23

3. Context Window Management

Verify if prompts will fit within a provider's window limit to handle automatic truncation or provider switching:

$count = TokenizerEngine::for('claude-3-opus-20240229')->count($hugeDocument);
$limit = 200_000;

if ($count->isWithinContextWindow($limit)) {
    echo "Fits! " . $count->remainingTokens($limit) . " tokens left.";
} else {
    echo "Too large by " . ($count->count() - $limit) . " tokens.";
}

4. Multimodal Image Tokens

Translates image dimensions to required token spend:

// OpenAI High-detail image token math
$imageCost = TokenizerEngine::for('gpt-4o')->estimateImage(1920, 1080, 'high');
echo $imageCost->count(); // e.g. 1105 tokens

// Anthropic logic
$claudeCost = TokenizerEngine::for('claude-sonnet-4-20250514')->estimateImage(1920, 1080);
echo $claudeCost->count(); // e.g. 2765 tokens

Documentation

License

MIT. See LICENSE.