webrek / tokenizers
Native byte-level BPE (tiktoken-compatible: cl100k_base/o200k_base) plus WordPiece and Unigram tokenizers for PHP 8.3+, with a process-shared vocab cache. Loads HuggingFace tokenizer.json models and counts Claude/Gemini tokens via their APIs.
Package info
Language:C
Type:php-ext
Ext name:ext-tokenizers
pkg:composer/webrek/tokenizers
Requires
- php: >=8.3
README
A native PHP extension that counts, encodes, and decodes LLM tokens — byte-exact with the reference tokenizers, fast, and with no Rust toolchain. Plus a pure-PHP companion that counts Claude and Gemini tokens through their official APIs.
Think of it as a scale for AI text: before you send a prompt to an LLM, weigh it in tokens so you know what it will cost and whether it fits the model's context window — all from PHP, exactly, without rebuilding a 26 MB vocabulary on every request.
use Tokenizers\Encoding; $enc = Encoding::load('cl100k_base'); // GPT-4 / GPT-4o-class encoding echo $enc->countTokens('Hello, world! 🎉'); // 7
Why this extension?
| Property | Pure-PHP tiktoken port | This extension |
|---|---|---|
| Memory per worker | ~26 MB vocab rebuilt every request | Loaded once per worker process |
| Worst-case latency | O(n²) per pre-token piece | O(n log n) heap-based merge |
| Install | Pure PHP | Single .so — no Rust, no ffi.enable |
| Accuracy | Approximate | Byte-exact vs the reference (tiktoken / BERT / T5) |
| Models | OpenAI only | OpenAI + BERT + T5 + Llama/Mistral/Qwen… + Claude/Gemini via API |
The wins are memory, worst-case latency, accuracy, and installability — not raw throughput on tiny inputs. For prompts that fit in a tweet, a pure-PHP port can be faster (no extension-call overhead). This extension is for workloads where the 26 MB-per-worker overhead, adversarial inputs, or byte-exactness actually matter.
Supported models
Locally, byte-exact
| Algorithm | Models | How to load |
|---|---|---|
| BPE (tiktoken) | GPT-4, GPT-4o (text), o1, o3 | Encoding::load('cl100k_base') |
| BPE (tiktoken) | GPT-4o (multimodal), o1 mini/pro | Encoding::load('o200k_base') |
| BPE (HuggingFace) | GPT-2, RoBERTa, Llama 3, Mistral (tekken), Qwen, DeepSeek | Encoding::fromHuggingFace('tokenizer.json') |
| WordPiece | BERT family (bert-base-uncased, …) | Encoding::fromHuggingFace('tokenizer.json') |
| Unigram | T5, ALBERT (SentencePiece) | Encoding::fromHuggingFace('tokenizer.json') |
Conformance is verified byte-for-byte against Python tiktoken, HuggingFace
BertTokenizerFast, and t5-small. Any diff against the committed fixtures fails
CI. See Status & conformance.
Remotely (no public tokenizer)
Claude 3+ and Gemini do not publish their tokenizers — there is no local
vocabulary to load. The pure-PHP companion (Tokenizers\Remote\Anthropic,
Tokenizers\Remote\Gemini, Tokenizers\TokenCounter) counts their tokens through
the providers' official count_tokens endpoints. It works without building the
C extension. See the Remote providers guide.
Install
phpize && ./configure && make && make install
Then enable it in your php.ini:
extension=tokenizers
Verify:
php -m | grep tokenizers # → tokenizers
pecl install tokenizers and pie install webrek/tokenizers are also supported.
Full instructions, requirements (libpcre2), and troubleshooting are in
Getting Started.
Quick start
use Tokenizers\Encoding; // OpenAI encoding (vocab downloads + caches on first use) $enc = Encoding::load('cl100k_base'); $n = $enc->countTokens($prompt); // count without allocating the id array $ids = $enc->encode($prompt); // int[] $str = $enc->decode($ids); // round-trips for plain text // HuggingFace model — returns Bpe | WordPiece | Unigram by model type $bert = Encoding::fromHuggingFace('/path/to/bert/tokenizer.json'); $t5 = Encoding::fromHuggingFace('/path/to/t5/tokenizer.json'); // One facade for local + remote, routed by model name use Tokenizers\TokenCounter; $tc = new TokenCounter(); $tc->count('cl100k_base', $text); // local BPE, no key $tc->count('claude-opus-4-8', $text); // remote Anthropic (needs ANTHROPIC_API_KEY) $tc->count('gemini-1.5-flash',$text); // remote Gemini (needs GEMINI_API_KEY)
Documentation
| Guide | What it covers |
|---|---|
| Getting Started | Install, enable, verify, first tokenization, troubleshooting |
| Loading models | OpenAI / HuggingFace BPE, WordPiece, Unigram, options, the cache |
| Estimating LLM costs | Budget spend, fit context windows, track usage per client |
| Remote providers (Claude / Gemini) | Counting tokens via the provider APIs |
| API reference | Every class, method, and function |
| Status & limitations | What's verified, conformance results, honest limits, roadmap |
Project status
v0.1.0, early but functional. All three planned phases are complete and merged:
- BPE (cl100k_base, o200k_base, HuggingFace BPE) — byte-exact, O(n log n) merge.
- WordPiece (BERT) and Unigram (T5/SentencePiece) — byte-exact.
- Claude / Gemini API companion — pure PHP, standalone.
Honest caveats live in Status & limitations (normalization scope, PIE install not yet verified end-to-end, remote counting needs a network call + key).
License
Apache-2.0 — see LICENSE. Vocabulary files for built-in encodings are downloaded from OpenAI's public CDN at runtime, checksum-verified, and not redistributed with the extension.