codewithkyrian / tokenizers
Fast, pure-PHP tokenizer library compatible with Hugging Face tokenizers for encoding and decoding text
pkg:composer/codewithkyrian/tokenizers
Requires
- php: ^8.1
- php-http/discovery: ^1.19
- psr/http-client: ^1.0
- psr/http-factory: ^1.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.91
- guzzlehttp/guzzle: ^7.0
- pestphp/pest: ^2.36.0|^3.5.0
- phpstan/phpstan: ^2.1
- symfony/var-dumper: ^6.4.11|^7.1.5|^8.0
README
Tokenizers PHP
Tokenizers PHP is a lightweight, dependency-free PHP library for tokenizing text using the same tokenizers powering models on the Hugging Face Hub. Whether you're building LLM applications, search systems, or text processing pipelines, this library provides fast, accurate tokenization that matches the original model implementations.
Highlights
- Pure PHP — No FFI, no external binaries, no compiled extensions. Works everywhere PHP runs.
- Zero Hard Dependencies — Core tokenization has no required dependencies. Optional HTTP client needed only for Hub downloads.
- Hub Compatible — Load tokenizers directly from Hugging Face Hub or from local files.
- Fully Tested — Validated against BERT, GPT-2, Llama, Gemma, Qwen, RoBERTa, ALBERT, and more.
- Modern PHP — Built for PHP 8.1+ with strict types, readonly properties, and clean interfaces.
Installation
Install via Composer:
composer require codewithkyrian/tokenizers
HTTP Client (Optional)
If you plan to load tokenizers from the Hugging Face Hub, you'll need an HTTP client implementing PSR-18. We recommend Guzzle:
composer require guzzlehttp/guzzle
Note: The library uses PHP-HTTP Discovery to automatically find and use any PSR-18 compatible HTTP client installed in your project. If you're only loading tokenizers from local files, no HTTP client is needed.
Quick Start
```php
use Codewithkyrian\Tokenizers\Tokenizer;

// Load a tokenizer from Hugging Face Hub
$tokenizer = Tokenizer::fromHub('bert-base-uncased');

// Encode text to token IDs
$encoding = $tokenizer->encode('Hello, how are you?');
echo implode(', ', $encoding->ids);    // 101, 7592, 1010, 2129, 2024, 2017, 1029, 102
echo implode(', ', $encoding->tokens); // [CLS], hello, ,, how, are, you, ?, [SEP]

// Decode token IDs back to text (special tokens are skipped by default)
$text = $tokenizer->decode($encoding->ids);
echo $text; // "hello, how are you?"
```
Loading Tokenizers
Tokenizers PHP provides multiple ways to load tokenizers depending on your use case.
From Hugging Face Hub
Load any tokenizer from the Hugging Face Hub by providing the model ID:
```php
use Codewithkyrian\Tokenizers\Tokenizer;

// Load a popular model
$tokenizer = Tokenizer::fromHub('bert-base-uncased');

// Load a model from an organization
$tokenizer = Tokenizer::fromHub('meta-llama/Llama-3.1-8B-Instruct');

// With options
$tokenizer = Tokenizer::fromHub(
    modelId: 'openai/gpt-oss-20b',
    cacheDir: '/path/to/cache', // Custom cache directory
    revision: 'main',           // Branch, tag, or commit hash
    token: 'hf_...'             // Auth token for private models
);
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `modelId` | `string` | — | The model identifier on Hugging Face Hub (e.g., `bert-base-uncased` or `org/model-name`) |
| `cacheDir` | `?string` | `null` | Custom directory for caching downloaded files. Defaults to the system cache directory |
| `revision` | `?string` | `'main'` | Specific version to load — can be a branch name, tag, or commit hash |
| `token` | `?string` | `null` | Hugging Face authentication token for accessing private or gated models |
Cache Directory Resolution
When `cacheDir` is not specified, the library resolves the cache location in this order:

- Environment Variable — `TOKENIZERS_CACHE` if set
- macOS — `~/Library/Caches/huggingface/tokenizers`
- Linux — `$XDG_CACHE_HOME/huggingface/tokenizers` or `~/.cache/huggingface/tokenizers`
- Windows — `%LOCALAPPDATA%\huggingface\tokenizers`
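If you want a project-local cache without passing `cacheDir` on every call, one option is to set the environment variable before loading. This is a minimal sketch, assuming the library reads `TOKENIZERS_CACHE` via `getenv()`; the path shown is illustrative:

```php
use Codewithkyrian\Tokenizers\Tokenizer;

// Assumption: the library reads TOKENIZERS_CACHE through getenv(), so putenv() is enough
putenv('TOKENIZERS_CACHE=' . __DIR__ . '/storage/tokenizers'); // illustrative path

// Subsequent Hub loads that omit cacheDir will cache into the directory above
$tokenizer = Tokenizer::fromHub('bert-base-uncased');
```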
From Local Files
Load tokenizers from local JSON files:
```php
use Codewithkyrian\Tokenizers\Tokenizer;

// Single file (tokenizer.json with all config merged)
$tokenizer = Tokenizer::fromFile('/path/to/tokenizer.json');

// Multiple files (configs are merged, later files override earlier ones)
$tokenizer = Tokenizer::fromFile(
    '/path/to/tokenizer.json',
    '/path/to/tokenizer_config.json'
);
```
This is useful when you've downloaded model files manually or are working in an offline environment.
From Configuration Array
Build a tokenizer from a raw configuration array:
```php
use Codewithkyrian\Tokenizers\Tokenizer;

$config = json_decode(file_get_contents('tokenizer.json'), true);
$tokenizer = Tokenizer::fromConfig($config);
```
Universal Loader
The load() method provides a convenient unified interface:
```php
use Codewithkyrian\Tokenizers\Tokenizer;

// Automatically detects the source type
$tokenizer = Tokenizer::load('bert-base-uncased');       // From Hub
$tokenizer = Tokenizer::load('/path/to/tokenizer.json'); // From file
$tokenizer = Tokenizer::load($configArray);              // From array
```
Accessing Configuration
The tokenizer stores its configuration and provides access via getConfig():
```php
$tokenizer = Tokenizer::fromHub('bert-base-uncased');

// Get a specific config value
$maxLength = $tokenizer->getConfig('model_max_length');             // 512
$cleanup = $tokenizer->getConfig('clean_up_tokenization_spaces');   // true
$custom = $tokenizer->getConfig('unknown_key', 'default');          // 'default'

// Convenience property for model_max_length
echo $tokenizer->modelMaxLength; // 512

// Get all configuration (pass null or no arguments)
$allConfig = $tokenizer->getConfig();
```
Common configuration keys:
- `model_max_length` — Maximum sequence length
- `remove_space` — Whether to remove leading/trailing spaces
- `do_lowercase_and_remove_accent` — Whether to lowercase and strip accents
- `clean_up_tokenization_spaces` — Whether to clean up spaces during decoding
Note: `model_max_length` is the tokenizer's configured max length, not necessarily the model's actual context window. For most models, these are the same. However, some tokenizers (like Llama 3) set this to an extremely large value. When building applications, you may want to use known context window limits for specific models rather than relying solely on this value.
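As a hedged sketch of that advice, you can cap the configured value with a context window you already trust for the model in question; the 8192 figure below is purely illustrative and not something the library provides:

```php
use Codewithkyrian\Tokenizers\Tokenizer;

$tokenizer = Tokenizer::fromHub('meta-llama/Llama-3.1-8B-Instruct');

// Illustrative assumption: a context window taken from the model's own documentation
$knownContextWindow = 8192;

// Guard against tokenizers that report an implausibly large model_max_length
$effectiveMaxLength = min($tokenizer->modelMaxLength, $knownContextWindow);
```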
Encoding Text
The encode() method tokenizes text and returns an Encoding object containing the token IDs, tokens, and type IDs.
```php
$encoding = $tokenizer->encode('The quick brown fox jumps over the lazy dog.');
```
The Encoding Object
```php
$encoding->ids;     // int[]    - Token IDs: [101, 1996, 4248, 2829, 4419, ...]
$encoding->tokens;  // string[] - Tokens: ['[CLS]', 'the', 'quick', 'brown', ...]
$encoding->typeIds; // int[]    - Segment IDs for sentence pairs: [0, 0, 0, ...]
```
Encoding Options
```php
$encoding = $tokenizer->encode(
    text: 'First sentence.',
    textPair: 'Second sentence.', // Optional second text for pair encoding
    addSpecialTokens: true        // Whether to add [CLS], [SEP], etc. (default: true)
);
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `text` | `string` | — | The primary text to tokenize |
| `textPair` | `?string` | `null` | Optional second text for sequence pair tasks (e.g., question-answering) |
| `addSpecialTokens` | `bool` | `true` | Whether to add model-specific special tokens (like `[CLS]`, `[SEP]`) |
Sentence Pairs
For tasks involving two text sequences (like question-answering or natural language inference), pass both texts:
```php
$encoding = $tokenizer->encode(
    text: 'What is the capital of France?',
    textPair: 'Paris is the capital of France.'
);

// tokens:  ['[CLS]', 'what', 'is', 'the', 'capital', 'of', 'france', '?', '[SEP]',
//           'paris', 'is', 'the', 'capital', 'of', 'france', '.', '[SEP]']
// typeIds: [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```
The typeIds distinguish between the first sequence (0) and the second sequence (1), which many models use during attention computation.
Decoding Tokens
Convert token IDs back to human-readable text:
```php
$text = $tokenizer->decode([101, 7592, 1010, 2129, 2024, 2017, 1029, 102]);
// "hello, how are you?"
```
Decoding Options
```php
$text = $tokenizer->decode(
    ids: $encoding->ids,
    skipSpecialTokens: true, // Remove [CLS], [SEP], etc. (default: true)
    cleanup: null            // Override cleanup behavior (default: use model config)
);
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `ids` | `int[]` | — | Array of token IDs to decode |
| `skipSpecialTokens` | `bool` | `true` | Whether to exclude special tokens from the output |
| `cleanup` | `?bool` | `null` | Whether to clean up tokenization artifacts (extra spaces). Uses the model's config when `null` |
Cleanup Behavior
The cleanup parameter controls whether tokenization artifacts are cleaned:
```php
// With cleanup (default when the model config says so)
$tokenizer->decode($ids, cleanup: true);  // "hello, how are you?"

// Without cleanup
$tokenizer->decode($ids, cleanup: false); // "hello , how are you ?"
```
When cleanup is null, the library respects the clean_up_tokenization_spaces setting from the model's configuration.
Custom Tokenizers with the Builder
For advanced use cases, build tokenizers from scratch using the fluent builder API:
```php
use Codewithkyrian\Tokenizers\Tokenizer;
use Codewithkyrian\Tokenizers\Models\WordPieceModel;
use Codewithkyrian\Tokenizers\Normalizers\LowercaseNormalizer;
use Codewithkyrian\Tokenizers\PreTokenizers\WhitespacePreTokenizer;
use Codewithkyrian\Tokenizers\PostProcessors\BertPostProcessor;
use Codewithkyrian\Tokenizers\Decoders\WordPieceDecoder;

$vocab = ['[UNK]' => 0, '[CLS]' => 1, '[SEP]' => 2, 'hello' => 3, 'world' => 4, ...];

$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withNormalizer(new LowercaseNormalizer())
    ->withPreTokenizer(new WhitespacePreTokenizer())
    ->withPostProcessor(new BertPostProcessor('[CLS]', '[SEP]'))
    ->withDecoder(new WordPieceDecoder())
    ->withSpecialTokens(['[UNK]', '[CLS]', '[SEP]', '[PAD]', '[MASK]'])
    ->withConfig('model_max_length', 512)
    ->withConfig('clean_up_tokenization_spaces', true)
    ->build();
```
Builder Methods
| Method | Description |
|---|---|
| `withModel(ModelInterface $model)` | **Required.** Set the tokenization model (BPE, WordPiece, Unigram) |
| `withNormalizer(NormalizerInterface $normalizer)` | Set the text normalizer. Defaults to `PassThroughNormalizer` |
| `withPreTokenizer(PreTokenizerInterface $preTokenizer)` | Set the pre-tokenizer. Defaults to `IdentityPreTokenizer` |
| `withPostProcessor(PostProcessorInterface $postProcessor)` | Set the post-processor. Defaults to `DefaultPostProcessor` |
| `withDecoder(DecoderInterface $decoder)` | Set the decoder. Defaults to `FuseDecoder` |
| `withAddedTokens(array $tokens)` | Add extra tokens to the vocabulary |
| `withSpecialTokens(array $tokens)` | Define special tokens (skipped during decode by default) |
| `withConfig(string $key, mixed $value)` | Set a configuration value (see common keys below) |
| `build()` | Build and return the `Tokenizer` instance |
Common config keys for withConfig():
- `'model_max_length'` — Maximum sequence length
- `'remove_space'` — Remove leading/trailing spaces before normalization
- `'do_lowercase_and_remove_accent'` — Lowercase and strip accents
- `'clean_up_tokenization_spaces'` — Clean up spaces during decoding
The Tokenization Pipeline
Understanding the tokenization pipeline helps when debugging or customizing behavior. Each input text passes through these stages:
┌─────────────────────────────────────────────────────────────────────┐
│ Input Text │
│ "Hello, how are you doing?" │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 1. Normalization │
│ • Unicode normalization (NFC, NFKC, NFD, NFKD) │
│ • Lowercase transformation │
│ • Accent stripping │
│ • Control character removal │
│ │
│ → "hello, how are you doing?" │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 2. Pre-tokenization │
│ • Split on whitespace and/or punctuation │
│ • Identify word boundaries │
│ │
│ → ["hello", ",", "how", "are", "you", "doing", "?"] │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 3. Model Tokenization │
│ • BPE: Byte-Pair Encoding merges │
│ • WordPiece: Greedy longest-match-first │
│ • Unigram: Probabilistic subword selection │
│ │
│ → ["hello", ",", "how", "are", "you", "do", "##ing", │
│ "?"] │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 4. Post-processing │
│ • Add special tokens ([CLS], [SEP], <s>, </s>, etc.) │
│ • Generate token type IDs for sentence pairs │
│ │
│ → ["[CLS]", "hello", ",", "how", "are", "you", "do", │
│ "##ing", "?", "[SEP]"] │
└──────────────────────────┬──────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ 5. ID Mapping │
│ • Convert tokens to numerical IDs using vocabulary │
│ │
│ → [101, 7592, 1010, 2129, 2024, 2017, 2079, 2075, │
│ 1029, 102] │
└─────────────────────────────────────────────────────────────────────┘
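To see the pipeline end to end, encode the example sentence and inspect the resulting `Encoding` object: `tokens` reflects stages 1-4 and `ids` reflects stage 5. The output comments assume `bert-base-uncased`, matching the values in the diagram above:

```php
use Codewithkyrian\Tokenizers\Tokenizer;

$tokenizer = Tokenizer::fromHub('bert-base-uncased');
$encoding = $tokenizer->encode('Hello, how are you doing?');

// Stages 1-4: normalized, pre-tokenized, model-tokenized, post-processed tokens
echo implode(' ', $encoding->tokens);
// [CLS] hello , how are you do ##ing ? [SEP]

// Stage 5: tokens mapped to vocabulary IDs
echo implode(', ', $encoding->ids);
// 101, 7592, 1010, 2129, 2024, 2017, 2079, 2075, 1029, 102
```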
Components Reference
Normalizers
Normalizers clean and standardize input text before tokenization.
| Normalizer | Description |
|---|---|
| `BertNormalizer` | BERT-style: clean text, handle Chinese characters, lowercase, strip accents |
| `LowercaseNormalizer` | Convert all characters to lowercase |
| `NFCNormalizer` | Unicode NFC normalization |
| `NFKCNormalizer` | Unicode NFKC normalization |
| `NFKDNormalizer` | Unicode NFKD normalization |
| `StripNormalizer` | Strip leading/trailing whitespace |
| `StripAccentsNormalizer` | Remove accent marks from characters |
| `ReplaceNormalizer` | Replace patterns or strings |
| `PrependNormalizer` | Prepend a string to the input |
| `PrecompiledNormalizer` | Use precompiled normalization rules (for SentencePiece models) |
| `NormalizerSequence` | Chain multiple normalizers together |
| `PassThroughNormalizer` | No-op, passes text through unchanged |
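As an illustration of chaining, here is a minimal sketch that runs several normalizers in sequence and attaches them via the builder. It assumes `NormalizerSequence` accepts an array of normalizers and that `StripNormalizer` and `StripAccentsNormalizer` take no constructor arguments, neither of which is shown in the examples above:

```php
use Codewithkyrian\Tokenizers\Tokenizer;
use Codewithkyrian\Tokenizers\Models\WordPieceModel;
use Codewithkyrian\Tokenizers\Normalizers\LowercaseNormalizer;
use Codewithkyrian\Tokenizers\Normalizers\NormalizerSequence;
use Codewithkyrian\Tokenizers\Normalizers\StripAccentsNormalizer;
use Codewithkyrian\Tokenizers\Normalizers\StripNormalizer;

// Assumption: NormalizerSequence takes an array of normalizers to run in order
$normalizer = new NormalizerSequence([
    new StripNormalizer(),
    new LowercaseNormalizer(),
    new StripAccentsNormalizer(),
]);

$vocab = ['[UNK]' => 0, 'hello' => 1, 'world' => 2]; // placeholder vocabulary

$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withNormalizer($normalizer)
    ->build();
```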
Pre-tokenizers
Pre-tokenizers split text into smaller chunks before subword tokenization.
| Pre-tokenizer | Description |
|---|---|
| `BertPreTokenizer` | Split on whitespace and punctuation (BERT-style) |
| `ByteLevelPreTokenizer` | Convert to byte-level representation (GPT-2 style) |
| `WhitespacePreTokenizer` | Split on whitespace characters |
| `WhitespaceSplit` | Split only on whitespace, keep punctuation attached |
| `MetaspacePreTokenizer` | Replace spaces with ▁ (SentencePiece style) |
| `PunctuationPreTokenizer` | Split on punctuation characters |
| `DigitsPreTokenizer` | Isolate digit sequences |
| `SplitPreTokenizer` | Split using custom regex patterns |
| `PreTokenizerSequence` | Chain multiple pre-tokenizers together |
| `IdentityPreTokenizer` | No-op, returns text unchanged |
Models
Models perform the core subword tokenization algorithm.
| Model | Description |
|---|---|
| `BPEModel` | Byte-Pair Encoding — iteratively merges the most frequent pairs |
| `WordPieceModel` | Greedy longest-match-first subword tokenization (BERT) |
| `UnigramModel` | Probabilistic subword selection (SentencePiece) |
| `FallbackModel` | Simple vocabulary lookup with unknown-token fallback |
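To observe a model's behavior in isolation, you can build a tokenizer around a tiny vocabulary. The sketch below uses only the builder API shown earlier; the commented output is illustrative of WordPiece's greedy longest-match strategy rather than verified output:

```php
use Codewithkyrian\Tokenizers\Tokenizer;
use Codewithkyrian\Tokenizers\Models\WordPieceModel;
use Codewithkyrian\Tokenizers\PreTokenizers\WhitespacePreTokenizer;

// Tiny vocabulary for illustration; real models ship tens of thousands of entries
$vocab = ['[UNK]' => 0, 'do' => 1, '##ing' => 2, 'you' => 3];

$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withPreTokenizer(new WhitespacePreTokenizer())
    ->build();

$encoding = $tokenizer->encode('you doing', addSpecialTokens: false);
// Illustrative: WordPiece greedily matches "do", then continues with "##ing"
// tokens: ['you', 'do', '##ing']   ids: [3, 1, 2]
```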
Post-processors
Post-processors add special tokens and structure to the tokenized output.
| Post-processor | Description |
|---|---|
| `BertPostProcessor` | Add `[CLS]` and `[SEP]` tokens |
| `RobertaPostProcessor` | Add `<s>` and `</s>` tokens with spacing |
| `TemplatePostProcessor` | Flexible template-based token insertion |
| `ByteLevelPostProcessor` | Handle byte-level special tokens |
| `PostProcessorSequence` | Chain multiple post-processors |
| `DefaultPostProcessor` | Minimal processing, no tokens added |
Decoders
Decoders convert tokens back to readable text.
| Decoder | Description |
|---|---|
| `ByteLevelDecoder` | Decode byte-level tokens back to UTF-8 |
| `WordPieceDecoder` | Handle `##` continuation prefixes |
| `MetaspaceDecoder` | Convert ▁ back to spaces |
| `BPEDecoder` | Handle BPE-specific suffixes and spaces |
| `CTCDecoder` | Decode CTC (Connectionist Temporal Classification) output |
| `FuseDecoder` | Simply join tokens with an optional separator |
| `ReplaceDecoder` | Replace specific patterns during decode |
| `StripDecoder` | Strip specific characters |
| `ByteFallbackDecoder` | Handle byte fallback tokens (e.g., `<0x00>`) |
| `DecoderSequence` | Chain multiple decoders together |
Extending the Library
All components implement simple interfaces that you can extend:
```php
use Codewithkyrian\Tokenizers\Contracts\NormalizerInterface;

class CustomNormalizer implements NormalizerInterface
{
    public function normalize(string $text): string
    {
        // Your custom normalization logic; return the transformed text
        return $text;
    }
}
```
Available interfaces:
- `NormalizerInterface` — Text normalization
- `PreTokenizerInterface` — Pre-tokenization splitting
- `ModelInterface` — Core tokenization algorithm
- `PostProcessorInterface` — Post-processing and special tokens
- `DecoderInterface` — Token-to-text conversion
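Once defined, a custom component plugs into the builder just like the built-in ones. A minimal sketch reusing the `CustomNormalizer` skeleton above (the vocabulary is a placeholder):

```php
use Codewithkyrian\Tokenizers\Tokenizer;
use Codewithkyrian\Tokenizers\Models\WordPieceModel;

$vocab = ['[UNK]' => 0, 'hello' => 1, 'world' => 2]; // placeholder vocabulary

// Wire the custom normalizer into a tokenizer via the builder
$tokenizer = Tokenizer::builder()
    ->withModel(new WordPieceModel($vocab, '[UNK]'))
    ->withNormalizer(new CustomNormalizer())
    ->build();
```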
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
```bash
# Clone the repository
git clone https://github.com/codewithkyrian/tokenizers-php.git
cd tokenizers-php

# Install dependencies
composer install

# Run tests
vendor/bin/pest
```
License
This project is licensed under the MIT License - see the LICENSE file for details.
Credits
- Kyrian Obikwelu — Creator and maintainer
- Hugging Face — Tokenizers specification and model hosting
- All contributors
Made with ❤️ for the PHP community