purewater2011 / tiktoken-php7
PHP 7.4+ compatible version of tiktoken - OpenAI's tiktoken tokenizer ported to PHP
Requires
- php: ^7.4
- symfony/service-contracts: ^2.5 || ^3.0
Requires (Dev)
- doctrine/coding-standard: ^9.0 || ^10.0 || ^11.0
- mikey179/vfsstream: ^1.6.11
- phpbench/phpbench: ^1.0
- phpunit/phpunit: ^9.5 || ^10.0
- psalm/plugin-phpunit: ^0.18.4
- vimeo/psalm: ^4.30 || ^5.0
Suggests
- ext-ffi: To allow use of LibEncoder
README
A PHP 7.4+ compatible port of OpenAI's tiktoken tokenizer.
This package is a backward-compatible fork that brings tiktoken functionality to PHP 7.4+, making it accessible to projects that haven't yet migrated to PHP 8.1+.
Features
- ✅ PHP 7.4+ compatibility (downgraded from PHP 8.1+)
- ✅ Support for all OpenAI models (GPT-3.5, GPT-4, GPT-4o, etc.)
- ✅ Multiple encoding formats (r50k_base, p50k_base, cl100k_base, o200k_base)
- ✅ Efficient caching system
- ✅ Optional FFI-based native library support for better performance
- ✅ Full compatibility with original tiktoken API
Installation
```bash
composer require purewater2011/tiktoken-php7
```
Requirements
- PHP 7.4 or higher
- ext-ffi (optional, for LibEncoder performance boost)
Quick Start
```php
<?php

use Purewater2011\TiktokenPhp7\EncoderProvider;

$provider = new EncoderProvider();

// Get encoder for a specific model
$encoder = $provider->getForModel('gpt-3.5-turbo');

$tokens = $encoder->encode('Hello, world!');
print_r($tokens); // Output: [9906, 11, 1917, 0]

// Decode tokens back to text
$text = $encoder->decode($tokens);
echo $text; // Output: "Hello, world!"

// Get encoder by encoding name
$encoder = $provider->get('cl100k_base');
$tokens = $encoder->encode('Hello, world!');
print_r($tokens); // Output: [9906, 11, 1917, 0]
```
Supported Models
This package supports all current OpenAI models:
| Model Family | Encoding |
|---|---|
| GPT-4o, GPT-4o mini | o200k_base |
| GPT-4, GPT-3.5-turbo | cl100k_base |
| GPT-3 (Davinci, Curie, etc.) | p50k_base |
| GPT-3 (Ada, Babbage) | r50k_base |
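As a quick sketch, model names are passed straight to `getForModel()`, and an encoding can also be requested directly by its name with `get()` (the model and encoding names below are taken from the table above):

```php
use Purewater2011\TiktokenPhp7\EncoderProvider;

$provider = new EncoderProvider();

// Newer models resolve to their encodings automatically.
$encoder4o = $provider->getForModel('gpt-4o'); // o200k_base
$encoder4  = $provider->getForModel('gpt-4');  // cl100k_base

// Older encodings can also be requested directly by name.
$p50k = $provider->get('p50k_base');
$r50k = $provider->get('r50k_base');
```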
Advanced Usage
Encoding in Chunks
For processing large texts, you can encode in chunks:
```php
$encoder = $provider->getForModel('gpt-4');

$chunks = $encoder->encodeInChunks($largeText, 1000); // Max 1000 tokens per chunk

foreach ($chunks as $chunk) {
    echo "Chunk has " . count($chunk) . " tokens\n";
}
```
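Each chunk is an array of token IDs, so it can be decoded back to text. A minimal sketch, continuing the example above, for feeding oversized input to an API piece by piece:

```php
foreach ($encoder->encodeInChunks($largeText, 1000) as $i => $chunk) {
    $piece = $encoder->decode($chunk); // text covered by this chunk only
    // e.g. send $piece to the API or index it separately
    echo sprintf("Chunk %d: %d tokens\n", $i + 1, count($chunk));
}
```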
Custom Cache Directory
By default, vocabulary files are cached in the system temp directory. You can customize this:
```php
// Via environment variable
putenv('TIKTOKEN_CACHE_DIR=/path/to/cache');

// Or via method call
$provider = new EncoderProvider();
$provider->setVocabCache('/path/to/cache');
```
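A hedged sketch for making sure the chosen cache directory exists before handing it to the provider (the directory name here is just an example, not a package default):

```php
use Purewater2011\TiktokenPhp7\EncoderProvider;

$cacheDir = getenv('TIKTOKEN_CACHE_DIR') ?: sys_get_temp_dir() . '/tiktoken-cache';

// Create the directory if it does not exist yet.
if (!is_dir($cacheDir) && !@mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
    throw new RuntimeException("Unable to create cache directory: {$cacheDir}");
}

$provider = new EncoderProvider();
$provider->setVocabCache($cacheDir);
```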
Using Custom Vocabulary Loader
```php
use Purewater2011\TiktokenPhp7\Vocab\Loader\DefaultVocabLoader;

$provider = new EncoderProvider();
$provider->setVocabLoader(new DefaultVocabLoader('/custom/cache/path'));
```
Performance Optimization with LibEncoder (Experimental)
For better performance with large texts, you can use the FFI-based LibEncoder:
```php
use Purewater2011\TiktokenPhp7\Encoder\LibEncoder;
use Purewater2011\TiktokenPhp7\EncoderProvider;

// Initialize the library path
LibEncoder::init('/path/to/libtiktoken_php.so');

// Use LibEncoder for better performance
$provider = new EncoderProvider(true);
$encoder = $provider->getForModel('gpt-4');
```
Building the Native Library
If you want to use LibEncoder, you need to build the Rust library:
Requirements
- Rust >= 1.85
Build Steps
```bash
git clone https://github.com/purewater2011/tiktoken-php7.git
cd tiktoken-php7
cargo build --release
```
Copy the appropriate binary:

- `libtiktoken_php.so` (Linux)
- `libtiktoken_php.dylib` (macOS)
- `tiktoken_php.dll` (Windows)
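Once the library is built, you can pick the right binary for the current platform at runtime and fall back to the pure-PHP encoder when FFI or the file is unavailable. A sketch, assuming the default `target/release` output directory of `cargo build --release`:

```php
use Purewater2011\TiktokenPhp7\Encoder\LibEncoder;
use Purewater2011\TiktokenPhp7\EncoderProvider;

// Pick the binary name that matches the current OS family.
$binary = [
    'Windows' => 'tiktoken_php.dll',
    'Darwin'  => 'libtiktoken_php.dylib',
][PHP_OS_FAMILY] ?? 'libtiktoken_php.so';

$libPath = '/path/to/tiktoken-php7/target/release/' . $binary;

if (extension_loaded('ffi') && is_file($libPath)) {
    LibEncoder::init($libPath);
    $provider = new EncoderProvider(true);  // FFI-backed encoder
} else {
    $provider = new EncoderProvider();      // pure-PHP fallback
}
```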
Token Counting Examples
```php
$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo');

// Count tokens in a message
$message = "How many tokens is this?";
$tokenCount = count($encoder->encode($message));
echo "Token count: $tokenCount\n";

// Useful for staying within API limits
$maxTokens = 4096;
$prompt = "Your long prompt here...";
$promptTokens = count($encoder->encode($prompt));

if ($promptTokens > $maxTokens) {
    echo "Prompt too long! Tokens: $promptTokens, Max: $maxTokens\n";
}
```
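Building on the counting example, a sketch that truncates an over-long prompt at a token boundary, using only `encode()` and `decode()` (pick a limit that leaves room for the completion):

```php
$limit  = 4096;
$tokens = $encoder->encode($prompt);

if (count($tokens) > $limit) {
    $tokens = array_slice($tokens, 0, $limit);
    $prompt = $encoder->decode($tokens); // now guaranteed to fit the limit
}
```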
Differences from Original
This package maintains full API compatibility with the original yethee/tiktoken, but with these key changes:

- PHP 7.4+ compatibility instead of PHP 8.1+
- Updated namespace: `Purewater2011\TiktokenPhp7` instead of `Yethee\Tiktoken`
- Compatible dependency versions for PHP 7.4
- All modern PHP 8.1+ syntax converted to PHP 7.4 compatible code
Migration Guide
If you're migrating from yethee/tiktoken:

1. Update your composer requirement:

```bash
composer remove yethee/tiktoken
composer require purewater2011/tiktoken-php7
```

2. Update namespace imports:

```php
// Old
use Yethee\Tiktoken\EncoderProvider;

// New
use Purewater2011\TiktokenPhp7\EncoderProvider;
```

3. All other usage remains identical!
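If you need to migrate a large codebase gradually, a hypothetical shim using PHP's `class_alias()` can keep old imports resolving while you update them; this is an assumption for illustration, not something the package ships:

```php
// bootstrap.php (hypothetical): keep legacy imports working during migration.
class_alias(
    \Purewater2011\TiktokenPhp7\EncoderProvider::class,
    'Yethee\\Tiktoken\\EncoderProvider'
);
```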
Limitations
- GPT-2 encoding is not supported
- Special tokens (like `<|endofprompt|>`) are not supported
- LibEncoder::encodeInChunks() method is not yet implemented
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Credits
- Original tiktoken implementation by OpenAI
- PHP port by yethee
- PHP 7.4 compatibility by purewater2011