yethee / tiktoken
PHP version of tiktoken
Requires
- php: ^8.1
- symfony/service-contracts: ^2.5 || ^3.0
Requires (Dev)
- doctrine/coding-standard: ^12.0
- mikey179/vfsstream: ^1.6.11
- phpbench/phpbench: ^1.2
- phpunit/phpunit: ^10.5.20
- psalm/plugin-phpunit: ^0.19.0
- vimeo/psalm: 5.26.1
README
This is a PHP port of tiktoken.
Installation
$ composer require yethee/tiktoken
Usage
use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();

$encoder = $provider->getForModel('gpt-3.5-turbo-0301');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [9906, 1917, 0]

$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [15496, 995, 0]
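A common reason to encode text is to count its tokens. The following is a minimal sketch of that, not part of the library: `countTokens()` is a hypothetical helper that accepts any callable returning an array of token IDs, so it can be tried without the package installed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not provided by yethee/tiktoken).
 * Counts tokens using any callable that maps a string to an array of token IDs.
 *
 * @param callable(string): array<int> $encode
 */
function countTokens(callable $encode, string $text): int
{
    return count($encode($text));
}

// Assumed usage with the library installed (PHP 8.1 first-class callable syntax):
// $encoder = (new Yethee\Tiktoken\EncoderProvider())->getForModel('gpt-3.5-turbo-0301');
// echo countTokens($encoder->encode(...), 'Hello world!');
```

Taking a callable rather than the encoder object keeps the helper independent of the library's class names, which are not spelled out in this README.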
Cache
The encoder uses external vocabularies, so caching is enabled by default to avoid performance issues.
By default, the cache is stored in the system directory for temporary files.
You can override the cache directory via the environment variable TIKTOKEN_CACHE_DIR or with EncoderProvider::setVocabCache():
use Yethee\Tiktoken\EncoderProvider;

$encProvider = new EncoderProvider();
$encProvider->setVocabCache('/path/to/cache');

// Using the provider
Disable cache
If needed, you can disable the cache in one of the following ways:
- Set the environment variable TIKTOKEN_CACHE_DIR to an empty string.
- Programmatically:
use Yethee\Tiktoken\EncoderProvider;

$encProvider = new EncoderProvider();
$encProvider->setVocabCache(null); // disable the cache
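The cache rules described in this section (unset variable: temporary directory; empty string: cache disabled; any other value: that directory) can be sketched as a small function. This is an illustration of the documented behavior under those assumptions, not the library's own implementation; `resolveCacheDir()` is a hypothetical name.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper mirroring the cache rules described in this README.
 * Returns a directory path, or null when caching should be disabled.
 */
function resolveCacheDir(string|false $envValue): ?string
{
    if ($envValue === false) {
        // Variable is not set: fall back to the system temporary directory.
        return sys_get_temp_dir();
    }

    if ($envValue === '') {
        // Empty string: caching is disabled.
        return null;
    }

    // Any other value is used as the cache directory.
    return $envValue;
}

// getenv() returns false when the variable is not set.
$cacheDir = resolveCacheDir(getenv('TIKTOKEN_CACHE_DIR'));

// Assumed usage with the library installed:
// $encProvider = new Yethee\Tiktoken\EncoderProvider();
// $encProvider->setVocabCache($cacheDir);
```

Making the decision explicit like this may be useful when the cache location comes from application configuration rather than from the environment.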
Limitations
- Encoding for GPT-2 is not supported.
- Special tokens (like <|endofprompt|>) are not supported.