yethee/tiktoken

PHP version of tiktoken

0.4.0 2024-04-30 14:30 UTC


Last update: 2024-05-02 08:00:54 UTC



This is a PHP port of tiktoken, OpenAI's tokenizer library.

Installation

$ composer require yethee/tiktoken

Usage

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();

$encoder = $provider->getForModel('gpt-3.5-turbo-0301');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [9906, 1917, 0]

$encoder = $provider->get('p50k_base');
$tokens = $encoder->encode('Hello world!');
print_r($tokens);
// OUT: [15496, 995, 0]
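
A common use of the token list is checking the length of a text against a limit. The following is a minimal sketch that uses only the calls shown above; the `$budget` value is a hypothetical limit chosen for illustration, and the `require` line assumes Composer's default autoloader location.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // assumes Composer's default autoloader path

use Yethee\Tiktoken\EncoderProvider;

$provider = new EncoderProvider();
$encoder = $provider->getForModel('gpt-3.5-turbo-0301');

$text = 'Hello world!';

// encode() returns an array of token ids, so its length is the token count.
$tokenCount = count($encoder->encode($text));

$budget = 4096; // hypothetical limit, adjust to your model

if ($tokenCount > $budget) {
    echo "Text is too long: {$tokenCount} tokens\n";
} else {
    echo "Text fits: {$tokenCount} tokens\n";
}
```

For the text above this reports 3 tokens, matching the encoded output shown earlier.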

Cache

The encoder uses external vocabularies, so caching is enabled by default to avoid performance issues.

By default, the cache is stored in the directory for temporary files. You can override the cache directory via the environment variable TIKTOKEN_CACHE_DIR or by calling EncoderProvider::setVocabCache():

use Yethee\Tiktoken\EncoderProvider;

$encProvider = new EncoderProvider();
$encProvider->setVocabCache('/path/to/cache');

// Using the provider

Disable cache

If you need to, you can disable the cache in one of the following ways:

  • Set an empty string for the environment variable TIKTOKEN_CACHE_DIR.
  • Programmatically:
use Yethee\Tiktoken\EncoderProvider;

$encProvider = new EncoderProvider();
$encProvider->setVocabCache(null); // disable the cache
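
The rules for TIKTOKEN_CACHE_DIR described above can be summarized in a small sketch. The function `resolveCacheDir()` is a hypothetical helper written for illustration; it is not part of this package and does not claim to reproduce its internals. In particular, returning `sys_get_temp_dir()` as the default is an assumption based on the description above.

```php
<?php

/**
 * Hypothetical helper, NOT part of yethee/tiktoken.
 * Maps the raw value of TIKTOKEN_CACHE_DIR to a cache directory,
 * following the rules described in this README.
 *
 * @param string|false $envValue Result of getenv('TIKTOKEN_CACHE_DIR')
 *
 * @return string|null Cache directory, or null when the cache is disabled
 */
function resolveCacheDir($envValue)
{
    if ($envValue === false) {
        // Variable is not set: fall back to the directory for temporary files (assumed default).
        return sys_get_temp_dir();
    }

    if ($envValue === '') {
        // Empty string: the cache is disabled.
        return null;
    }

    // Any other value is used as the cache directory.
    return $envValue;
}

$cacheDir = resolveCacheDir(getenv('TIKTOKEN_CACHE_DIR'));
echo $cacheDir === null ? "cache disabled\n" : "cache directory: {$cacheDir}\n";
```

A value resolved this way could then be passed to `EncoderProvider::setVocabCache()`, which accepts either a path or `null` as shown above.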

Limitations

  • Encoding for GPT-2 is not supported.
  • Special tokens (like <|endofprompt|>) are not supported.

License

MIT