aisk/mistral-tokenizer

PHP implementation of Mistral's tokenizers

0.2.0 2025-04-22 16:45 UTC

This package is auto-updated.

Last update: 2025-04-23 16:47:08 UTC


README

A PHP implementation of the Mistral tokenizers, focusing on the Tekken tokenizer.

Developed by Aisk.

Installation

composer require aisk/mistral-tokenizer

Features

  • Text tokenization and detokenization
  • Token counting
  • Batch processing
  • Special token handling
  • Unicode character support

Usage

Basic Usage

<?php

require_once 'vendor/autoload.php';

use Aisk\Tokenizer\TokenizerFactory;

// Create a tokenizer factory
$factory = new TokenizerFactory();

// Get the Tekken tokenizer
$tokenizer = $factory->getTekkenTokenizer('240911');

// Encode a string
$text = 'Hello, world!';
$tokens = $tokenizer->encode($text);
echo "Encoded '{$text}' to " . count($tokens) . " tokens: " . implode(', ', $tokens) . PHP_EOL;

// Count tokens
$count = $tokenizer->countTokens($text);
echo "'{$text}' has {$count} tokens" . PHP_EOL;

// Decode tokens back to text
$decoded = $tokenizer->decode($tokens);
echo "Decoded tokens back to: '{$decoded}'" . PHP_EOL;

Available Tokenizers

  • Tekken Tokenizer: A byte-pair encoding tokenizer optimized for Mistral AI models
    • Supports both '240718' and '240911' versions

Batch Processing

<?php

$texts = [
    'Hello, world!',
    'How are you?',
    'Tokenizer test.'
];

// Encode in batch
$tokensBatch = $tokenizer->encodeBatch($texts);

// Decode in batch
$decodedTexts = $tokenizer->decodeBatch($tokensBatch);

Command Line Tool

The package includes a command line tool for tokenizing text:

# Tokenize text from the command line
./vendor/bin/tokenize "Hello, world!"

# Specify a tokenizer version
./vendor/bin/tokenize -v 240718 "Hello, world!"

# Tokenize text from a file
./vendor/bin/tokenize -f input.txt

# Only show the token count
./vendor/bin/tokenize -c "Hello, world!"

Implementation Notes

This implementation provides a simplified tokenizer for Mistral models. The current version uses a character-based tokenization for testing purposes. In a production environment, you would need to implement the full BPE algorithm with the model's merges.

Testing

Run the tests with PHPUnit:

./vendor/bin/phpunit

License

MIT