textualization/ropherta-tokenizer

GPT3Tokenizer (BPE) with the RoBERTa-base vocabulary.


v0.0.7 2024-02-21 00:01 UTC

This package is auto-updated.

Last update: 2024-11-21 01:40:14 UTC


README

This package is a thin wrapper around GPT3Tokenizer that uses the HuggingFace RoBERTa vocabulary and merge files.

See the GPT3Tokenizer documentation for example usage (or the generated test case under tests/).

XLM Tokenizer

To use the multilingual version, the SentencePiece dependency must be initialized and an additional model file must be downloaded:

composer exec -- php -r "require 'vendor/autoload.php'; Textualization\SentencePiece\Vendor::check();"
composer exec -- php -r "require 'vendor/autoload.php'; Textualization\Ropherta\Tokenizer\Vendor::check();"

Sponsors

We thank our sponsor: