textualization/ropherta-tokenizer

GPT3Tokenizer (BPE) with Roberta-base vocabulary.

v0.0.7 2024-02-21 00:01 UTC

This package is auto-updated.

Last update: 2024-04-21 00:22:54 UTC


README

This package is a thin wrapper around GPT3Tokenizer that uses the Hugging Face RoBERTa vocabulary and merges files.

See the GPT3Tokenizer documentation for example usage (or the generated test case under tests/).
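To make the role of the vocabulary and merges files concrete, here is a minimal sketch of how BPE applies ranked merge rules and then looks pieces up in a vocabulary. It is a toy illustration only: `toy_bpe`, `$ranks`, and `$vocab` are hypothetical names with made-up data, not this package's API or the real RoBERTa files, and it merges one pair occurrence per pass for simplicity.

```php
<?php
// Toy byte-pair-encoding (BPE) sketch. The merge ranks and vocabulary below
// are invented for illustration; they are NOT the RoBERTa vocab/merges files.

/**
 * Repeatedly merge the best-ranked adjacent pair until no rule applies.
 *
 * @param string             $word  non-empty word to split
 * @param array<string,int>  $ranks "left right" => priority (lower merges first)
 * @return string[] resulting subword pieces
 */
function toy_bpe(string $word, array $ranks): array
{
    $pieces = str_split($word); // start from single bytes

    while (count($pieces) > 1) {
        // Find the adjacent pair with the lowest rank (leftmost wins ties).
        $bestRank = PHP_INT_MAX;
        $bestIndex = -1;
        for ($i = 0; $i < count($pieces) - 1; $i++) {
            $key = $pieces[$i] . ' ' . $pieces[$i + 1];
            if (isset($ranks[$key]) && $ranks[$key] < $bestRank) {
                $bestRank = $ranks[$key];
                $bestIndex = $i;
            }
        }
        if ($bestIndex < 0) {
            break; // no merge rule matches any adjacent pair
        }
        // Replace the two pieces with their concatenation.
        $merged = $pieces[$bestIndex] . $pieces[$bestIndex + 1];
        array_splice($pieces, $bestIndex, 2, [$merged]);
    }

    return $pieces;
}

// Hypothetical merge rules (as a merges file would list them, in rank order)
// and a hypothetical piece => id vocabulary.
$ranks = ['l o' => 0, 'lo w' => 1, 'e r' => 2];
$vocab = ['low' => 10, 'er' => 11];

$pieces = toy_bpe('lower', $ranks);
$ids = array_map(fn (string $p): int => $vocab[$p], $pieces);
```

With these toy rules, `lower` is split into the pieces `low` and `er`, which map to the ids 10 and 11. The real tokenizer follows the same idea, but with the full RoBERTa merge list and vocabulary and with byte-level pre-processing that this sketch omits.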

XLM Tokenizer

To use the multilingual version, the SentencePiece dependency needs to be initialized and an additional model file needs to be downloaded:

composer exec -- php -r "require 'vendor/autoload.php'; Textualization\SentencePiece\Vendor::check();"
composer exec -- php -r "require 'vendor/autoload.php'; Textualization\Ropherta\Tokenizer\Vendor::check();"

Sponsors

We thank our sponsor:

https://evoludata.com/display208