textualization/sentencepiece

Google SentencePiece bindings using FFI and a C adapter.

v0.0.3 2024-02-14 16:01 UTC

Last update: 2024-05-14 16:30:03 UTC


README

This is a minimal wrapper on top of Google SentencePiece to enable executing the XLMRobertaTokenizer encode method.

It requires the SentencePiece dynamic library built with additional C wrapper functions; see the fork at [https://github.com/textualization/sentencepiece/].

A prebuilt binary of the library can be downloaded by running:

composer exec -- php -r "require 'vendor/autoload.php'; Textualization\SentencePiece\Vendor::check();"

Depending on your platform and GLIBC version, however, you might need to compile the library yourself and copy it to vendor/textualization/sentencepiece/lib (create the folder if it doesn't exist). See src/Vendor.php for details.
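
If you do compile the library yourself, the copy step could be scripted along these lines. This is only a sketch: install_lib is a hypothetical helper, not part of this package, and the file name of the library you built is something you need to supply, since this README does not state it.

```shell
# install_lib SRC DEST_DIR
# Hypothetical helper (not part of this package): copies a self-built
# library file into the destination folder, creating the folder if needed.
install_lib() {
    src="$1"
    dest_dir="$2"

    # Refuse to continue if the built library is not where we expect it.
    if [ ! -f "$src" ]; then
        echo "Library not found: $src" >&2
        return 1
    fi

    # "create the folder if it doesn't exist"
    mkdir -p "$dest_dir"
    cp "$src" "$dest_dir/"
}

# Example call; replace the first argument with the path to your own build:
# install_lib path/to/your/built/library vendor/textualization/sentencepiece/lib
```

Checking the source file before creating the folder avoids leaving an empty lib folder behind when the build output is missing.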

Running the tests

To run the tests you'll need to install the library per the instructions above.

To fully test it, download the sentencepiece.bpe.model file and place it in tests/.
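
Before running the tests, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, assuming the library lives in the vendor/textualization/sentencepiece/lib folder named above; check_test_prereqs is a hypothetical helper, not something shipped with this package.

```shell
# check_test_prereqs [ROOT]
# Hypothetical helper (not part of this package): reports whether the
# library folder and the model file are present under ROOT (default: ".").
check_test_prereqs() {
    root="${1:-.}"
    status=0
    lib_dir="$root/vendor/textualization/sentencepiece/lib"
    model="$root/tests/sentencepiece.bpe.model"

    # The lib folder must exist and contain at least one entry.
    if [ ! -d "$lib_dir" ] || [ -z "$(ls -A "$lib_dir")" ]; then
        echo "Missing library in $lib_dir" >&2
        status=1
    fi

    # The model file is needed for the full test run.
    if [ ! -f "$model" ]; then
        echo "Missing model file $model" >&2
        status=1
    fi

    return "$status"
}

# Example call from the project root:
# check_test_prereqs . && echo "Ready to run the tests"
```

The function reports every missing piece in one pass rather than stopping at the first, so a single run tells you everything that still needs to be set up.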