codechap / context-trimmer
A tokenizer-agnostic text preprocessor to trim context for LLMs.
Requires
- php: ^8.2
Requires (Dev)
- phpunit/phpunit: ^10.0
Last update: 2025-03-31 08:34:34 UTC
README
A tokenizer-agnostic text preprocessor for trimming context in LLM applications.
Requires PHP 8.2 or higher.
This library provides functions to process, trim, and optimize text for large language model (LLM) context windows. It includes options for removing short words, stripping extraneous punctuation, and compressing whitespace.
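To make the three options concrete, here is a minimal sketch of what this kind of preprocessing can look like in plain PHP. It is not the library's implementation. The function name `sketchTrim`, the set of characters treated as extraneous, and the rule that words shorter than the minimum length are dropped are all assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of codechap/context-trimmer.
 * Illustrates stripping punctuation, dropping short words,
 * and compressing whitespace with standard PHP functions.
 */
function sketchTrim(string $text, int $minWordLength = 2): string
{
    // Assumption: anything other than letters, digits, whitespace, and
    // basic sentence punctuation counts as "extraneous" and becomes a space.
    $text = preg_replace('/[^\p{L}\p{N}\s.,?!\'-]+/u', ' ', $text) ?? $text;

    // Split on any run of whitespace, discarding empty pieces.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Assumption: keep only words with at least $minWordLength characters.
    $kept = preg_grep('/^.{' . $minWordLength . ',}$/us', $words) ?: [];

    // Rejoining with single spaces compresses all whitespace runs.
    return implode(' ', $kept);
}

echo sketchTrim("This   is a  *** test ***   of  I/O   trimming.\n\n", 2), PHP_EOL;
// prints "This is test of trimming."
```

Each step shortens the text without depending on a specific tokenizer, which is the sense in which this kind of trimming is tokenizer-agnostic.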
Installation
Install via Composer:
composer require codechap/context-trimmer:"dev-master"
Usage
Create a file (for example, run.php) with the following code to see the ContextTrimmer in action:
<?php

require_once 'vendor/autoload.php';

use codechap\ContextTrimmer\ContextTrimmer;

// Load your context from a file
$input = file_get_contents('context.txt');

// Configure and trim the input text using chained setters.
// The parentheses around `new ContextTrimmer()` are required before PHP 8.4.
$result = (new ContextTrimmer())
    ->set('removeShortWords', true)
    ->set('minWordLength', 2)
    ->set('removeExtraneous', true)
    ->set('maxTokens', 50)
    ->trim($input);

// Output the trimmed text segments as JSON
echo json_encode($result, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE);
In this example, the ContextTrimmer is configured to remove short words, strip extraneous punctuation, and limit tokens per segment (50 tokens in this case). The resulting trimmed output is returned as an array of text segments.
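The idea of splitting text into segments under a token limit can be sketched as follows. This README does not say how the library counts tokens, so the sketch assumes one token per whitespace-separated word, which is only a rough approximation. The function name `sketchSegments` is hypothetical and is not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of codechap/context-trimmer.
 * Assumption: one token is approximated by one whitespace-separated word.
 *
 * @return list<string>
 */
function sketchSegments(string $text, int $maxTokens): array
{
    if ($maxTokens < 1) {
        throw new InvalidArgumentException('maxTokens must be at least 1.');
    }

    // Split on any run of whitespace, discarding empty pieces.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // array_chunk groups the words into runs of at most $maxTokens each;
    // each run is joined back into one segment string.
    return array_map(
        fn (array $chunk): string => implode(' ', $chunk),
        array_chunk($words, $maxTokens)
    );
}

// Seven words with a limit of three yield three segments.
$segments = sketchSegments('one two three four five six seven', 3);
echo json_encode($segments, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), PHP_EOL;
```

A word count usually underestimates the token count of a real model tokenizer, so a limit chosen this way should leave some headroom below the model's actual context window.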
Running Tests
To run the tests, use:
composer test
License
This library is released under the MIT License. See the LICENSE file for details.
Contributing
Contributions and pull requests are welcome! Please follow the existing coding standards and include tests for new functionality.