ykachala / semantic-cache
Embedding-similarity response cache for LLM calls. Serve a cached answer when a new prompt is semantically close to a previous one — cutting cost and latency.
Requires
- php: >=8.3
- psr/log: ^3.0
- psr/simple-cache: ^3.0
Requires (Dev)
- friendsofphp/php-cs-fixer: ^3.0
- phpstan/phpstan: ^2.0
- phpunit/phpunit: ^11.0
Suggests
- ext-pdo: Required for the pgvector store
- predis/predis: Required for the Redis vector store
- ykachala/meter: Measure exactly how much spend the cache is saving you
This package is auto-updated.
Last update: 2026-06-02 09:07:06 UTC
README
An embedding-similarity cache for LLM responses. When a new prompt is semantically close to one you've answered before, serve the cached answer instead of paying for — and waiting on — another model call.
Why this exists (the 2026 gap)
A normal cache keys on an exact string. LLM prompts are almost never byte-identical — "How do I reset my password?" and "I forgot my password, what now?" are the same question to a user but two different cache keys. So traditional caching gives near-zero hit rate on real LLM traffic, and teams pay full price for what is effectively the same answer thousands of times.
Python has GPTCache for exactly this. PHP, as of 2026, has nothing native — despite the PHP AI ecosystem (Prism, Neuron, Laravel AI SDK) now being mature enough that cost is the live problem. Ykachala Semantic Cache fills that hole.
How it works
- Embed the incoming prompt.
- Search the vector store for the nearest previously-cached prompt.
- If cosine similarity ≥ your threshold (e.g. `0.95`), return the stored response with no LLM call.
- Otherwise call your model, then store `(embedding, prompt, response)` for next time.
A cheap exact-match tier runs first (hash lookup) so identical prompts never even pay for an embedding.
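The flow above can be sketched in a few lines of plain PHP. This is an illustrative sketch only, not the package's implementation: `TinySemanticCache`, `cosine()`, and the in-memory arrays are hypothetical stand-ins, and the brute-force scan takes the place of a real vector store.

```php
<?php
declare(strict_types=1);

/** Cosine similarity of two equal-length vectors; 0.0 if either has zero length. */
function cosine(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot   += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    return ($normA > 0.0 && $normB > 0.0) ? $dot / sqrt($normA * $normB) : 0.0;
}

/** Hypothetical stand-in for the cache; names and storage are assumptions. */
final class TinySemanticCache
{
    /** @var array<string, string> sha256(prompt) => response (exact tier) */
    private array $exact = [];

    /** @var list<array{embedding: array, prompt: string, response: string}> */
    private array $entries = [];

    public function __construct(
        private \Closure $embed,
        private float $threshold = 0.95,
    ) {
    }

    public function remember(string $prompt, callable $onMiss): string
    {
        // Tier 1: exact match via hash lookup, no embedding needed.
        $key = hash('sha256', $prompt);
        if (isset($this->exact[$key])) {
            return $this->exact[$key];
        }

        // Tier 2: embed, then scan for the nearest previously cached prompt.
        $vector = ($this->embed)($prompt);
        $best = null;
        $bestScore = -1.0;
        foreach ($this->entries as $entry) {
            $score = cosine($vector, $entry['embedding']);
            if ($score > $bestScore) {
                $best = $entry;
                $bestScore = $score;
            }
        }
        if ($best !== null && $bestScore >= $this->threshold) {
            return $best['response'];
        }

        // Miss: call the model, then store (embedding, prompt, response).
        $response = $onMiss();
        $this->exact[$key] = $response;
        $this->entries[] = ['embedding' => $vector, 'prompt' => $prompt, 'response' => $response];
        return $response;
    }
}
```

The `$embed` closure can be any function that maps a prompt to a numeric vector. The linear scan is only reasonable for small sets; a vector index would replace it at scale.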
Install
composer require ykachala/semantic-cache
Quick start
use Ykachala\SemanticCache\SemanticCache;
use Ykachala\SemanticCache\Store\PgVectorStore;

$cache = new SemanticCache(
    embedder: $yourEmbedder,         // any PHP closure/object that returns a vector
    store: new PgVectorStore($pdo),
    threshold: 0.95,                 // tune for your risk tolerance
    ttl: 3600,
);

$answer = $cache->remember($prompt, function () use ($prompt, $llm) {
    // Only runs on a miss: this is the call you're trying to avoid
    return $llm->chat($prompt);
});
Inspecting hits
$result = $cache->lookup($prompt);

if ($result->hit) {
    logger()->info('semantic cache hit', [
        'similarity' => $result->similarity,   // 0.0 – 1.0
        'matched'    => $result->matchedPrompt,
        'saved'      => $result->estimatedSaving?->format(),
    ]);
}
Tiers & safety
| Tier | Cost | When |
|---|---|---|
| Exact | hash lookup, ~0 | byte-identical prompt |
| Semantic | 1 embedding + 1 vector search | similar prompt above threshold |
| Miss | full LLM call | nothing close enough |
- Namespaces isolate caches per user/tenant/prompt-template so you never serve one user's answer to another.
- Threshold tuning trades hit-rate for correctness: `0.97+` for factual lookups, lower for chit-chat. Ship with metrics so you can tune from real traffic.
- Stampede protection: concurrent misses for the same prompt collapse to one call.
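One way to make the threshold trade-off concrete is to replay logged lookups against a few candidate thresholds. The sketch below assumes you have, for each logged lookup, the best similarity score and a manual judgement of whether the matched prompt really had the same intent. `evaluateThreshold()` and the sample data are hypothetical and not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * For one threshold, report the share of lookups that would be served from
 * cache and the share of those served answers that were judged wrong.
 *
 * @param list<array{similarity: float, sameIntent: bool}> $samples
 * @return array{hitRate: float|int, falseHitRate: float|int}
 */
function evaluateThreshold(array $samples, float $threshold): array
{
    $hits = 0;
    $falseHits = 0;
    foreach ($samples as $sample) {
        if ($sample['similarity'] >= $threshold) {
            $hits++;
            if (!$sample['sameIntent']) {
                $falseHits++;
            }
        }
    }
    $total = count($samples);

    return [
        'hitRate'      => $total > 0 ? $hits / $total : 0.0,
        'falseHitRate' => $hits > 0 ? $falseHits / $hits : 0.0,
    ];
}

// Made-up sample data, for illustration only.
$samples = [
    ['similarity' => 0.99, 'sameIntent' => true],
    ['similarity' => 0.97, 'sameIntent' => true],
    ['similarity' => 0.96, 'sameIntent' => false],
    ['similarity' => 0.93, 'sameIntent' => true],
    ['similarity' => 0.80, 'sameIntent' => false],
];

foreach ([0.92, 0.95, 0.97] as $threshold) {
    $r = evaluateThreshold($samples, $threshold);
    printf(
        "threshold %.2f: hit rate %.2f, false-hit rate %.2f\n",
        $threshold,
        $r['hitRate'],
        $r['falseHitRate']
    );
}
```

On this toy data, raising the threshold lowers the hit rate and also removes the one wrong match, which is the trade-off the bullet describes.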
Pluggable stores
InMemoryStore # tests / single process
Psr16Store # brute-force over any PSR-16 cache, good for small sets
RedisStore # Redis 8 vector sets
PgVectorStore # Postgres + pgvector, production default
QdrantStore # external vector DB at scale
Architecture
src/
├── SemanticCache.php # remember() / lookup() / put()
├── Lookup.php # result: hit, similarity, matchedPrompt, saving
├── Embedder/ # EmbedderInterface + adapters
├── Store/ # VectorStore interface + drivers
└── Similarity.php # cosine / dot-product helpers
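As a rough idea of what cosine and dot-product helpers might look like, here is a minimal sketch. The function names are assumptions, not the contents of `Similarity.php`. It also illustrates a common design choice: if vectors are scaled to unit length before they are stored, a plain dot product gives the same score as cosine similarity.

```php
<?php
declare(strict_types=1);

/** Dot product of two equal-length vectors. */
function dot(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $x) {
        $sum += $x * $b[$i];
    }
    return $sum;
}

/** Scale a vector to unit length; a zero vector is returned unchanged. */
function normalize(array $v): array
{
    $norm = sqrt(dot($v, $v));
    return $norm > 0.0 ? array_map(fn ($x) => $x / $norm, $v) : $v;
}

/** Cosine similarity computed from raw (unnormalized) vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $denominator = sqrt(dot($a, $a)) * sqrt(dot($b, $b));
    return $denominator > 0.0 ? dot($a, $b) / $denominator : 0.0;
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

// Both lines compute the same similarity, up to floating-point rounding.
$viaCosine = cosineSimilarity($a, $b);
$viaDot    = dot(normalize($a), normalize($b));
```

Normalizing once at write time moves the square roots out of the search loop, which matters most for the brute-force stores.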
Roadmap
- Core `SemanticCache` (remember/lookup/put) + `Lookup` result
- Cosine similarity + exact-match tier
- `EmbedderInterface` + adapters
- In-memory + PSR-16 stores (brute force)
- pgvector + Redis + Qdrant stores
- Namespaces, TTL, stampede protection, hit-rate metrics
See CLAUDE.md for the full phase plan and conventions.
License
MIT