Semantic Cache¶
Cache LLM responses so that semantically equivalent queries return cached answers instantly, while strict context isolation prevents cross-domain leakage.
Why Semantic Caching?¶
Exact-match caches miss when users rephrase a question. A user asking "How do I return something?" won't hit a cache entry stored under "What is your return policy?" — even though the intent is identical.
HyperBinder's Semantic Cache solves this by encoding queries into hyperdimensional vectors and matching on similarity, not string equality.
```mermaid
flowchart LR
    Q["'How do I return something?'"] -->|semantic match| E["'What is your return policy?'"]
    E --> R["30-day returns..."]
```
How It Works¶
Under the hood, `SemanticCache` uses a `Triple` schema:
| Slot | Field | Encoding | Role |
|---|---|---|---|
| subject | `query` | `SEMANTIC` | Matched by similarity |
| predicate | `context` | `EXACT` | Matched exactly — isolates domains |
| object | `response` | `SEMANTIC` (low weight) | Stored but not searched on |
The `EXACT` encoding on `context` means that a query stored under `"order"` will never match a lookup under `"billing"`, regardless of similarity threshold. This is enforced algebraically by the HDC encoding, not by post-hoc filtering.
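The intuition behind this algebraic isolation can be sketched with plain hyperdimensional vectors: binding (elementwise multiplication of random bipolar vectors) a query to the wrong context vector yields near-zero similarity, no matter how similar the queries themselves are. This is a conceptual toy, not HyperBinder's actual encoder.

```python
import random

D = 10_000  # high dimension: similarity of random vectors concentrates near 0

def rand_hv(rng):
    """Random bipolar hypervector in {-1, +1}^D."""
    return [rng.choice((-1, 1)) for _ in range(D)]

def bind(a, b):
    """Elementwise multiply: the classic HDC binding operator."""
    return [x * y for x, y in zip(a, b)]

def sim(a, b):
    """Normalized dot product, in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / D

rng = random.Random(42)
query   = rand_hv(rng)
order   = rand_hv(rng)  # context vector for "order"
billing = rand_hv(rng)  # context vector for "billing"

stored = bind(query, order)  # what the cache conceptually holds

print(sim(stored, bind(query, order)))    # 1.0 -> same context matches
print(sim(stored, bind(query, billing)))  # ~0.0 -> wrong context cannot match
```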
Quick Start¶
```python
from hyperbinder import HyperBinder, SemanticCache

# `model` is any embedding model exposing an encode() method
hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="llm_cache")

# Store
cache.put("What is your return policy?", "order",
          "Items can be returned within 30 days.")

# Retrieve (semantic match)
hit = cache.get("How do I return something?", "order")
print(hit.response)  # "Items can be returned within 30 days."
print(hit.score)     # e.g., 0.82

# Wrong context -> guaranteed miss
assert cache.get("How do I return something?", "billing") is None
```
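A common way to use this in practice is the cache-aside pattern: check the cache first, and on a miss call the LLM and store the fresh answer. The sketch below assumes the `get`/`put` signatures shown above; `llm_call` is a hypothetical stand-in for your model client, and the stub cache exists only so the example runs on its own.

```python
from types import SimpleNamespace

def cached_answer(cache, query, context, llm_call, **get_kwargs):
    """Cache-aside: return a cached response on a semantic hit,
    otherwise call the LLM and store the result for next time."""
    hit = cache.get(query, context, **get_kwargs)
    if hit is not None:
        return hit.response
    response = llm_call(query)
    cache.put(query, context, response)
    return response

# --- stand-in cache so the demo is self-contained (not SemanticCache) ---
class StubCache:
    def __init__(self):
        self._store = {}
    def get(self, query, context, **kwargs):
        entry = self._store.get((query, context))
        return SimpleNamespace(response=entry) if entry else None
    def put(self, query, context, response):
        self._store[(query, context)] = response

calls = []
def llm_call(q):
    calls.append(q)
    return "fresh answer"

cache = StubCache()
cached_answer(cache, "Return policy?", "order", llm_call)  # miss -> 1 LLM call
cached_answer(cache, "Return policy?", "order", llm_call)  # hit  -> no new call
print(len(calls))  # 1
```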
Tuning the Threshold¶
The `threshold` parameter controls the precision/recall trade-off:
| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.50 | ~85% | ~90% | High recall, tolerant of looser matches |
| 0.65 (default) | ~93% | ~83% | Balanced — good starting point |
| 0.75+ | ~98% | ~65% | Strict isolation, minimal false positives |
**Start with the default**
The default of 0.65 was tuned on a 24-cluster benchmark with 96 query variations across 6 domains. Adjust up for stricter matching, down for higher recall.
You can override per-call:

```python
# Strict match for this specific lookup
hit = cache.get("sensitive query", "finance", threshold=0.80)
```
Time-To-Live (TTL)¶
Entries can expire automatically:

```python
from datetime import timedelta

# Default TTL for all entries
cache = SemanticCache(hb, default_ttl=timedelta(hours=4))

# Per-entry override
cache.put("Pricing info?", "billing", "$9.99/mo", ttl=timedelta(hours=1))

# No TTL (never expires) — the default if default_ttl is None
cache.put("FAQ answer", "general", "See our help center.")
```
Expired entries are skipped during `get()` — they are treated as cache misses.
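The expiry semantics described above can be illustrated with a minimal sketch: each entry carries an `expires_at` epoch timestamp, and lookups treat anything past it as a miss. This is a toy model, not HyperBinder code.

```python
import time
from datetime import timedelta

class TTLStore:
    """Toy key-value store with SemanticCache-style TTL semantics."""

    def __init__(self, default_ttl=None):
        self.default_ttl = default_ttl
        self._entries = {}

    def put(self, key, value, ttl=None):
        ttl = ttl if ttl is not None else self.default_ttl
        # A None TTL means the entry never expires.
        expires_at = time.time() + ttl.total_seconds() if ttl else None
        self._entries[key] = (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() >= expires_at:
            return None  # expired -> treated as a cache miss
        return value

store = TTLStore(default_ttl=timedelta(hours=4))
store.put("faq", "See our help center.")              # uses default TTL
store.put("stale", "old", ttl=timedelta(seconds=-1))  # already expired
print(store.get("faq"))    # "See our help center."
print(store.get("stale"))  # None
```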
Filtering with should_cache¶
Not all responses should be cached. Use the `should_cache` callback to skip personalized, transactional, or sensitive responses:

```python
cache = SemanticCache(
    hb,
    should_cache=lambda query, ctx, response: (
        "personal" not in response.lower()
        and len(response) > 20  # skip trivial responses
    ),
)

# This will be silently skipped
cache.put("My info?", "account", "Your personal balance is $142.50")
```
Batch Seeding¶
Pre-populate the cache from a DataFrame of known Q&A pairs:

```python
import pandas as pd

faq_df = pd.DataFrame([
    {"query": "Return policy?", "context": "order", "response": "30-day returns."},
    {"query": "Cancel subscription?", "context": "sub", "response": "Go to Settings."},
    {"query": "Payment methods?", "context": "billing", "response": "Visa, MC, PayPal."},
])
cache.seed(faq_df)
```
The DataFrame must contain the columns `query`, `context`, and `response`. An optional `expires_at` column (epoch seconds as a float) is respected if present.
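Since `seed()` expects this fixed schema, it can be worth validating the frame before ingesting. A minimal checker (my addition, not part of the library):

```python
import pandas as pd

REQUIRED = {"query", "context", "response"}

def validate_seed_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Check required columns and drop rows with missing values."""
    missing = REQUIRED - set(df.columns)
    if missing:
        raise ValueError(f"seed frame missing columns: {sorted(missing)}")
    if "expires_at" in df.columns:
        # Optional column: must be epoch seconds as a float.
        df["expires_at"] = df["expires_at"].astype(float)
    return df.dropna(subset=sorted(REQUIRED))

faq_df = pd.DataFrame([
    {"query": "Return policy?", "context": "order", "response": "30-day returns."},
    {"query": "Broken row", "context": None, "response": "n/a"},
])
clean = validate_seed_frame(faq_df)  # the None-context row is dropped
print(len(clean))  # 1
```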
Cache Management¶
```python
# Statistics
stats = cache.stats()
# {"count": 150, "contexts": ["billing", "order", "support"], "collection": "llm_cache"}

# Invalidate a specific context (e.g., after content update)
cache.invalidate(context="billing")

# Clear everything
cache.clear()
```
**Invalidation cost**
`invalidate(context=...)` performs a full scan, filter, and re-ingest of non-matching entries. This is fine for small-to-medium caches but may be slow at very large scale (100K+ entries). For bulk invalidation, prefer `clear()` and re-seed.
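The clear-and-reseed approach can be expressed as a simple frame filter; this sketch assumes you keep (or can export) a DataFrame of the entries you seeded:

```python
import pandas as pd

faq_df = pd.DataFrame([
    {"query": "Return policy?",   "context": "order",   "response": "30-day returns."},
    {"query": "Payment methods?", "context": "billing", "response": "Visa, MC, PayPal."},
])

# Drop the stale domain, then rebuild the cache in one pass:
fresh_df = faq_df[faq_df["context"] != "billing"].reset_index(drop=True)
# cache.clear()
# cache.seed(fresh_df)
print(len(fresh_df))  # 1
```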
Performance¶
The cache includes two built-in optimizations:
- **Embedding LRU cache** — Repeated `get()` calls for the same query text skip the `encode_fn` entirely. Controlled by `embedding_cache_size` (default: 1024 entries).
- **Query builder caching** — The internal `ComposeQuery` object is reused across calls, avoiding repeated schema validation.
Typical latencies (mock embeddings, local mode):
| Cache Size | Put (p50) | Get-Hit (p50) | Get-Miss (p50) |
|---|---|---|---|
| 1,000 | ~2ms | ~3ms | ~2ms |
| 10,000 | ~3ms | ~5ms | ~3ms |
| 100,000 | ~5ms | ~8ms | ~5ms |
**Real embedding latency**
With a real embedding model (e.g., `all-MiniLM-L6-v2`), `encode_fn` typically adds 5-15ms per call. The embedding cache eliminates this for repeated queries.
Next Steps¶
- API Reference: SemanticCache — Full method documentation
- Compose System — How the underlying Triple schema works
- Embeddings — Choosing and configuring an embedding model