
Semantic Cache

Cache LLM responses so that semantically equivalent queries return cached answers instantly, while strict context isolation prevents cross-domain leakage.

Why Semantic Caching?

Exact-match caches miss when users rephrase a question. A user asking "How do I return something?" won't hit a cache entry stored under "What is your return policy?" — even though the intent is identical.

HyperBinder's Semantic Cache solves this by encoding queries into hyperdimensional vectors and matching on similarity, not string equality.

flowchart LR
    Q["'How do I return something?'"] -->|semantic match| E["'What is your return policy?'"]
    E --> R["30-day returns..."]

How It Works

Under the hood, SemanticCache uses a Triple schema:

| Slot | Field | Encoding | Role |
|---|---|---|---|
| subject | query | SEMANTIC | Matched by similarity |
| predicate | context | EXACT | Matched exactly; isolates domains |
| object | response | SEMANTIC (low weight) | Stored but not searched on |

The EXACT encoding on context means that a query stored under "order" will never match a lookup under "billing", regardless of similarity threshold. This is enforced algebraically by the HDC encoding, not by post-hoc filtering.
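The isolation guarantee can be pictured with a toy binding operation. This is an illustrative sketch of the general HDC idea (random bipolar hypervectors, elementwise binding), not HyperBinder's actual encoding:

```python
import random

DIM = 4096
random.seed(0)

def rand_hv():
    # random bipolar hypervector: each component is +1 or -1
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    # elementwise multiply: the binding operation for bipolar HDC
    return [x * y for x, y in zip(a, b)]

def cosine(a, b):
    # all bipolar vectors have norm sqrt(DIM), so cosine = dot / DIM
    return sum(x * y for x, y in zip(a, b)) / DIM

query = rand_hv()        # stand-in for an encoded query
ctx_order = rand_hv()    # hypervector for context "order"
ctx_billing = rand_hv()  # hypervector for context "billing"

stored = bind(query, ctx_order)

# Same context: binding cancels exactly, similarity is 1.0
assert cosine(stored, bind(query, ctx_order)) == 1.0

# Different context: similarity collapses toward 0, far below any
# useful threshold -- no post-hoc filtering needed
assert abs(cosine(stored, bind(query, ctx_billing))) < 0.2
```

Because a wrong-context lookup lands near zero similarity by construction, no threshold setting can accidentally admit it.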

Quick Start

from hyperbinder import HyperBinder, SemanticCache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="llm_cache")

# Store
cache.put("What is your return policy?", "order",
          "Items can be returned within 30 days.")

# Retrieve (semantic match)
hit = cache.get("How do I return something?", "order")
print(hit.response)  # "Items can be returned within 30 days."
print(hit.score)     # e.g., 0.82

# Wrong context -> guaranteed miss
assert cache.get("How do I return something?", "billing") is None

Tuning the Threshold

The threshold parameter controls the precision/recall trade-off:

| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.50 | ~85% | ~90% | High recall, tolerant of looser matches |
| 0.65 (default) | ~93% | ~83% | Balanced; good starting point |
| 0.75+ | ~98% | ~65% | Strict isolation, minimal false positives |

Start with the default

The default of 0.65 was tuned on a 24-cluster benchmark with 96 query variations across 6 domains. Adjust up for stricter matching, down for higher recall.

You can override per-call:

# Strict match for this specific lookup
hit = cache.get("sensitive query", "finance", threshold=0.80)

Time-To-Live (TTL)

Entries can expire automatically:

from datetime import timedelta

# Default TTL for all entries
cache = SemanticCache(hb, default_ttl=timedelta(hours=4))

# Per-entry override
cache.put("Pricing info?", "billing", "$9.99/mo", ttl=timedelta(hours=1))

# No explicit TTL: inherits default_ttl (4 hours here). Entries only
# live forever when default_ttl is None and no ttl is passed.
cache.put("FAQ answer", "general", "See our help center.")

Expired entries are skipped during get() — they are treated as cache misses.
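The miss semantics can be sketched with a small helper (hypothetical, not the library's internals), assuming expiry is stored as an epoch float:

```python
import time

def is_expired(expires_at, now=None):
    # hypothetical helper: entries with expires_at=None never expire
    if expires_at is None:
        return False
    return (now if now is not None else time.time()) >= expires_at

now = time.time()
assert is_expired(None) is False                  # no TTL: always fresh
assert is_expired(now + 3600, now=now) is False   # an hour left
assert is_expired(now - 1, now=now) is True       # already past; cache miss
```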

Filtering with should_cache

Not all responses should be cached. Use the should_cache callback to skip personalized, transactional, or sensitive responses:

cache = SemanticCache(
    hb,
    should_cache=lambda query, ctx, response: (
        "personal" not in response.lower()
        and len(response) > 20  # skip trivial responses
    ),
)

# This will be silently skipped
cache.put("My info?", "account", "Your personal balance is $142.50")
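The predicate is an ordinary Python callable, so you can exercise it standalone before wiring it in (the responses below are made up for illustration):

```python
# The same predicate as above, tested in isolation
should_cache = lambda query, ctx, response: (
    "personal" not in response.lower()
    and len(response) > 20
)

assert not should_cache("My info?", "account",
                        "Your personal balance is $142.50")
assert not should_cache("Hi?", "general", "Hello!")  # too short to be worth caching
assert should_cache("Return policy?", "order",
                    "Items can be returned within 30 days.")
```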

Batch Seeding

Pre-populate the cache from a DataFrame of known Q&A pairs:

import pandas as pd

faq_df = pd.DataFrame([
    {"query": "Return policy?", "context": "order", "response": "30-day returns."},
    {"query": "Cancel subscription?", "context": "sub", "response": "Go to Settings."},
    {"query": "Payment methods?", "context": "billing", "response": "Visa, MC, PayPal."},
])

cache.seed(faq_df)

The DataFrame must contain columns: query, context, response. An optional expires_at column (epoch float) is respected if present.
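If seeded rows should lapse on their own, one approach (assuming the epoch-float expires_at column described above) is to compute the timestamps before seeding:

```python
import time
import pandas as pd

# hypothetical promotional entries that should lapse in 24 hours
promo_expiry = time.time() + 24 * 3600  # epoch float, as seed() expects

promo_df = pd.DataFrame([
    {"query": "Current discount?", "context": "billing",
     "response": "20% off annual plans.", "expires_at": promo_expiry},
])
```

Pass the frame to cache.seed() exactly as before; once expires_at is in the past, lookups against those rows return misses.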

Cache Management

# Statistics
stats = cache.stats()
# {"count": 150, "contexts": ["billing", "order", "support"], "collection": "llm_cache"}

# Invalidate a specific context (e.g., after content update)
cache.invalidate(context="billing")

# Clear everything
cache.clear()

Invalidation cost

invalidate(context=...) performs a full scan, filter, and re-ingest of non-matching entries. This is fine for small-to-medium caches but may be slow at very large scale (100K+ entries). For bulk invalidation, prefer clear() and re-seed.

Performance

The cache includes two built-in optimizations:

  1. Embedding LRU cache — Repeated get() calls for the same query text skip the encode_fn entirely. Controlled by embedding_cache_size (default: 1024 entries).

  2. Query builder caching — The internal ComposeQuery object is reused across calls, avoiding repeated schema validation.
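The first optimization behaves like functools.lru_cache wrapped around encode_fn. A minimal sketch of the idea, with a dummy embedding function standing in for a real model:

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=1024)  # mirrors the embedding_cache_size default
def embed(text):
    # stand-in for an expensive encode_fn call (dummy embedding)
    calls["n"] += 1
    return tuple(float(ord(c)) for c in text)

embed("How do I return something?")
embed("How do I return something?")  # second call never reaches encode_fn
assert calls["n"] == 1
```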

Typical latencies (mock embeddings, local mode):

| Cache Size | Put (p50) | Get-Hit (p50) | Get-Miss (p50) |
|---|---|---|---|
| 1,000 | ~2ms | ~3ms | ~2ms |
| 10,000 | ~3ms | ~5ms | ~3ms |
| 100,000 | ~5ms | ~8ms | ~5ms |

Real embedding latency

With a real embedding model (e.g., all-MiniLM-L6-v2), encode_fn typically adds 5-15ms per call. The embedding cache eliminates this for repeated queries.

Next Steps