
Semantic Cache

Cache LLM responses so that semantically equivalent queries return cached answers instantly, while strict context isolation prevents cross-domain leakage.

Why Semantic Caching?

Exact-match caches miss when users rephrase a question. A user asking "How do I return something?" won't hit a cache entry stored under "What is your return policy?" — even though the intent is identical.

HyperBinder's Semantic Cache solves this by encoding queries into hyperdimensional vectors and matching on similarity, not string equality.

flowchart LR
    Q["'How do I return something?'"] -->|semantic match| E["'What is your return policy?'"]
    E --> R["30-day returns..."]

How It Works

Under the hood, SemanticCache uses a Triple schema:

| Slot | Field | Encoding | Role |
|---|---|---|---|
| subject | query | SEMANTIC | Matched by similarity |
| predicate | context | EXACT | Matched exactly; isolates domains |
| object | response | SEMANTIC (low weight) | Stored but not searched on |

The EXACT encoding on context means that a query stored under "order" will never match a lookup under "billing", regardless of similarity threshold. This is enforced algebraically by the HDC encoding, not by post-hoc filtering.
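The isolation guarantee can be pictured with a toy binding operation. This is an illustrative sketch of the general HDC idea (random bipolar hypervectors, elementwise binding), not HyperBinder's actual encoding:

```python
import random

DIM = 4096
random.seed(0)

def rand_hv():
    # random bipolar hypervector: each component is +1 or -1
    return [random.choice((-1, 1)) for _ in range(DIM)]

def bind(a, b):
    # elementwise multiply: the binding operation for bipolar HDC
    return [x * y for x, y in zip(a, b)]

def cosine(a, b):
    # all bipolar vectors have norm sqrt(DIM), so cosine = dot / DIM
    return sum(x * y for x, y in zip(a, b)) / DIM

query = rand_hv()        # stand-in for an encoded query
ctx_order = rand_hv()    # hypervector for context "order"
ctx_billing = rand_hv()  # hypervector for context "billing"

stored = bind(query, ctx_order)

# Same context: binding cancels exactly, similarity is 1.0
assert cosine(stored, bind(query, ctx_order)) == 1.0

# Different context: similarity collapses toward 0, far below any
# useful threshold -- no post-hoc filtering needed
assert abs(cosine(stored, bind(query, ctx_billing))) < 0.2
```

Because a wrong-context lookup lands near zero similarity by construction, no threshold setting can accidentally admit it.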

Quick Start

from hyperbinder import HyperBinder, SemanticCache
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="llm_cache")

# Store
cache.put("What is your return policy?", "order",
          "Items can be returned within 30 days.")

# Retrieve (semantic match)
hit = cache.get("How do I return something?", "order")
print(hit.response)  # "Items can be returned within 30 days."
print(hit.score)     # e.g., 0.82

# Wrong context -> guaranteed miss
assert cache.get("How do I return something?", "billing") is None

Tuning the Threshold

The threshold parameter controls the precision/recall trade-off:

| Threshold | Precision | Recall | Best For |
|---|---|---|---|
| 0.50 | ~85% | ~90% | High recall, tolerant of looser matches |
| 0.65 (default) | ~93% | ~83% | Balanced; good starting point |
| 0.75+ | ~98% | ~65% | Strict isolation, minimal false positives |

Start with the default

The default of 0.65 was tuned on a 24-cluster benchmark with 96 query variations across 6 domains. Adjust up for stricter matching, down for higher recall.

You can override per-call:

# Strict match for this specific lookup
hit = cache.get("sensitive query", "finance", threshold=0.80)

Time-To-Live (TTL)

Entries can expire automatically:

from datetime import timedelta

# Default TTL for all entries
cache = SemanticCache(hb, default_ttl=timedelta(hours=4))

# Per-entry override
cache.put("Pricing info?", "billing", "$9.99/mo", ttl=timedelta(hours=1))

# No explicit TTL: inherits default_ttl (4 hours here). Entries only
# live forever when default_ttl is None and no ttl is passed.
cache.put("FAQ answer", "general", "See our help center.")

Expired entries are skipped during get() — they are treated as cache misses.
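The miss semantics can be sketched with a small helper (hypothetical, not the library's internals), assuming expiry is stored as an epoch float:

```python
import time

def is_expired(expires_at, now=None):
    # hypothetical helper: entries with expires_at=None never expire
    if expires_at is None:
        return False
    return (now if now is not None else time.time()) >= expires_at

now = time.time()
assert is_expired(None) is False                  # no TTL: always fresh
assert is_expired(now + 3600, now=now) is False   # an hour left
assert is_expired(now - 1, now=now) is True       # already past; cache miss
```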

Filtering with should_cache

Not all responses should be cached. Use the should_cache callback to skip personalized, transactional, or sensitive responses:

cache = SemanticCache(
    hb,
    should_cache=lambda query, ctx, response: (
        "personal" not in response.lower()
        and len(response) > 20  # skip trivial responses
    ),
)

# This will be silently skipped
cache.put("My info?", "account", "Your personal balance is $142.50")
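The predicate is an ordinary Python callable, so you can exercise it standalone before wiring it in (the responses below are made up for illustration):

```python
# The same predicate as above, tested in isolation
should_cache = lambda query, ctx, response: (
    "personal" not in response.lower()
    and len(response) > 20
)

assert not should_cache("My info?", "account",
                        "Your personal balance is $142.50")
assert not should_cache("Hi?", "general", "Hello!")  # too short to be worth caching
assert should_cache("Return policy?", "order",
                    "Items can be returned within 30 days.")
```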

Batch Seeding

Pre-populate the cache from a DataFrame of known Q&A pairs:

import pandas as pd

faq_df = pd.DataFrame([
    {"query": "Return policy?", "context": "order", "response": "30-day returns."},
    {"query": "Cancel subscription?", "context": "sub", "response": "Go to Settings."},
    {"query": "Payment methods?", "context": "billing", "response": "Visa, MC, PayPal."},
])

cache.seed(faq_df)

The DataFrame must contain columns: query, context, response. An optional expires_at column (epoch float) is respected if present.
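If seeded rows should lapse on their own, one approach (assuming the epoch-float expires_at column described above) is to compute the timestamps before seeding:

```python
import time
import pandas as pd

# hypothetical promotional entries that should lapse in 24 hours
promo_expiry = time.time() + 24 * 3600  # epoch float, as seed() expects

promo_df = pd.DataFrame([
    {"query": "Current discount?", "context": "billing",
     "response": "20% off annual plans.", "expires_at": promo_expiry},
])
```

Pass the frame to cache.seed() exactly as before; once expires_at is in the past, lookups against those rows return misses.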

Cache Management

# Statistics
stats = cache.stats()
# {"count": 150, "contexts": ["billing", "order", "support"], "collection": "llm_cache"}

# Invalidate a specific context (e.g., after content update)
cache.invalidate(context="billing")

# Clear everything
cache.clear()

Invalidation cost

invalidate(context=...) performs a full scan, filter, and re-ingest of non-matching entries. This is fine for small-to-medium caches but may be slow at very large scale (100K+ entries). For bulk invalidation, prefer clear() and re-seed.

Performance

The cache includes two built-in optimizations:

  1. Embedding LRU cache — Repeated get() calls for the same query text skip the encode_fn entirely. Controlled by embedding_cache_size (default: 1024 entries).

  2. Query builder caching — The internal ComposeQuery object is reused across calls, avoiding repeated schema validation.
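The first optimization behaves like functools.lru_cache wrapped around encode_fn. A minimal sketch of the idea, with a dummy embedding function standing in for a real model:

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=1024)  # mirrors the embedding_cache_size default
def embed(text):
    # stand-in for an expensive encode_fn call (dummy embedding)
    calls["n"] += 1
    return tuple(float(ord(c)) for c in text)

embed("How do I return something?")
embed("How do I return something?")  # second call never reaches encode_fn
assert calls["n"] == 1
```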

Typical latencies (mock embeddings, local mode):

| Cache Size | Put (p50) | Get-Hit (p50) | Get-Miss (p50) |
|---|---|---|---|
| 1,000 | ~2ms | ~3ms | ~2ms |
| 10,000 | ~3ms | ~5ms | ~3ms |
| 100,000 | ~5ms | ~8ms | ~5ms |

Real embedding latency

With a real embedding model (e.g., all-MiniLM-L6-v2), encode_fn typically adds 5-15ms per call. The embedding cache eliminates this for repeated queries.

Next Steps