Semantic Cache

High-level LLM response caching with semantic matching and context isolation.

Overview

SemanticCache wraps HyperBinder's compositional encoding to provide a put/get cache where queries are matched semantically (different wording, same intent) while context keys are matched exactly, preventing cross-context false positives.

```python
from hybi import HyperBinder, SemanticCache

hb = HyperBinder(local=True, encode_fn=model.encode)
cache = SemanticCache(hb, collection="llm_cache")

# Store a response
cache.put(query="What is your return policy?", context="order",
          response="Items can be returned within 30 days.")

# Retrieve – semantically similar queries hit the cache
hit = cache.get(query="How do I return something?", context="order")
if hit:
    print(hit.response)  # "Items can be returned within 30 days."

# Different context -> no cross-contamination
miss = cache.get(query="How do I return something?", context="billing")
assert miss is None
```


SemanticCache

hybi.semantic_cache.SemanticCache

High-level semantic cache for LLM responses.

Uses a Bundle schema internally with four fields:

  • query (SEMANTIC): query text — for similarity matching
  • context (EXACT): context key — for domain isolation
  • response (SEMANTIC, low weight): response text — stored, not searched on
  • _cache_key (EXACT, zero weight): content hash for dedup — never searched on

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `hb` | `HyperBinder` | A HyperBinder client instance (local or remote). | required |
| `collection` | `str` | Name of the backing collection. | `'semantic_cache'` |
| `threshold` | `float` | Default similarity threshold for cache hits. The benchmark-tuned default of 0.65 balances recall (83%) with precision (93%). Use >= 0.75 for strict context isolation. | `0.65` |
| `query_weight` | `float` | Weight for the query (subject) slot. Higher values make query similarity dominate the score. | `2.0` |
| `response_weight` | `float` | Weight for the response (object) slot. Keep low so stored responses don't pollute search. | `0.1` |
| `top_k` | `int` | Number of candidates to scan during `get()`. | `5` |
| `default_ttl` | `Optional[Union[timedelta, int, float]]` | Default time-to-live for cache entries; entries older than this are treated as misses. `None` means no expiry. Accepts a `timedelta` or seconds as `int`/`float`. | `None` |
| `should_cache` | `Optional[Callable[[str, str, str], bool]]` | Optional callback `(query, context, response) -> bool` called before storing an entry. Return `False` to skip caching (e.g. for personalized or transactional responses). | `None` |
| `embedding_cache_size` | `int` | Max number of query embeddings to cache in-process; avoids redundant encoding calls for repeated queries. Set to `0` to disable. | `1024` |
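The `embedding_cache_size` behavior can be pictured as a small in-process LRU keyed by query text: repeated queries reuse their cached embedding instead of re-invoking the encoder. Below is a minimal stdlib-only sketch of that idea, not the library's actual implementation; `fake_encode` is a stand-in for a real embedding model.

```python
from collections import OrderedDict


def fake_encode(text):
    # Stand-in for a real (and typically expensive) embedding model call.
    return [float(len(text)), float(text.count(" "))]


class EmbeddingLRU:
    """Tiny LRU wrapper around an encode function (illustrative only)."""

    def __init__(self, encode_fn, maxsize=1024):
        self.encode_fn = encode_fn
        self.maxsize = maxsize
        self._cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def encode(self, text):
        if self.maxsize == 0:                  # size 0 disables caching entirely
            return self.encode_fn(text)
        if text in self._cache:
            self.hits += 1
            self._cache.move_to_end(text)      # mark as most recently used
            return self._cache[text]
        self.misses += 1
        vec = self.encode_fn(text)
        self._cache[text] = vec
        if len(self._cache) > self.maxsize:
            self._cache.popitem(last=False)    # evict least recently used
        return vec


lru = EmbeddingLRU(fake_encode, maxsize=2)
lru.encode("What is your return policy?")
lru.encode("What is your return policy?")  # served from the cache
print(lru.hits, lru.misses)  # → 1 1
```

The payoff is largest for workloads where the same handful of queries recur, which is exactly the situation a semantic cache targets.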

Example:

```python
from hybi import HyperBinder, SemanticCache

hb = HyperBinder(local=True)
cache = SemanticCache(hb, collection="llm_cache")

# Store a response
cache.put(
    query="What is your return policy?",
    context="order",
    response="Items can be returned within 30 days.",
)

# Retrieve — semantically similar queries hit the cache
hit = cache.get(query="How do I return something?", context="order")
if hit:
    print(hit.response)  # "Items can be returned within 30 days."

# Different context -> no cross-contamination
miss = cache.get(query="How do I return something?", context="billing")
assert miss is None
```

Example with TTL and filtering:

```python
from datetime import timedelta

cache = SemanticCache(
    hb,
    collection="llm_cache",
    default_ttl=timedelta(hours=4),
    should_cache=lambda q, ctx, r: "personal" not in r.lower(),
)
cache.put("Pricing info?", "billing", "$9.99/mo", ttl=timedelta(hours=1))
```

__init__(hb, collection='semantic_cache', *, threshold=0.65, query_weight=2.0, response_weight=0.1, top_k=5, default_ttl=None, should_cache=None, embedding_cache_size=1024, phase_dim=None)

put(query, context, response, *, ttl=None)

Store a cache entry.

If an entry with the same query+context already exists, it is replaced (upsert semantics).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `query` | `str` | The query text (matched semantically on lookup). | required |
| `context` | `str` | Context key for isolation (matched exactly). | required |
| `response` | `str` | The response to cache. | required |
| `ttl` | `Optional[Union[timedelta, int, float]]` | Time-to-live for this entry; overrides `default_ttl`. Accepts a `timedelta` or seconds as `int`/`float`. `None` uses the cache's `default_ttl`. | `None` |
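Since `ttl` and `default_ttl` accept either a `timedelta` or raw seconds, a resolver along these lines is one plausible way to normalize them. This is a sketch of the documented contract, not the library's internals; the function names are hypothetical.

```python
import time
from datetime import timedelta


def resolve_ttl_seconds(ttl, default_ttl=None):
    """Normalize a per-entry TTL to seconds, falling back to the cache default.

    Returns None when no expiry applies.
    """
    value = ttl if ttl is not None else default_ttl
    if value is None:
        return None
    if isinstance(value, timedelta):
        return value.total_seconds()
    return float(value)  # int or float, interpreted as seconds


def expires_at(ttl, default_ttl=None, now=None):
    """Epoch timestamp after which an entry counts as a miss (None = never)."""
    seconds = resolve_ttl_seconds(ttl, default_ttl)
    if seconds is None:
        return None
    return (now if now is not None else time.time()) + seconds


print(resolve_ttl_seconds(timedelta(hours=1)))    # → 3600.0
print(resolve_ttl_seconds(None, default_ttl=90))  # → 90.0
print(resolve_ttl_seconds(None))                  # → None
```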

get(query, context=None, *, threshold=None, top_k=None)

Look up a cache entry by semantic similarity.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `query` | `str` | The query text to match against cached entries. | required |
| `context` | `Optional[str]` | Context key; must match exactly. If `None`, searches across all contexts. | `None` |
| `threshold` | `Optional[float]` | Override the default similarity threshold. | `None` |
| `top_k` | `Optional[int]` | Override the default number of candidates to scan. | `None` |

Returns:

| Type | Description |
|------|-------------|
| `Optional[CacheHit]` | A `CacheHit` if a matching entry is found, otherwise `None`. |

seed(data)

Batch-populate the cache from a DataFrame or list of dicts.

The data must contain columns: query, context, response. An optional expires_at column (epoch float) is respected if present.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `data` | `Union[DataFrame, List[Dict[str, str]]]` | DataFrame or list of dicts with cache entries. | required |

Returns:

| Type | Description |
|------|-------------|
| `int` | Number of entries seeded. |

Raises:

| Type | Description |
|------|-------------|
| `ValueError` | If required columns are missing. |
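The column check that makes `seed()` raise `ValueError` can be pictured roughly as follows. This is a stdlib-only sketch of the documented contract for the list-of-dicts input form, not the library's actual code; `validate_seed_rows` is a hypothetical helper.

```python
REQUIRED = ("query", "context", "response")


def validate_seed_rows(rows):
    """Check that every row carries the required cache columns.

    `rows` is a list of dicts, mirroring seed()'s list-of-dicts input.
    Returns the number of valid rows, or raises ValueError.
    """
    for i, row in enumerate(rows):
        missing = [col for col in REQUIRED if col not in row]
        if missing:
            raise ValueError(f"row {i} is missing columns: {missing}")
    return len(rows)


rows = [
    {"query": "Pricing info?", "context": "billing", "response": "$9.99/mo"},
    {"query": "Return policy?", "context": "order", "response": "30 days."},
]
print(validate_seed_rows(rows))  # → 2
```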

invalidate(*, context=None)

Remove cache entries.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `context` | `Optional[str]` | If provided, remove only entries with this context. If `None`, remove all entries (same as `clear()`). | `None` |

clear()

Remove all cache entries.

stats()

Get cache statistics.

Returns:

| Type | Description |
|------|-------------|
| `Dict[str, Any]` | Dict with keys: `count`, `contexts`, `collection`. |
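The `stats()` payload could be derived from stored rows roughly like this. A hypothetical in-memory sketch that only reproduces the documented dict shape (`count`, `contexts`, `collection`), not the library's implementation:

```python
def build_stats(entries, collection="semantic_cache"):
    """Summarize cache rows into the documented stats() shape."""
    return {
        "count": len(entries),
        "contexts": sorted({e["context"] for e in entries}),
        "collection": collection,
    }


entries = [
    {"query": "Return policy?", "context": "order", "response": "30 days."},
    {"query": "Pricing info?", "context": "billing", "response": "$9.99/mo"},
    {"query": "Refund status?", "context": "billing", "response": "3-5 days."},
]
print(build_stats(entries, "llm_cache"))
# → {'count': 3, 'contexts': ['billing', 'order'], 'collection': 'llm_cache'}
```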

exists()

Check if the backing collection exists.


CacheHit

hybi.semantic_cache.CacheHit dataclass

Result from a successful cache lookup.

Attributes:

| Name | Type | Description |
|------|------|-------------|
| `response` | `str` | The cached response text. |
| `score` | `float` | Similarity score of the match (0-1). |
| `query` | `str` | The original cached query text that matched. |
| `context` | `str` | The context key that was matched. |
| `data` | `Dict[str, Any]` | Full row data dict for extensibility. |
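A self-contained sketch of the `CacheHit` shape and of how a threshold gate over scored candidates might look. The field names follow the table above; the gating logic and `best_hit` helper are illustrative assumptions, not the library's code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class CacheHit:
    response: str
    score: float
    query: str
    context: str
    data: Dict[str, Any] = field(default_factory=dict)


def best_hit(candidates, threshold=0.65) -> Optional[CacheHit]:
    """Return the highest-scoring candidate at or above threshold, else None."""
    passing = [c for c in candidates if c.score >= threshold]
    return max(passing, key=lambda c: c.score, default=None)


hits = [
    CacheHit("30 days.", 0.81, "Return policy?", "order"),
    CacheHit("$9.99/mo", 0.42, "Pricing info?", "order"),
]
best = best_hit(hits)
print(best.response, best.score)      # → 30 days. 0.81
print(best_hit(hits, threshold=0.9))  # → None
```

Raising the threshold trades recall for precision, which is why the docs suggest >= 0.75 when strict context isolation matters.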


Constructor Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `hb` | `HyperBinder` | required | A HyperBinder client (local or remote) |
| `collection` | `str` | `"semantic_cache"` | Name of the backing collection |
| `threshold` | `float` | `0.65` | Similarity threshold for cache hits |
| `query_weight` | `float` | `2.0` | Weight for query similarity (higher = query dominates) |
| `response_weight` | `float` | `0.1` | Weight for response slot (keep low to avoid search pollution) |
| `top_k` | `int` | `5` | Number of candidates to scan per `get()` |
| `default_ttl` | `timedelta`, `int`, `float`, or `None` | `None` | Default time-to-live; `None` = no expiry |
| `should_cache` | `Callable` or `None` | `None` | Filter callback `(query, context, response) -> bool` |
| `embedding_cache_size` | `int` | `1024` | In-process embedding LRU cache size; `0` to disable |
---
## Method Summary
| Method | Description |
|--------|-------------|
| `put(query, context, response, *, ttl)` | Store a cache entry |
| `get(query, context, *, threshold, top_k)` | Look up by semantic similarity |
| `seed(data)` | Batch-populate from a DataFrame or list of dicts |
| `invalidate(*, context)` | Remove entries (all or by context) |
| `clear()` | Remove all entries (alias for `invalidate()`) |
| `stats()` | Get entry count and context list |
| `exists()` | Check if the backing collection exists |