Fuzzy-to-Exact Bridge Pattern¶
Combine semantic search (fuzzy) with exact filtering to get deterministic results safe for CRUD operations.
The Problem¶
Semantic search is powerful for discovery but inherently non-deterministic:
# Same query, different runs = potentially different results
results = hb.query("users", catalog_schema).search("ML expert")
# Run 1: [Alice, Bob, Carol]
# Run 2: [Alice, Carol, David] # Bob dropped below threshold
This makes semantic search unsafe for mutations. You don't want to accidentally update/delete the wrong rows because similarity scores shifted.
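To make the failure mode concrete, here is a toy illustration with hard-coded similarity scores (the names and numbers are invented for this sketch, not real embedding output):

```python
# Two runs of the "same" semantic query; scores drift slightly between runs
# (approximate-NN indexes and non-deterministic float reductions are common causes).
run1_scores = {"Alice": 0.92, "Bob": 0.81, "Carol": 0.85}
run2_scores = {"Alice": 0.91, "Bob": 0.79, "Carol": 0.86}  # Bob drifted below
THRESHOLD = 0.80

run1 = [name for name, s in run1_scores.items() if s >= THRESHOLD]
run2 = [name for name, s in run2_scores.items() if s >= THRESHOLD]
print(run1)  # ['Alice', 'Bob', 'Carol']
print(run2)  # ['Alice', 'Carol'] - a mutation keyed off this run skips Bob
```

A `DELETE` or `UPDATE` driven directly by the search output would touch different rows on each run, which is exactly what the bridge pattern prevents.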
The Solution: Bridge Pattern¶
Use fuzzy search to narrow the search space, then exact filters to define deterministic boundaries:
flowchart TB
    subgraph Catalog["CATALOG (Search)"]
        C1[Semantic Index]
        C2[Fuzzy Discovery]
        C1 --> C2
    end
    subgraph RelationalTable["RELATIONAL TABLE (CRUD)"]
        R1[Chain Binding]
        R2[PK Lookups]
        R1 --> R2
    end
    C2 --> Bridge
    R2 --> Bridge
    Bridge[BRIDGE<br/>Primary Keys] --> Mutations[Deterministic Mutations]
Key insight: each compound does what it's optimized for. Catalog handles search, RelationalTable handles CRUD, and primary keys bridge the two.
Separation of Concerns¶
| Compound | Encoding | Optimized For | Use In Pattern |
|---|---|---|---|
| Catalog | Search-optimized | Fast semantic search | Discovery stage |
| RelationalTable | Structured | CRUD operations | Mutation stage |
This respects the design intent:

- Catalog's search-optimized encoding enables fast similarity search
- RelationalTable's structured encoding enables exact field updates
Implementation¶
FuzzyToExactBridge Class¶
from typing import List, Dict, Any, Optional
from dataclasses import dataclass


@dataclass
class RefinedResult:
    """Result from fuzzy-to-exact refinement."""
    pk_field: str
    pk_value: str
    data: Dict[str, Any]
    semantic_score: float
    exact_matches: Dict[str, bool]


class FuzzyToExactBridge:
    """
    Bridge that connects semantic search to exact filtering.

    Usage:
        bridge = FuzzyToExactBridge(hb, "users", users_schema)
        results = (bridge
            .fuzzy("machine learning expertise", top_k=50)
            .exact(department="Engineering")
            .exact(status="active")
            .numeric(salary__gt=100000)
            .execute())

        # Results are deterministic - safe for CRUD
        for r in results:
            hb.update("users", where={r.pk_field: r.pk_value}, set={...})
    """

    def __init__(self, client, collection: str, schema):
        self.client = client
        self.collection = collection
        self.schema = schema
        self.pk_field = schema.primary_key
        # Query state
        self._fuzzy_query: Optional[str] = None
        self._fuzzy_top_k: int = 50
        self._exact_filters: Dict[str, Any] = {}
        self._numeric_filters: List[tuple] = []

    def fuzzy(self, query: str, top_k: int = 50) -> "FuzzyToExactBridge":
        """Stage 1: Semantic search to find candidates."""
        self._fuzzy_query = query
        self._fuzzy_top_k = top_k
        return self

    def exact(self, **kwargs) -> "FuzzyToExactBridge":
        """Stage 2a: Exact field match filter."""
        self._exact_filters.update(kwargs)
        return self

    def numeric(self, **kwargs) -> "FuzzyToExactBridge":
        """
        Stage 2b: Numeric range filter.

        Supports: field__gt, field__lt, field__gte, field__lte
        """
        for key, value in kwargs.items():
            if "__" in key:
                field, op = key.rsplit("__", 1)
                self._numeric_filters.append((field, op, value))
            else:
                self._numeric_filters.append((key, "eq", value))
        return self

    def execute(self) -> List[RefinedResult]:
        """Execute the fuzzy-to-exact pipeline."""
        # Stage 1: Fuzzy search
        if self._fuzzy_query:
            candidates = self._semantic_search()
        else:
            candidates = self._get_all_candidates()

        # Stage 2: Exact + Numeric filtering
        refined = []
        for candidate in candidates:
            data = candidate["data"]
            score = candidate.get("score", 1.0)

            # Check exact filters
            exact_matches = {}
            passes_exact = True
            for field, expected in self._exact_filters.items():
                actual = data.get(field)
                matches = str(actual) == str(expected)
                exact_matches[field] = matches
                if not matches:
                    passes_exact = False
            if not passes_exact:
                continue

            # Check numeric filters
            passes_numeric = self._check_numeric_filters(data)
            if not passes_numeric:
                continue

            refined.append(RefinedResult(
                pk_field=self.pk_field,
                pk_value=str(data.get(self.pk_field)),
                data=data,
                semantic_score=score,
                exact_matches=exact_matches,
            ))
        return refined

    def _semantic_search(self) -> List[Dict]:
        """Perform semantic search via query interface."""
        results = self.client.query(self.collection, self.schema).search(
            self._fuzzy_query,
            top_k=self._fuzzy_top_k,
        )
        return [{"data": r.data, "score": getattr(r, "score", 1.0)} for r in results]

    def _get_all_candidates(self) -> List[Dict]:
        """Get all rows when no fuzzy query specified."""
        results = self.client.select(self.collection, limit=1000)
        return [{"data": r, "score": 1.0} for r in results.rows]

    def _check_numeric_filters(self, data: Dict) -> bool:
        """Check all numeric filter conditions."""
        for field, op, threshold in self._numeric_filters:
            try:
                actual = float(data.get(field, 0))
                threshold = float(threshold)
                if op == "gt" and not (actual > threshold):
                    return False
                elif op == "lt" and not (actual < threshold):
                    return False
                elif op == "gte" and not (actual >= threshold):
                    return False
                elif op == "lte" and not (actual <= threshold):
                    return False
                elif op == "eq" and not (actual == threshold):
                    return False
            except (ValueError, TypeError):
                return False
        return True
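To see the two filtering stages in isolation, here is a self-contained toy version of the refinement logic with hard-coded candidates and scores standing in for a live HyperBinder client; the `refine` helper and the candidate rows are invented for this sketch:

```python
from typing import Any, Dict, List, Tuple

# Toy (row, score) pairs standing in for Stage 1 semantic-search output.
CANDIDATES: List[Tuple[Dict[str, Any], float]] = [
    ({"user_id": "U001", "department": "Engineering", "status": "active", "salary": 150000}, 0.91),
    ({"user_id": "U002", "department": "Engineering", "status": "active", "salary": 120000}, 0.74),
    ({"user_id": "U003", "department": "Research", "status": "active", "salary": 140000}, 0.88),
    ({"user_id": "U004", "department": "Engineering", "status": "inactive", "salary": 130000}, 0.69),
]

# Operators for the Django-style field__op suffixes used by numeric().
OPS = {
    "gt": lambda a, b: a > b,
    "lt": lambda a, b: a < b,
    "gte": lambda a, b: a >= b,
    "lte": lambda a, b: a <= b,
}

def refine(candidates, exact: Dict[str, Any], **numeric) -> List[str]:
    """Stage 2: apply exact + numeric filters; return deterministic PKs."""
    pks = []
    for data, _score in candidates:
        # Exact filters: stringified equality, as in the bridge above.
        if any(str(data.get(f)) != str(v) for f, v in exact.items()):
            continue
        # Numeric filters: parse "salary__gt" into ("salary", "gt").
        passes = all(
            OPS[key.rsplit("__", 1)[1]](
                float(data.get(key.rsplit("__", 1)[0], 0)), float(threshold))
            for key, threshold in numeric.items()
        )
        if passes:
            pks.append(data["user_id"])
    return pks

pks = refine(CANDIDATES,
             {"department": "Engineering", "status": "active"},
             salary__gt=125000)
print(pks)  # ['U001'] - only this row passes all three filters
```

The returned primary keys are stable no matter how the Stage 1 scores drift, which is what makes the subsequent mutations safe.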
Complete Example: Dual-Schema Pattern¶
This example shows the recommended architecture: Catalog for search, RelationalTable for CRUD.
from hybi import HyperBinder
from hybi.compose import Catalog, RelationalTable, Field, Encoding
import pandas as pd
# ----- SCHEMA DEFINITIONS -----
# Catalog schema: Optimized for semantic search (Bundle encoding)
users_search_schema = Catalog(
    fields={
        "user_id": {"encoding": "exact"},  # For bridging to CRUD
        "name": {"encoding": "semantic", "weight": 1.0},
        "bio": {"encoding": "semantic", "weight": 1.5},  # Boost bio matches
        "department": {"encoding": "exact"},
        "status": {"encoding": "exact"},
    }
)

# RelationalTable schema: Optimized for CRUD (Row encoding)
users_crud_schema = RelationalTable(
    columns={
        "user_id": Field(encoding=Encoding.EXACT),
        "name": Field(encoding=Encoding.SEMANTIC),
        "bio": Field(encoding=Encoding.SEMANTIC),
        "department": Field(encoding=Encoding.EXACT),
        "status": Field(encoding=Encoding.EXACT),
        "salary": Field(encoding=Encoding.NUMERIC, similar_within=10000),
    },
    primary_key="user_id",
)
# ----- DATA -----
users_data = pd.DataFrame([
    {"user_id": "U001", "name": "Alice Chen",
     "bio": "Machine learning engineer with NLP expertise",
     "department": "Engineering", "status": "active", "salary": 150000},
    {"user_id": "U002", "name": "Bob Smith",
     "bio": "Backend developer focusing on APIs",
     "department": "Engineering", "status": "active", "salary": 120000},
    {"user_id": "U003", "name": "Carol White",
     "bio": "ML researcher specializing in computer vision",
     "department": "Research", "status": "active", "salary": 140000},
    {"user_id": "U004", "name": "David Lee",
     "bio": "Data scientist with machine learning background",
     "department": "Engineering", "status": "inactive", "salary": 130000},
    {"user_id": "U005", "name": "Eve Johnson",
     "bio": "Deep learning specialist in NLP",
     "department": "Engineering", "status": "active", "salary": 160000},
])
# ----- INITIALIZATION -----
hb = HyperBinder(local=True, db_path="./fuzzy_exact_demo_db")
# Ingest into BOTH collections
# Search collection: Uses Catalog (fast semantic search)
hb.ingest(users_data, collection="users_search", schema=users_search_schema)
# CRUD collection: Uses RelationalTable (exact field updates)
hb.ingest(users_data, collection="users", schema=users_crud_schema)
# ----- BRIDGE PATTERN -----
# Stage 1: Semantic search via Catalog (fast, fuzzy)
candidates = hb.query("users_search", users_search_schema).search(
    "machine learning NLP expert",
    top_k=10,
)
print(f"Stage 1 - Fuzzy search found {len(candidates)} candidates")

# Stage 2: Exact filtering on candidate data
refined = []
for r in candidates:
    data = r.data
    if (data.get("department") == "Engineering" and
            data.get("status") == "active"):
        refined.append({
            "pk": data["user_id"],
            "data": data,
            "score": getattr(r, "score", 1.0),
        })
print(f"Stage 2 - After exact filters: {len(refined)} results")
# Stage 3: CRUD via RelationalTable (deterministic)
for r in refined:
    pk = r["pk"]
    # Get current row from CRUD collection (has salary field)
    current = hb.query("users", users_crud_schema).get(user_id=pk)
    if current and current.data.get("salary", 0) > 125000:
        print(f"  {pk}: {r['data']['name']} - eligible for update")
        # Safe to mutate - we have a deterministic PK
        hb.update(
            "users",
            where={"user_id": pk},
            set={"salary": int(current.data["salary"] * 1.1)},  # 10% raise
            schema=users_crud_schema,
        )
Output:
Stage 1 - Fuzzy search found 5 candidates
Stage 2 - After exact filters: 3 results
U001: Alice Chen - eligible for update
U005: Eve Johnson - eligible for update
Single-Schema Alternative¶
If you don't need separate optimization, RelationalTable can do both (with some search overhead):
# Single schema handles both search and CRUD
users_schema = RelationalTable(
    columns={
        "user_id": Field(encoding=Encoding.EXACT),
        "name": Field(encoding=Encoding.SEMANTIC),
        "bio": Field(encoding=Encoding.SEMANTIC),
        "department": Field(encoding=Encoding.EXACT),
        "status": Field(encoding=Encoding.EXACT),
        "salary": Field(encoding=Encoding.NUMERIC, similar_within=10000),
    },
    primary_key="user_id",
)
# Ingest once
hb.ingest(users_data, collection="users", schema=users_schema)
# Search and CRUD on same collection
candidates = hb.query("users", users_schema).search("ML expert", top_k=10)
# ... filter and mutate ...
Trade-off: Simpler setup, but search is slower than with a dedicated Catalog.
When to Use This Pattern¶
Use the bridge pattern when:
- You need semantic discovery ("find people like X")
- But require deterministic results for mutations
- Your data has both semantic fields (name, bio) and exact fields (department, status)
Don't use this pattern when:
- You already know the exact primary keys
- Pure exact matching suffices (use `filter()` directly)
- You're doing read-only semantic search (no mutations needed)
Architecture Notes¶
Dual-Schema Architecture (Recommended)¶
| Collection | Compound | Encoding | Purpose |
|---|---|---|---|
| `users_search` | Catalog | Search-optimized | Fast semantic discovery |
| `users` | RelationalTable | Structured | CRUD operations |
Benefits:

- Each compound is optimized for its use case
- Search performance isn't degraded by CRUD overhead
- Clear separation of concerns
Trade-off: Data is stored twice (search index + CRUD collection). Keep them in sync by re-ingesting when source data changes.
Single-Schema Architecture¶
If storage/sync overhead is a concern, use RelationalTable for both:
| Collection | Compound | Encoding | Purpose |
|---|---|---|---|
| `users` | RelationalTable | Row | Both search and CRUD |
Benefits:

- Single source of truth
- No sync issues
Trade-off: Search is slower than with a dedicated Catalog.
Sync Strategy¶
When using dual-schema, keep collections in sync:
# Option 1: Full re-ingest (simple, slower)
def sync_collections(data, hb):
    hb.ingest(data, collection="users_search", schema=search_schema)
    hb.ingest(data, collection="users", schema=crud_schema)

# Option 2: Incremental (complex, faster)
# - CRUD changes go to RelationalTable only
# - Periodic batch sync to Catalog for search
# - Accept that search may be slightly stale
The bridge pattern works regardless of sync strategy because it always reads current PKs from the search index, then operates on the CRUD collection.
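The incremental option can be sketched with a dirty-PK set. This is a minimal in-memory illustration: `write_row`, `sync_search_index`, and the store dicts are invented for the sketch (real code would call `hb.update` and `hb.ingest` against the two collections):

```python
from typing import Dict, List, Set

crud_store: Dict[str, Dict] = {}    # RelationalTable stand-in: always current
search_store: Dict[str, Dict] = {}  # Catalog stand-in: may lag behind
dirty_pks: Set[str] = set()         # PKs changed since the last sync

def write_row(pk: str, row: Dict) -> None:
    """CRUD writes land in the relational store and mark the PK dirty."""
    crud_store[pk] = row
    dirty_pks.add(pk)

def sync_search_index() -> List[str]:
    """Periodic batch sync: re-index only the dirty rows."""
    synced = sorted(dirty_pks)
    for pk in synced:
        search_store[pk] = crud_store[pk]
    dirty_pks.clear()
    return synced

write_row("U001", {"name": "Alice Chen", "salary": 165000})
write_row("U005", {"name": "Eve Johnson", "salary": 176000})
synced = sync_search_index()
print(synced)  # ['U001', 'U005'] - the two stores now agree
```

Between syncs, search results may be stale, but mutations stay correct because they always go through the CRUD store's primary keys.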
See Also¶
- RelationalTable - Schema definition
- CRUD Operations - Client methods
- Intersections Tutorial - Cross-collection joins