Fuzzy-to-Exact Bridge Pattern

Combine semantic search (fuzzy) with exact filtering to get deterministic results safe for CRUD operations.

The Problem

Semantic search is powerful for discovery, but its result set is not stable: small shifts in similarity scores can change which rows clear the cutoff.

# Same query, different runs = potentially different results
results = hb.query("users", catalog_schema).search("ML expert")
# Run 1: [Alice, Bob, Carol]
# Run 2: [Alice, Carol, David]  # Bob dropped below threshold

This makes semantic search unsafe for mutations. You don't want to accidentally update/delete the wrong rows because similarity scores shifted.

The Solution: Bridge Pattern

Use fuzzy search to narrow the search space, then exact filters to define deterministic boundaries:

flowchart TB
    subgraph Catalog["CATALOG (Search)"]
        C1[Semantic Index]
        C2[Fuzzy Discovery]
        C1 --> C2
    end

    subgraph RelationalTable["RELATIONAL TABLE (CRUD)"]
        R1[Chain Binding]
        R2[PK Lookups]
        R1 --> R2
    end

    C2 --> Bridge
    R2 --> Bridge
    Bridge[BRIDGE<br/>Primary Keys] --> Mutations[Deterministic Mutations]

Key insight: Each compound does what it's optimized for. Catalog handles search, RelationalTable handles CRUD, and primary keys bridge between them.
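
The idea can be sketched in plain Python, without any HyperBinder calls. The helper name `bridge_pks` and the candidate dictionaries below are hypothetical and exist only for illustration. The sketch assumes both search runs return every row that satisfies the exact filters; under that assumption, differences in ranking and in extra candidates disappear once the results are reduced to a sorted list of primary keys.

```python
from typing import Any, Callable, Dict, Iterable, List


def bridge_pks(
    candidates: Iterable[Dict[str, Any]],
    pk_field: str,
    predicate: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Reduce fuzzy candidates to a sorted, de-duplicated list of primary keys."""
    return sorted({str(row[pk_field]) for row in candidates if predicate(row)})


def is_active_engineer(row: Dict[str, Any]) -> bool:
    """Exact filter: both fields must match literally."""
    return row.get("department") == "Engineering" and row.get("status") == "active"


# Two hypothetical search runs: different order, different extra candidates.
run_1 = [
    {"user_id": "U001", "department": "Engineering", "status": "active"},
    {"user_id": "U003", "department": "Research", "status": "active"},
    {"user_id": "U005", "department": "Engineering", "status": "active"},
]
run_2 = [
    {"user_id": "U005", "department": "Engineering", "status": "active"},
    {"user_id": "U004", "department": "Engineering", "status": "inactive"},
    {"user_id": "U001", "department": "Engineering", "status": "active"},
]

pks_1 = bridge_pks(run_1, "user_id", is_active_engineer)
pks_2 = bridge_pks(run_2, "user_id", is_active_engineer)

print(pks_1)           # ['U001', 'U005']
print(pks_1 == pks_2)  # True
```

If a qualifying row falls out of the candidate set entirely, the two lists would differ, which is why the fuzzy stage is usually run with a generous `top_k`.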

Separation of Concerns

Compound	Encoding	Optimized For	Use In Pattern
Catalog	Search-optimized	Fast semantic search	Discovery stage
RelationalTable	Structured	CRUD operations	Mutation stage

This respects the design intent:

- Catalog's search-optimized encoding enables fast similarity search
- RelationalTable's structured encoding enables exact field updates

Implementation

FuzzyToExactBridge Class

from typing import List, Dict, Any, Optional
from dataclasses import dataclass


@dataclass
class RefinedResult:
    """Result from fuzzy-to-exact refinement."""
    pk_field: str
    pk_value: str
    data: Dict[str, Any]
    semantic_score: float
    exact_matches: Dict[str, bool]


class FuzzyToExactBridge:
    """
    Bridge that connects semantic search to exact filtering.

    Usage:
        bridge = FuzzyToExactBridge(hb, "users", users_schema)

        results = (bridge
            .fuzzy("machine learning expertise", top_k=50)
            .exact(department="Engineering")
            .exact(status="active")
            .numeric(salary__gt=100000)
            .execute())

        # Results are deterministic - safe for CRUD
        for r in results:
            hb.update("users", where={r.pk_field: r.pk_value}, set={...})
    """

    def __init__(self, client, collection: str, schema):
        self.client = client
        self.collection = collection
        self.schema = schema
        self.pk_field = schema.primary_key

        # Query state
        self._fuzzy_query: Optional[str] = None
        self._fuzzy_top_k: int = 50
        self._exact_filters: Dict[str, Any] = {}
        self._numeric_filters: List[tuple] = []

    def fuzzy(self, query: str, top_k: int = 50) -> "FuzzyToExactBridge":
        """Stage 1: Semantic search to find candidates."""
        self._fuzzy_query = query
        self._fuzzy_top_k = top_k
        return self

    def exact(self, **kwargs) -> "FuzzyToExactBridge":
        """Stage 2a: Exact field match filter."""
        self._exact_filters.update(kwargs)
        return self

    def numeric(self, **kwargs) -> "FuzzyToExactBridge":
        """
        Stage 2b: Numeric range filter.

        Supports: field__gt, field__lt, field__gte, field__lte
        """
        for key, value in kwargs.items():
            if "__" in key:
                field, op = key.rsplit("__", 1)
                self._numeric_filters.append((field, op, value))
            else:
                self._numeric_filters.append((key, "eq", value))
        return self

    def execute(self) -> List[RefinedResult]:
        """Execute the fuzzy-to-exact pipeline."""
        # Stage 1: Fuzzy search
        if self._fuzzy_query:
            candidates = self._semantic_search()
        else:
            candidates = self._get_all_candidates()

        # Stage 2: Exact + Numeric filtering
        refined = []
        for candidate in candidates:
            data = candidate["data"]
            score = candidate.get("score", 1.0)

            # Check exact filters
            exact_matches = {}
            passes_exact = True
            for field, expected in self._exact_filters.items():
                actual = data.get(field)
                matches = str(actual) == str(expected)
                exact_matches[field] = matches
                if not matches:
                    passes_exact = False

            if not passes_exact:
                continue

            # Check numeric filters
            passes_numeric = self._check_numeric_filters(data)
            if not passes_numeric:
                continue

            refined.append(RefinedResult(
                pk_field=self.pk_field,
                pk_value=str(data.get(self.pk_field)),
                data=data,
                semantic_score=score,
                exact_matches=exact_matches,
            ))

        return refined

    def _semantic_search(self) -> List[Dict]:
        """Perform semantic search via query interface."""
        results = self.client.query(self.collection, self.schema).search(
            self._fuzzy_query,
            top_k=self._fuzzy_top_k,
        )
        return [{"data": r.data, "score": getattr(r, "score", 1.0)} for r in results]

    def _get_all_candidates(self) -> List[Dict]:
        """Get all rows when no fuzzy query specified."""
        # NOTE: capped at 1000 rows; larger collections are truncated here
        results = self.client.select(self.collection, limit=1000)
        return [{"data": r, "score": 1.0} for r in results.rows]

    def _check_numeric_filters(self, data: Dict) -> bool:
        """Check all numeric filter conditions."""
        for field, op, threshold in self._numeric_filters:
            try:
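                # A missing field yields None, which raises TypeError below and
                # fails the row instead of silently comparing against 0
                actual = float(data.get(field))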
                threshold = float(threshold)

                if op == "gt" and not (actual > threshold):
                    return False
                elif op == "lt" and not (actual < threshold):
                    return False
                elif op == "gte" and not (actual >= threshold):
                    return False
                elif op == "lte" and not (actual <= threshold):
                    return False
                elif op == "eq" and not (actual == threshold):
                    return False
            except (ValueError, TypeError):
                return False
        return True
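
The `field__op` suffix convention used by `numeric()` can also be expressed as a lookup table of comparison functions. The following is a standalone sketch, not part of the class above: `OPS`, `parse_numeric_filters`, and `passes` are hypothetical names. It rejects unknown suffixes up front and treats a missing or non-numeric value as a failed match.

```python
import operator
from typing import Any, Callable, Dict, List, Tuple

# Suffix -> comparison function. "eq" is used when no suffix is given.
OPS: Dict[str, Callable[[float, float], bool]] = {
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "eq": operator.eq,
}

NumericFilter = Tuple[str, str, float]


def parse_numeric_filters(**kwargs: Any) -> List[NumericFilter]:
    """Turn salary__gt=100000 style keywords into (field, op, threshold) tuples."""
    parsed: List[NumericFilter] = []
    for key, value in kwargs.items():
        field, sep, op = key.rpartition("__")
        if not sep:  # no "__" in the key: plain equality on the whole key
            field, op = key, "eq"
        if op not in OPS:
            raise ValueError(f"unsupported numeric operator: {op!r}")
        parsed.append((field, op, float(value)))
    return parsed


def passes(data: Dict[str, Any], filters: List[NumericFilter]) -> bool:
    """Return True only if every filter holds for this row."""
    for field, op, threshold in filters:
        try:
            actual = float(data[field])
        except (KeyError, TypeError, ValueError):
            return False  # missing or non-numeric value never matches
        if not OPS[op](actual, threshold):
            return False
    return True


filters = parse_numeric_filters(salary__gt=100000, salary__lte=150000)
print(passes({"salary": 150000}, filters))  # True
print(passes({"salary": 100000}, filters))  # False
print(passes({}, filters))                  # False
```

Failing fast on an unknown suffix is a design choice of this sketch: a typo such as `salary__gte_` then surfaces as an error instead of a filter that quietly matches everything.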

Complete Example: Dual-Schema Pattern

This example shows the recommended architecture: Catalog for search, RelationalTable for CRUD.

from hybi import HyperBinder
from hybi.compose import Catalog, RelationalTable, Field, Encoding
import pandas as pd

# ----- SCHEMA DEFINITIONS -----

# Catalog schema: Optimized for semantic search (Bundle encoding)
users_search_schema = Catalog(
    fields={
        "user_id": {"encoding": "exact"},              # For bridging to CRUD
        "name": {"encoding": "semantic", "weight": 1.0},
        "bio": {"encoding": "semantic", "weight": 1.5},  # Boost bio matches
        "department": {"encoding": "exact"},
        "status": {"encoding": "exact"},
    }
)

# RelationalTable schema: Optimized for CRUD (Row encoding)
users_crud_schema = RelationalTable(
    columns={
        "user_id": Field(encoding=Encoding.EXACT),
        "name": Field(encoding=Encoding.SEMANTIC),
        "bio": Field(encoding=Encoding.SEMANTIC),
        "department": Field(encoding=Encoding.EXACT),
        "status": Field(encoding=Encoding.EXACT),
        "salary": Field(encoding=Encoding.NUMERIC, similar_within=10000),
    },
    primary_key="user_id",
)

# ----- DATA -----

users_data = pd.DataFrame([
    {"user_id": "U001", "name": "Alice Chen",
     "bio": "Machine learning engineer with NLP expertise",
     "department": "Engineering", "status": "active", "salary": 150000},
    {"user_id": "U002", "name": "Bob Smith",
     "bio": "Backend developer focusing on APIs",
     "department": "Engineering", "status": "active", "salary": 120000},
    {"user_id": "U003", "name": "Carol White",
     "bio": "ML researcher specializing in computer vision",
     "department": "Research", "status": "active", "salary": 140000},
    {"user_id": "U004", "name": "David Lee",
     "bio": "Data scientist with machine learning background",
     "department": "Engineering", "status": "inactive", "salary": 130000},
    {"user_id": "U005", "name": "Eve Johnson",
     "bio": "Deep learning specialist in NLP",
     "department": "Engineering", "status": "active", "salary": 160000},
])

# ----- INITIALIZATION -----

hb = HyperBinder(local=True, db_path="./fuzzy_exact_demo_db")

# Ingest into BOTH collections
# Search collection: Uses Catalog (fast semantic search)
hb.ingest(users_data, collection="users_search", schema=users_search_schema)

# CRUD collection: Uses RelationalTable (exact field updates)
hb.ingest(users_data, collection="users", schema=users_crud_schema)

# ----- BRIDGE PATTERN -----

# Stage 1: Semantic search via Catalog (fast, fuzzy)
candidates = hb.query("users_search", users_search_schema).search(
    "machine learning NLP expert",
    top_k=10,
)

print(f"Stage 1 - Fuzzy search found {len(candidates)} candidates")

# Stage 2: Exact filtering on candidate data
refined = []
for r in candidates:
    data = r.data
    if (data.get("department") == "Engineering" and
        data.get("status") == "active"):
        refined.append({
            "pk": data["user_id"],
            "data": data,
            "score": getattr(r, "score", 1.0),
        })

print(f"Stage 2 - After exact filters: {len(refined)} results")

# Stage 3: CRUD via RelationalTable (deterministic)
for r in refined:
    pk = r["pk"]

    # Get current row from CRUD collection (has salary field)
    current = hb.query("users", users_crud_schema).get(user_id=pk)

    if current and current.data.get("salary", 0) > 125000:
        print(f"  {pk}: {r['data']['name']} - eligible for update")

        # Safe to mutate - we have a deterministic PK
        hb.update(
            "users",
            where={"user_id": pk},
            set={"salary": int(current.data["salary"] * 1.1)},  # 10% raise
            schema=users_crud_schema,
        )

Output:

Stage 1 - Fuzzy search found 5 candidates
Stage 2 - After exact filters: 3 results
  U001: Alice Chen - eligible for update
  U005: Eve Johnson - eligible for update
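
The counts in the output above can be checked by hand. The sketch below replays the three stages over the same five rows in plain Python, with no HyperBinder calls. It assumes Stage 1 returns all five rows, which is what the printed count implies given `top_k=10`; the dictionary `raises` is a hypothetical name for the planned updates.

```python
users = [
    {"user_id": "U001", "name": "Alice Chen", "department": "Engineering",
     "status": "active", "salary": 150000},
    {"user_id": "U002", "name": "Bob Smith", "department": "Engineering",
     "status": "active", "salary": 120000},
    {"user_id": "U003", "name": "Carol White", "department": "Research",
     "status": "active", "salary": 140000},
    {"user_id": "U004", "name": "David Lee", "department": "Engineering",
     "status": "inactive", "salary": 130000},
    {"user_id": "U005", "name": "Eve Johnson", "department": "Engineering",
     "status": "active", "salary": 160000},
]

# Stage 1 stand-in: every row is a candidate (assumption, see lead-in).
candidates = list(users)

# Stage 2: exact filters on department and status.
refined = [
    u for u in candidates
    if u["department"] == "Engineering" and u["status"] == "active"
]

# Stage 3: salary threshold, then the same 10% raise as the example.
raises = {
    u["user_id"]: int(u["salary"] * 1.1)
    for u in refined
    if u["salary"] > 125000
}

print(len(candidates), len(refined))  # 5 3
print(raises)                         # {'U001': 165000, 'U005': 176000}
```

Bob Smith (U002) passes the exact filters but not the salary threshold, which is why Stage 2 reports three results while only two rows are updated.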

Single-Schema Alternative

If you don't need separate optimization, RelationalTable can do both (with some search overhead):

# Single schema handles both search and CRUD
users_schema = RelationalTable(
    columns={
        "user_id": Field(encoding=Encoding.EXACT),
        "name": Field(encoding=Encoding.SEMANTIC),
        "bio": Field(encoding=Encoding.SEMANTIC),
        "department": Field(encoding=Encoding.EXACT),
        "status": Field(encoding=Encoding.EXACT),
        "salary": Field(encoding=Encoding.NUMERIC, similar_within=10000),
    },
    primary_key="user_id",
)

# Ingest once
hb.ingest(users_data, collection="users", schema=users_schema)

# Search and CRUD on same collection
candidates = hb.query("users", users_schema).search("ML expert", top_k=10)
# ... filter and mutate ...

Trade-off: Simpler setup, but search is slower than with a dedicated Catalog.

When to Use This Pattern

Use the bridge pattern when:

- You need semantic discovery ("find people like X")
- You require deterministic results for mutations
- Your data has both semantic fields (name, bio) and exact fields (department, status)

Don't use this pattern when:

- You already know the exact primary keys
- Pure exact matching suffices (use filter() directly)
- You only need read-only semantic search (no mutations)

Architecture Notes

Dual-Schema Architecture (Recommended)

Collection	Compound	Encoding	Purpose
`users_search`	Catalog	Search-optimized	Fast semantic discovery
`users`	RelationalTable	Structured	CRUD operations

Benefits:

- Each compound is optimized for its use case
- Search performance isn't degraded by CRUD overhead
- Clear separation of concerns

Trade-off: Data is stored twice (search index + CRUD collection). Keep them in sync by re-ingesting when source data changes.
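
Because the data lives in two places, it can help to check for drift between them. The sketch below compares the primary keys found in each collection. `pk_drift` is a hypothetical helper, and how the two key lists are fetched from HyperBinder is deliberately left out, since that depends on API details not covered here.

```python
from typing import Dict, Iterable, Set


def pk_drift(search_pks: Iterable[str], crud_pks: Iterable[str]) -> Dict[str, Set[str]]:
    """Report primary keys present in one collection but not the other."""
    search_set, crud_set = set(search_pks), set(crud_pks)
    return {
        # Rows that exist for CRUD but cannot be discovered by search yet.
        "missing_from_search": crud_set - search_set,
        # Search entries whose row no longer exists in the CRUD collection.
        "orphaned_in_search": search_set - crud_set,
    }


drift = pk_drift(
    search_pks=["U001", "U002", "U006"],
    crud_pks=["U001", "U002", "U003"],
)
print(drift["missing_from_search"])  # {'U003'}
print(drift["orphaned_in_search"])   # {'U006'}
```

Two empty sets mean the collections agree on which rows exist; it says nothing about whether field values match, so it is a cheap first check rather than a full comparison.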

Single-Schema Architecture

If storage/sync overhead is a concern, use RelationalTable for both:

Collection	Compound	Encoding	Purpose
`users`	RelationalTable	Row	Both search and CRUD

Benefits:

- Single source of truth
- No sync issues

Trade-off: Search is slower than with a dedicated Catalog.

Sync Strategy

When using dual-schema, keep collections in sync:

# Option 1: Full re-ingest (simple, slower)
def sync_collections(data, hb):
    hb.ingest(data, collection="users_search", schema=users_search_schema)
    hb.ingest(data, collection="users", schema=users_crud_schema)

# Option 2: Incremental (complex, faster)
# - CRUD changes go to RelationalTable only
# - Periodic batch sync to Catalog for search
# - Accept that search may be slightly stale

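The bridge pattern tolerates either sync strategy because primary keys are stable identifiers: discovery reads PKs from the search index, and every mutation targets the CRUD collection by PK. If the search index may be stale, re-read each row from the CRUD collection and re-check its exact fields before mutating, as Stage 3 of the complete example does for salary.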
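
Option 2 above is only outlined in comments. One way to sketch it is a small buffer that records changed rows and pushes them to the search side in batches. `PendingSync` and its `push_batch` callback are hypothetical: in a real setup the callback would wrap whatever call refreshes the search collection, and whether that call replaces existing rows or appends duplicates is an assumption to verify against the client.

```python
from typing import Any, Callable, Dict, List


class PendingSync:
    """Buffer rows changed in the CRUD collection until the next batch push."""

    def __init__(
        self,
        push_batch: Callable[[List[Dict[str, Any]]], None],
        pk_field: str = "user_id",
    ) -> None:
        self._push_batch = push_batch
        self._pk_field = pk_field
        self._dirty: Dict[str, Dict[str, Any]] = {}

    def record(self, row: Dict[str, Any]) -> None:
        # A later write to the same PK overwrites the earlier one,
        # so only the latest version of each row is pushed.
        self._dirty[str(row[self._pk_field])] = dict(row)

    def flush(self) -> int:
        """Push all buffered rows in PK order; return how many were pushed."""
        if not self._dirty:
            return 0
        batch = [self._dirty[pk] for pk in sorted(self._dirty)]
        self._push_batch(batch)
        self._dirty.clear()
        return len(batch)


pushed: List[Dict[str, Any]] = []
sync = PendingSync(push_batch=pushed.extend)  # stand-in for a real search update

sync.record({"user_id": "U001", "salary": 165000})
sync.record({"user_id": "U005", "salary": 176000})
sync.record({"user_id": "U001", "salary": 170000})  # supersedes the first write

count = sync.flush()
print(count)   # 2
print(pushed)  # [{'user_id': 'U001', 'salary': 170000}, {'user_id': 'U005', 'salary': 176000}]
```

Between flushes the search index lags behind the CRUD collection, which is the staleness the comments above accept in exchange for fewer writes to the search side.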