Intersections

Intersections connect collections so that a single query can span them. They let you pipe a query across collections specialized for different jobs, such as semantic search, fuzzy matching, and exact lookups, as specified by composition schemas.

With intersections you declare cross-collection logic once, instead of writing custom code that stitches together several database types.

The Problem

By default, collections are isolated islands:

flowchart LR
    E[Employees] ~~~ X[Expertise] ~~~ P[Projects]

You can query each independently, but you can't ask questions that span them—like "What skills does the ML team have?" or "Which projects need Python experts?"

The Solution

Intersections declare relationships between collections:

# Declare: employees.employee_id links to expertise.subject
hb.intersect("employees.employee_id", "expertise.subject")

Now the collections are connected:

flowchart LR
    E[Employees] <-->|employee_id = subject| X[Expertise]

And you can query across them:

results = (
    hb.query("employees")
    .search("ML engineer")
    .join("expertise")
)

for r in results:
    print(f"{r.source['name']} knows {r.target['skill']}")

Three Types of Matching

Every intersection picks one of three strategies for deciding whether a source row matches a target row:

| Relation | How it matches | Use for |
|---|---|---|
| identity | Exact value equality | IDs, foreign keys, categorical values |
| semantic | Embedding cosine similarity | Text content, descriptions, fuzzy matching |
| link | Explicit source→target table | Cross-encoding joins, curated mappings |

# Identity: exact match on IDs
hb.intersect("orders.customer_id", "customers.id")

# Semantic: fuzzy match on text
hb.intersect("emails.content", "projects.description", relation="semantic")

# Link: a hand-declared correspondence (see below)
ix = hb.intersect_flexible("employees.employee_id", "expertise.topic")
hb.populate_links(ix, links_df, "emp_id", "topic")

identity and semantic infer matches from the values themselves. link skips inference entirely — it looks matches up in a table you populate ahead of time. That table lives in its own link collection, which is itself a first-class object.
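
To make the difference between the two inferential strategies concrete, here is a minimal plain-Python sketch. It does not use HyperBinder; the helper names (`identity_match`, `semantic_match`), the toy two-dimensional vectors, and the `0.8` threshold are assumptions chosen for illustration only.

```python
import math

def identity_match(source_value, target_rows, field):
    """Return target rows whose field equals source_value exactly."""
    return [row for row in target_rows if row[field] == source_value]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_match(source_vec, target_rows, field, threshold=0.8):
    """Return (row, score) pairs whose cosine similarity meets the threshold."""
    scored = [(row, cosine(source_vec, row[field])) for row in target_rows]
    return [(row, score) for row, score in scored if score >= threshold]

# Identity: exact equality on an ID field.
customers = [{"id": "C1", "name": "Ada"}, {"id": "C2", "name": "Lin"}]
identity_hits = identity_match("C2", customers, "id")

# Semantic: similarity between (toy) embedding vectors.
projects = [
    {"name": "billing", "vec": [1.0, 0.0]},
    {"name": "search", "vec": [0.0, 1.0]},
]
semantic_hits = semantic_match([0.9, 0.1], projects, "vec")
```

Both functions derive the match from the values alone, which is exactly what a link relation avoids.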

Strict vs Flexible Mode

By default, intersections use strict mode, which only allows connections between fields of the same encoding type (EXACT↔EXACT, SEMANTIC↔SEMANTIC).

Flexible mode enables cross-encoding intersections using link-style explicit mappings. Declaring an intersection with relation="link" automatically enables FLEXIBLE mode.

| Mode | Allowed Pairs | Use When |
|---|---|---|
| STRICT (default) | Same encoding types | Fields share natural equality or semantic comparability |
| FLEXIBLE | Any encoding types | Fields are related but the system can't infer how |
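
The rule in the table can be sketched as a small validation function. This is an illustrative model, not HyperBinder's implementation: the `Mode` enum and `pair_allowed` helper are hypothetical names, and encodings are represented as plain strings.

```python
from enum import Enum

class Mode(Enum):
    STRICT = "strict"
    FLEXIBLE = "flexible"

def pair_allowed(source_encoding: str, target_encoding: str, mode: Mode) -> bool:
    """STRICT requires matching encodings; FLEXIBLE accepts any pair."""
    if mode is Mode.FLEXIBLE:
        return True
    return source_encoding == target_encoding

same = pair_allowed("EXACT", "EXACT", Mode.STRICT)          # same encoding
crossed = pair_allowed("EXACT", "SEMANTIC", Mode.STRICT)    # cross-encoding
linked = pair_allowed("EXACT", "SEMANTIC", Mode.FLEXIBLE)   # explicit mapping
```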

Some joins can't be inferred from the data alone. Consider:

employees.employee_id = "EMP001"        # EXACT encoding — opaque identifier
expertise.topic       = "machine learning"  # SEMANTIC encoding — natural language

Neither of the two inferential strategies works here:

  • identity fails: "EMP001" != "machine learning". Exact equality has nothing to work with.
  • semantic fails: the embedding of the opaque string "EMP001" has nothing to do with the embedding of "machine learning", so cosine similarity is near-random. The two fields are encoded in different spaces calibrated to different distributions (see Embeddings for why per-collection calibration prevents cross-encoding vector comparison).

The real correspondence between employees and topics lives outside the data itself — in an HR record, a labeling process, an ETL pipeline. Links let you surface that correspondence as a first-class object without forcing it into either collection's schema.

Link joins are a table lookup, not a similarity computation:

  1. You populate a link collection — a pre-computed mapping of source values to target values (one-to-many supported).
  2. At join time, link_glue takes each source row, looks up its target values in the table, and emits a JoinedResult for each target row matching those values.
  3. No embedding, no threshold: the match is deterministic and cheap (one dict lookup per source row).

# Link data: source_value → [target_value, ...]
# {
#   "EMP001": ["machine learning", "deep learning"],   # one-to-many is fine
#   "EMP002": ["databases"],
#   "EMP003": ["cloud computing"],
# }

# Join
results = hb.query("employees").filter(employee_id="EMP001").join("expertise")
# Emits one JoinedResult per expertise row matching "machine learning" OR "deep learning"

The runtime cost of a link join is O(sources × avg_fan_out) — no vector math, no re-encoding. That makes links the cheapest cross-collection join at query time, once the link collection is populated.
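
The lookup described above can be modeled in a few lines of plain Python. This is a sketch under stated assumptions, not the actual `link_glue` code: the `link_join` function and the sample rows (names, skills) are hypothetical, and the link table is a plain dict of source value to target values.

```python
def link_join(source_rows, source_field, link_table, target_rows, target_field):
    """Return (source, target) pairs by looking source values up in link_table."""
    # Index targets by value once so each later lookup is a dict access.
    targets_by_value = {}
    for row in target_rows:
        targets_by_value.setdefault(row[target_field], []).append(row)

    joined = []
    for src in source_rows:
        # One dict lookup per source row, then fan out over its linked values.
        for target_value in link_table.get(src[source_field], []):
            for tgt in targets_by_value.get(target_value, []):
                joined.append((src, tgt))
    return joined

link_table = {
    "EMP001": ["machine learning", "deep learning"],
    "EMP002": ["databases"],
}
employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [
    {"topic": "machine learning", "skill": "PyTorch"},
    {"topic": "databases", "skill": "PostgreSQL"},
    {"topic": "deep learning", "skill": "CNNs"},
]
pairs = link_join(employees, "employee_id", link_table, expertise, "topic")
```

The loop touches each source row once and each of its linked values once, which is where the O(sources × avg_fan_out) cost comes from.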

A link collection isn't a hidden implementation detail — it's a regular HyperBinder collection you can query, update, and version independently of either end.

Bidirectional. Joins work in either direction off the same declaration:

# Forward: employees → expertise
results = hb.query("employees").search("Alice").join("expertise")

# Reverse: expertise → employees
results = hb.query("expertise").search("machine learning").join("employees")
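
One way to picture how a single declaration can serve both directions is to index the same pairs twice. The sketch below is an assumption about the idea, not a description of HyperBinder internals; `build_indexes` and the `EMP002` pair are hypothetical.

```python
from collections import defaultdict

def build_indexes(pairs):
    """Build forward (source -> targets) and reverse (target -> sources) maps."""
    forward, reverse = defaultdict(list), defaultdict(list)
    for source_value, target_value in pairs:
        forward[source_value].append(target_value)
        reverse[target_value].append(source_value)
    return dict(forward), dict(reverse)

pairs = [
    ("EMP001", "machine learning"),
    ("EMP001", "deep learning"),
    ("EMP002", "machine learning"),
]
forward, reverse = build_indexes(pairs)
```

A forward join reads `forward`, a reverse join reads `reverse`, and both are derived from the same set of links.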

One-to-many. A single source value can link to many target values:

LinkSet.from_pairs([
    ("EMP001", "machine learning"),
    ("EMP001", "deep learning"),
    ("EMP001", "ml-ops"),
])

Weighted. Links carry an optional weight that aggregates into the join score:

import pandas as pd

links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP001"],
    "topic":  ["machine learning", "ml-ops"],
    "confidence": [0.95, 0.60],
})
hb.populate_links(ix, links_df, "emp_id", "topic", weight_column="confidence")
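
This page does not spell out how a weight aggregates into the join score. Purely as an illustration, the sketch below assumes the simplest rule, multiplying the source row's search score by the link weight; the function name and the `0.8` search score are hypothetical, and the real aggregation may differ.

```python
def weighted_join_scores(source_score, weighted_links):
    """Combine a source row's score with each link weight by multiplication.

    ASSUMPTION: multiplication is an illustrative aggregation rule only.
    """
    return {target: source_score * weight for target, weight in weighted_links}

links_for_emp001 = [("machine learning", 0.95), ("ml-ops", 0.60)]
scores = weighted_join_scores(0.8, links_for_emp001)

# Higher-confidence links rank first under this rule.
ranked = sorted(scores, key=scores.get, reverse=True)
```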

Updatable. Links can be refreshed as ground truth changes: reassigning EMP001 to a new topic means rewriting the link collection, not mass-updating employees or expertise. Note that populate_links() fully replaces the link set for that intersection, so pass the complete updated set rather than only the changed rows.

Introspectable via metadata. Each Link can carry arbitrary metadata (source system, annotation timestamp, reviewer) — useful for auditing where a given correspondence came from.

See Intersections API for the full reference on Link, LinkSet, populate_links(), and intersect_flexible().

Links are the right tool when:

  • The fields have different encodings and no amount of fine-tuning will make similarity work (EXACT ↔ SEMANTIC, HIERARCHICAL ↔ NUMERIC, etc.).
  • Ground truth lives elsewhere — an HR system, a labeled dataset, a manual curation process owns the correspondence.
  • You want versioning — links update independently; you can ship new correspondences without re-ingesting either collection.
  • Matches need to be many-to-many without denormalizing either side.

Alternatives to consider first:

  • Denormalization — if the correspondence is stable and small, just copy the target value into the source collection at ingest. Zero query-time overhead.
  • A shared encoding — if both fields could be made EXACT (both IDs) or both SEMANTIC (both natural-language descriptions), STRICT mode is simpler.
  • Multihop — for relationships within one collection, the multihop primitive is more direct than intersections.

Reach for links when the relationship is genuinely external to both collections and inference can't bridge it.

Chaining Joins

Connect multiple collections in one query:

results = (
    hb.query("employees")
    .search("senior engineer")
    .join("expertise")      # employees → expertise
    .join("projects")       # expertise → projects
    .join("budgets")        # projects → budgets
)

flowchart LR
    E[Employees] --> X[Expertise] --> P[Projects] --> B[Budgets]
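
Conceptually, each `.join()` extends every partial result with the matching rows of the next collection. The plain-Python sketch below models that with exact-equality steps only. The `join_step` helper, the field names (`required_skill`, `amount`), and the inner-join behavior (rows without a match are dropped) are assumptions for illustration.

```python
def join_step(chains, source_field, target_rows, target_field):
    """Extend each chain with every target row matching its last element."""
    extended = []
    for chain in chains:
        last = chain[-1]
        for tgt in target_rows:
            if source_field in last and last[source_field] == tgt[target_field]:
                extended.append(chain + [tgt])
    return extended

employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [{"subject": "EMP001", "skill": "python"}]
projects = [{"required_skill": "python", "project_id": "P1"}]
budgets = [{"project_id": "P1", "amount": 50000}]

chains = [[e] for e in employees]
chains = join_step(chains, "employee_id", expertise, "subject")   # employees -> expertise
chains = join_step(chains, "skill", projects, "required_skill")   # expertise -> projects
chains = join_step(chains, "project_id", budgets, "project_id")   # projects -> budgets
```

Each chain ends up holding one row per collection, in the order the joins were applied.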

The Bridge Pattern

A useful architecture for connecting heterogeneous data routes fuzzy text through an entity layer to exact records:

flowchart LR
    D["Documents<br/>(fuzzy text)"] <-->|semantic| K["Knowledge Graph<br/>(entities)"] <-->|identity| T["Tables<br/>(exact data)"]

The Knowledge Graph acts as an index—semantic search finds relevant entities, which link to exact structured records.

Example: Find budget info for projects mentioned in emails:

hb.intersect("emails.content", "projects.description", relation="semantic")
hb.intersect("projects.project_id", "budgets.project_id", relation="identity")

results = (
    hb.query("emails")
    .search("Q2 budget concerns")
    .join("projects")   # semantic: email text → project
    .join("budgets")    # identity: project ID → budget
)

Match Quality

Each joined result has a status:

| Status | Meaning |
|---|---|
| MATCHED | Confident match found |
| NULL | Ambiguous (multiple close candidates) |
| NO_MATCH | No match above threshold |

for r in results:
    if r.is_matched:
        # Safe to use r.target
        print(f"{r.source['name']} knows {r.target['skill']}")
    elif r.is_no_match:
        print(f"{r.source['name']} has no matching expertise")
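
The three statuses can be read as a decision over candidate scores. The sketch below is one plausible interpretation of the table, not the library's actual logic: the `classify` function, the `0.7` threshold, and the `0.05` ambiguity margin are all hypothetical.

```python
def classify(scores, threshold=0.7, margin=0.05):
    """Map candidate similarity scores to MATCHED, NULL, or NO_MATCH."""
    above = sorted((s for s in scores if s >= threshold), reverse=True)
    if not above:
        return "NO_MATCH"   # nothing clears the threshold
    if len(above) > 1 and above[0] - above[1] < margin:
        return "NULL"       # top candidates are too close to pick one
    return "MATCHED"        # a single clear winner

clear = classify([0.9, 0.4])
ambiguous = classify([0.9, 0.88])
missing = classify([0.3])
```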

When to Use Intersections

Use intersections when:

  • Data naturally lives in separate collections (different schemas)
  • You need to answer questions spanning multiple data types
  • You want to connect fuzzy (semantic) and exact (symbolic) data

Don't use intersections when:

  • All data fits in one collection
  • Relationships are within the same collection (use multihop instead)

Next Steps