Intersections

Intersections connect collections so that a single query can span them. Following a composition schema, they let you pipe a query across collections specialized for different jobs, such as semantic search, fuzzy matching, and exact lookups.

Without intersections, this kind of cross-collection logic usually requires custom code spanning several database types.

The Problem

By default, collections are isolated islands:

flowchart LR
    E[Employees] ~~~ X[Expertise] ~~~ P[Projects]

You can query each independently, but you can't ask questions that span them—like "What skills does the ML team have?" or "Which projects need Python experts?"

The Solution

Intersections declare relationships between collections:

# Declare: employees.employee_id links to expertise.subject
hb.intersect("employees.employee_id", "expertise.subject")

Now the collections are connected:

flowchart LR
    E[Employees] <-->|employee_id = subject| X[Expertise]

And you can query across them:

results = (
    hb.query("employees")
    .search("ML engineer")
    .join("expertise")
)

for r in results:
    print(f"{r.source['name']} knows {r.target['skill']}")

Two Types of Matching

| Relation | How it matches | Use for |
|----------|----------------|---------|
| identity | Exact value equality | IDs, foreign keys, categories |
| semantic | Embedding similarity | Text content, descriptions |

# Identity: exact match on IDs
hb.intersect("orders.customer_id", "customers.id")

# Semantic: fuzzy match on text
hb.intersect("emails.content", "projects.description", relation="semantic")
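
To make the difference between the two relations concrete, here is a minimal, library-independent sketch. It is not how HyperBinder implements matching: `identity_match`, `semantic_match`, and the bag-of-words `embed` function are hypothetical stand-ins (a real system would use learned embeddings), and the 0.3 threshold and sample data are arbitrary.

```python
import math
from collections import Counter

def identity_match(left_value, right_values):
    """Return the indices of right_values that are exactly equal to left_value."""
    return [i for i, v in enumerate(right_values) if v == left_value]

def embed(text):
    """Toy embedding: bag-of-words counts (assumption: stands in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def semantic_match(left_text, right_texts, threshold=0.3):
    """Return (index, score) of the most similar text, or None if below threshold."""
    query = embed(left_text)
    scored = [(i, cosine(query, embed(t))) for i, t in enumerate(right_texts)]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best and best[1] >= threshold else None

# Made-up sample data for illustration only
customer_ids = ["C1", "C2", "C3"]
descriptions = ["migrate billing database to cloud", "train fraud detection model"]

exact_hits = identity_match("C2", customer_ids)
fuzzy_hit = semantic_match("question about the fraud detection model", descriptions)
```

Identity matching either finds an equal value or it does not, while semantic matching returns a best candidate with a score that has to clear a threshold.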

Strict vs Flexible Mode

By default, intersections use strict mode, which only allows connections between fields of the same encoding type (EXACT↔EXACT, SEMANTIC↔SEMANTIC).

Flexible mode enables cross-encoding intersections using explicit links—declared mappings that tell HyperBinder exactly which values correspond.

| Mode | Allowed Pairs | Use When |
|------|---------------|----------|
| STRICT (default) | Same encoding types | Fields share natural equality |
| FLEXIBLE | Any encoding types | Need explicit value mappings |
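
The rule in the table can be sketched as a small validation function. This is only an illustration of the rule as described above: the `Encoding` and `Mode` enums and `pair_allowed` are hypothetical names, not HyperBinder classes.

```python
from enum import Enum

class Encoding(Enum):
    EXACT = "exact"
    SEMANTIC = "semantic"

class Mode(Enum):
    STRICT = "strict"
    FLEXIBLE = "flexible"

def pair_allowed(left, right, mode=Mode.STRICT):
    """Strict mode requires matching encodings; flexible mode accepts any pair."""
    if mode is Mode.FLEXIBLE:
        return True
    return left is right

# Strict mode rejects a mixed pair; flexible mode accepts it
strict_mixed = pair_allowed(Encoding.EXACT, Encoding.SEMANTIC)
flexible_mixed = pair_allowed(Encoding.EXACT, Encoding.SEMANTIC, Mode.FLEXIBLE)
```

In this sketch flexible mode only lifts the encoding check; the explicit links still have to say which values correspond.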

When to Use Flexible Mode

Flexible mode solves the cross-encoding problem:

# Problem: EXACT employee IDs don't match SEMANTIC topic descriptions
employees.employee_id = "EMP001"       # EXACT encoding
expertise.topic = "machine learning"   # SEMANTIC encoding

# "EMP001" and "machine learning" are semantically unrelated,
# but we need to connect them for queries!

The solution: Explicitly declare which values link together.

import pandas as pd

# 1. Declare flexible intersection
ix = hb.intersect_flexible("employees.employee_id", "expertise.topic")

# 2. Provide the link mappings
links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP002", "EMP003"],
    "topic": ["machine learning", "databases", "cloud computing"]
})
hb.populate_links(ix, links_df, "emp_id", "topic")

# 3. Now cross-type joins work!
results = hb.query("employees").filter(employee_id="EMP001").join("expertise")
# Returns: EMP001 → machine learning

Links are bidirectional by default—you can join in either direction:

# Forward: employees → expertise
results = hb.query("employees").search("Alice").join("expertise")

# Reverse: expertise → employees
results = hb.query("expertise").search("machine learning").join("employees")
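
One way to picture bidirectional links is as a pair of lookup tables kept in sync. The sketch below is a conceptual model only: `ToyLinkSet` is a hypothetical class written for this example and is not the `LinkSet` from the Intersections API.

```python
from collections import defaultdict

class ToyLinkSet:
    """Hypothetical stand-in for a set of explicit links; stores both directions."""

    def __init__(self):
        self.forward = defaultdict(set)   # source value -> target values
        self.reverse = defaultdict(set)   # target value -> source values

    def add(self, source_value, target_value):
        # Recording the link in both maps is what makes it bidirectional
        self.forward[source_value].add(target_value)
        self.reverse[target_value].add(source_value)

    def targets(self, source_value):
        return sorted(self.forward.get(source_value, ()))

    def sources(self, target_value):
        return sorted(self.reverse.get(target_value, ()))

links = ToyLinkSet()
for emp_id, topic in [
    ("EMP001", "machine learning"),
    ("EMP002", "databases"),
    ("EMP003", "cloud computing"),
]:
    links.add(emp_id, topic)

print(links.targets("EMP001"))    # forward lookup: employee -> topics
print(links.sources("databases")) # reverse lookup: topic -> employees
```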

See Intersections API for full reference on Link, LinkSet, and populate_links().

Chaining Joins

Connect multiple collections in one query:

results = (
    hb.query("employees")
    .search("senior engineer")
    .join("expertise")      # employees → expertise
    .join("projects")       # expertise → projects
    .join("budgets")        # projects → budgets
)

flowchart LR
    E[Employees] --> X[Expertise] --> P[Projects] --> B[Budgets]
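
Conceptually, a chain of joins feeds the targets of one hop into the next hop as its sources. The plain-Python sketch below illustrates that idea with identity matching only; the `join` helper, the field names, and the sample rows are all made up for this example and say nothing about how HyperBinder executes a chain.

```python
def join(rows, target_rows, left_key, right_key):
    """Pair each incoming row with every target row whose right_key equals its left_key."""
    index = {}
    for t in target_rows:
        index.setdefault(t[right_key], []).append(t)
    return [(row, t) for row in rows for t in index.get(row[left_key], [])]

# Made-up sample collections
employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [{"subject": "EMP001", "skill": "Python", "project_id": "P1"}]
projects = [{"project_id": "P1", "title": "Recommender"}]

# Hop 1: employees -> expertise
hop1 = join(employees, expertise, "employee_id", "subject")

# Hop 2: the targets of hop 1 become the sources of hop 2
hop2 = join([t for _, t in hop1], projects, "project_id", "project_id")
```

Each hop can only narrow or fan out the rows it receives, so a source row with no match at one hop contributes nothing to later hops in this sketch.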

The Bridge Pattern

The bridge pattern connects heterogeneous data by routing queries through an intermediate collection:

flowchart LR
    D["Documents<br/>(fuzzy text)"] <-->|semantic| K["Knowledge Graph<br/>(entities)"] <-->|identity| T["Tables<br/>(exact data)"]

The Knowledge Graph acts as an index—semantic search finds relevant entities, which link to exact structured records.

Example: Find budget info for projects mentioned in emails:

hb.intersect("emails.content", "projects.description", relation="semantic")
hb.intersect("projects.project_id", "budgets.project_id", relation="identity")

results = (
    hb.query("emails")
    .search("Q2 budget concerns")
    .join("projects")   # semantic: email text → project
    .join("budgets")    # identity: project ID → budget
)

Match Quality

Each joined result has a status:

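| Status | Meaning |
|--------|---------|
| MATCHED | Confident match found |
| NULL | Ambiguous (multiple close candidates) |
| NO_MATCH | No match above threshold |

for r in results:
    if r.is_matched:
        # Safe to use r.target
        print(f"{r.source['name']} knows {r.target['skill']}")
    elif r.is_no_match:
        print(f"{r.source['name']} has no matching expertise")
    else:
        # Remaining status is NULL: ambiguous, multiple close candidates
        print(f"{r.source['name']} has an ambiguous match")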
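
The three statuses can be read as a decision over candidate similarity scores. The sketch below shows one plausible rule, assuming a score threshold and a minimum gap between the top two candidates; the `classify` function, the 0.7 threshold, and the 0.05 margin are assumptions for illustration, not HyperBinder's actual values or logic.

```python
from enum import Enum

class Status(Enum):
    MATCHED = "MATCHED"
    NULL = "NULL"
    NO_MATCH = "NO_MATCH"

def classify(scores, threshold=0.7, margin=0.05):
    """Classify candidate similarity scores into one of the three statuses."""
    above = sorted((s for s in scores if s >= threshold), reverse=True)
    if not above:
        return Status.NO_MATCH   # nothing clears the threshold
    if len(above) > 1 and above[0] - above[1] < margin:
        return Status.NULL       # top candidates are too close to call
    return Status.MATCHED        # one clear winner

clear_winner = classify([0.9, 0.4])
too_close = classify([0.9, 0.88])
nothing_close = classify([0.3, 0.2])
```

Under this rule a NULL result still means good candidates exist, which is why it is worth handling separately from NO_MATCH.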

When to Use Intersections

Use intersections when:

  • Data naturally lives in separate collections (different schemas)
  • You need to answer questions spanning multiple data types
  • You want to connect fuzzy (semantic) and exact (symbolic) data

Don't use intersections when:

  • All data fits in one collection
  • Relationships are within the same collection (use multihop instead)

Next Steps