Intersections
Intersections are the glue that connects collections. They let you pipe a query across collections specialized for different jobs, such as semantic search, fuzzy matching, and exact lookups, as specified by composition schemas.
They let you express cross-collection logic that would otherwise require custom code spanning multiple database types.
The Problem
By default, collections are isolated islands:
flowchart LR
E[Employees] ~~~ X[Expertise] ~~~ P[Projects]
You can query each independently, but you can't ask questions that span them—like "What skills does the ML team have?" or "Which projects need Python experts?"
The Solution
Intersections declare relationships between collections:
# Declare: employees.employee_id links to expertise.subject
hb.intersect("employees.employee_id", "expertise.subject")
Now the collections are connected:
flowchart LR
E[Employees] <-->|employee_id = subject| X[Expertise]
And you can query across them:
results = (
    hb.query("employees")
    .search("ML engineer")
    .join("expertise")
)
for r in results:
    print(f"{r.source['name']} knows {r.target['skill']}")
Three Types of Matching
Every intersection picks one of three strategies for deciding whether a source row matches a target row:
| Relation | How it matches | Use for |
|---|---|---|
| identity | Exact value equality | IDs, foreign keys, categorical values |
| semantic | Embedding cosine similarity | Text content, descriptions, fuzzy matching |
| link | Explicit source→target table | Cross-encoding joins, curated mappings |
# Identity: exact match on IDs
hb.intersect("orders.customer_id", "customers.id")
# Semantic: fuzzy match on text
hb.intersect("emails.content", "projects.description", relation="semantic")
# Link: a hand-declared correspondence (see below)
ix = hb.intersect_flexible("employees.employee_id", "expertise.topic")
hb.populate_links(ix, links_df, "emp_id", "topic")
identity and semantic infer matches from the values themselves. link skips inference entirely — it looks matches up in a table you populate ahead of time. That table lives in its own link collection, which is itself a first-class object.
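To make the contrast concrete, here is a minimal plain-Python sketch of the three strategies. It is not HyperBinder's implementation: the function names, the 0.8 threshold, and the toy vectors are all assumptions made for illustration.

```python
import math


def identity_match(source_value, target_value):
    """Identity: match only on exact value equality."""
    return source_value == target_value


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_match(source_vec, target_vec, threshold=0.8):
    """Semantic: match when embedding similarity clears a threshold (value assumed)."""
    return cosine(source_vec, target_vec) >= threshold


def link_match(source_value, target_value, link_table):
    """Link: match only if the pair was declared ahead of time."""
    return target_value in link_table.get(source_value, ())


# Toy data, invented for this example.
link_table = {"EMP001": {"machine learning", "deep learning"}}

identity_hit = identity_match("C-42", "C-42")
semantic_hit = semantic_match([1.0, 0.0], [0.9, 0.1])
semantic_miss = semantic_match([1.0, 0.0], [0.0, 1.0])
link_hit = link_match("EMP001", "deep learning", link_table)
link_miss = link_match("EMP002", "deep learning", link_table)
```

The first two functions decide from the values (or their vectors) alone; the third never inspects the values beyond using them as lookup keys.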
Strict vs Flexible Mode
By default, intersections use strict mode, which only allows connections between fields of the same encoding type (EXACT↔EXACT, SEMANTIC↔SEMANTIC).
Flexible mode enables cross-encoding intersections using link-style explicit mappings. Declaring an intersection with relation="link" automatically enables FLEXIBLE mode.
| Mode | Allowed Pairs | Use When |
|---|---|---|
| STRICT (default) | Same encoding types | Fields share natural equality or semantic comparability |
| FLEXIBLE | Any encoding types | Fields are related but the system can't infer how |
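The rule in the table can be read as a small validation check. The sketch below is an assumption about how such a check might look, not the library's code; `Mode`, `pair_allowed`, and `resolve_mode` are hypothetical names.

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    FLEXIBLE = "flexible"


def pair_allowed(source_encoding, target_encoding, mode=Mode.STRICT):
    """Return True if the two field encodings may be intersected under `mode`."""
    if mode is Mode.FLEXIBLE:
        return True  # any pair is allowed; matching then relies on explicit links
    return source_encoding == target_encoding


def resolve_mode(relation, mode=Mode.STRICT):
    """relation="link" switches the intersection to FLEXIBLE, per the text above."""
    return Mode.FLEXIBLE if relation == "link" else mode


strict_same = pair_allowed("EXACT", "EXACT")
strict_mixed = pair_allowed("EXACT", "SEMANTIC")
flexible_mixed = pair_allowed("EXACT", "SEMANTIC", resolve_mode("link"))
```

Here `strict_mixed` is rejected while `flexible_mixed` is accepted, which is the only behavioral difference the two modes have in this sketch.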
Why Links Exist
Some joins can't be inferred from the data alone. Consider:
employees.employee_id = "EMP001" # EXACT encoding — opaque identifier
expertise.topic = "machine learning" # SEMANTIC encoding — natural language
Neither of the two inferential strategies works here:
- identity fails: "EMP001" != "machine learning". Exact equality has nothing to work with.
- semantic fails: the embedding of the opaque string "EMP001" has nothing to do with the embedding of "machine learning". Cosine similarity is near-random. The fields are encoded in different spaces calibrated to different distributions (see Embeddings for why per-collection calibration prevents cross-encoding vector comparison).
The real correspondence between employees and topics lives outside the data itself — in an HR record, a labeling process, an ETL pipeline. Links let you surface that correspondence as a first-class object without forcing it into either collection's schema.
How Link Matching Works
Link joins are a table lookup, not a similarity computation:
- You populate a link collection — a pre-computed mapping of source values to target values (one-to-many supported).
- At join time, link_glue takes each source row, looks up its target values in the table, and emits a JoinedResult for each target row matching those values.
- No embedding, no threshold: the match is deterministic and cheap (dict lookup per source row).
# Link data: source_value → [target_value, ...]
# {
# "EMP001": ["machine learning", "deep learning"], # one-to-many is fine
# "EMP002": ["databases"],
# "EMP003": ["cloud computing"],
# }
# Join
results = hb.query("employees").filter(employee_id="EMP001").join("expertise")
# Emits one JoinedResult per expertise row matching "machine learning" OR "deep learning"
The runtime cost of a link join is O(sources × avg_fan_out) — no vector math, no re-encoding. That makes links the cheapest cross-collection join at query time, once the link collection is populated.
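The lookup described above can be sketched in a few lines of plain Python. This is an illustrative stand-in rather than the actual link_glue code: `link_join`, the row dictionaries, and the names and skills in the sample data are invented for the example.

```python
from collections import defaultdict


def link_join(source_rows, target_rows, links, source_key, target_key):
    """Emit (source_row, target_row) pairs using table lookups only."""
    # Index target rows by the linked field once, so each lookup is a dict access.
    targets_by_value = defaultdict(list)
    for row in target_rows:
        targets_by_value[row[target_key]].append(row)

    joined = []
    for src in source_rows:
        # Fan-out: every target value linked to this source value.
        for value in links.get(src[source_key], []):
            for tgt in targets_by_value.get(value, []):
                joined.append((src, tgt))
    return joined


links = {
    "EMP001": ["machine learning", "deep learning"],
    "EMP002": ["databases"],
}
employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [
    {"topic": "machine learning", "skill": "PyTorch"},
    {"topic": "deep learning", "skill": "CNNs"},
    {"topic": "databases", "skill": "PostgreSQL"},
]

pairs = link_join(employees, expertise, links, "employee_id", "topic")
```

The work done in the inner loops grows with the number of source rows times their fan-out, and no vectors are touched at any point.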
Links as First-Class Citizens
A link collection isn't a hidden implementation detail — it's a regular HyperBinder collection you can query, update, and version independently of either end.
Bidirectional. Joins work in either direction off the same declaration:
# Forward: employees → expertise
results = hb.query("employees").search("Alice").join("expertise")
# Reverse: expertise → employees
results = hb.query("expertise").search("machine learning").join("employees")
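One way to picture why a single declaration can serve both directions: a source→targets mapping can always be inverted into a target→sources mapping. The sketch below shows that inversion in plain Python; `invert_links` is a hypothetical helper, and whether HyperBinder materializes a reverse index this way is not stated in this guide.

```python
from collections import defaultdict


def invert_links(forward):
    """Build the target -> sources view from a source -> targets mapping."""
    reverse = defaultdict(list)
    for source, targets in forward.items():
        for target in targets:
            reverse[target].append(source)
    return dict(reverse)


forward = {
    "EMP001": ["machine learning", "deep learning"],
    "EMP002": ["machine learning"],
}
reverse = invert_links(forward)
```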
One-to-many. A single source value can link to many target values:
LinkSet.from_pairs([
    ("EMP001", "machine learning"),
    ("EMP001", "deep learning"),
    ("EMP001", "ml-ops"),
])
Weighted. Links carry an optional weight that aggregates into the join score:
import pandas as pd

links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP001"],
    "topic": ["machine learning", "ml-ops"],
    "confidence": [0.95, 0.60],
})
hb.populate_links(ix, links_df, "emp_id", "topic", weight_column="confidence")
Updatable. Links can be refreshed as ground truth changes: reassigning EMP001 to a new topic is a write to the link collection, not a mass update of employees or expertise. Note that populate_links() fully replaces the link set for that intersection, so each call should carry the complete, updated set of links rather than only the changed rows.
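The full-replace behavior is easy to trip over, so here is a small sketch of what it implies. `LinkStore` is a hypothetical in-memory stand-in written for this example; it is not part of HyperBinder and only mimics the replace semantics described above.

```python
class LinkStore:
    """Hypothetical in-memory stand-in for a link collection."""

    def __init__(self):
        self._links = {}  # intersection name -> list of (source, target) pairs

    def populate(self, intersection, pairs):
        # Full replace: links previously stored for this intersection are dropped.
        self._links[intersection] = list(pairs)

    def targets(self, intersection, source):
        return [t for s, t in self._links.get(intersection, []) if s == source]


store = LinkStore()
store.populate("emp_topic", [("EMP001", "machine learning"), ("EMP002", "databases")])

# Reassign EMP001. The new call restates EMP002's link, otherwise it would be lost.
store.populate("emp_topic", [("EMP001", "cloud computing"), ("EMP002", "databases")])

emp001_topics = store.targets("emp_topic", "EMP001")
emp002_topics = store.targets("emp_topic", "EMP002")
```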
Introspectable via metadata. Each Link can carry arbitrary metadata (source system, annotation timestamp, reviewer) — useful for auditing where a given correspondence came from.
See Intersections API for the full reference on Link, LinkSet, populate_links(), and intersect_flexible().
When to Reach for Links
Links are the right tool when:
- The fields have different encodings and no amount of fine-tuning will make similarity work (EXACT ↔ SEMANTIC, HIERARCHICAL ↔ NUMERIC, etc.).
- Ground truth lives elsewhere — an HR system, a labeled dataset, a manual curation process owns the correspondence.
- You want versioning — links update independently; you can ship new correspondences without re-ingesting either collection.
- Matches need to be many-to-many without denormalizing either side.
Alternatives to consider first:
- Denormalization — if the correspondence is stable and small, just copy the target value into the source collection at ingest. Zero query-time overhead.
- A shared encoding: if both fields could be made EXACT (both IDs) or both SEMANTIC (both natural-language descriptions), STRICT mode is simpler.
- Multihop: for relationships within one collection, the multihop primitive is more direct than intersections.
Reach for links when the relationship is genuinely external to both collections and inference can't bridge it.
Chaining Joins
Connect multiple collections in one query:
results = (
    hb.query("employees")
    .search("senior engineer")
    .join("expertise")   # employees → expertise
    .join("projects")    # expertise → projects
    .join("budgets")     # projects → budgets
)
flowchart LR
E[Employees] --> X[Expertise] --> P[Projects] --> B[Budgets]
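Conceptually, each .join() takes the rows reached so far and extends them by one hop. The plain-Python sketch below illustrates that idea with identity-style hops only; the `join` helper, the field names (`subject`, `required_skill`), and the sample rows are assumptions for illustration, not HyperBinder's query engine.

```python
def join(chains, target_rows, source_key, target_key):
    """Extend each chain by one hop: match the chain's last row against target rows."""
    extended = []
    for chain in chains:
        last = chain[-1]
        for tgt in target_rows:
            if last[source_key] == tgt[target_key]:
                extended.append(chain + [tgt])
    return extended


employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [
    {"subject": "EMP001", "skill": "python"},
    {"subject": "EMP002", "skill": "sql"},
]
projects = [
    {"required_skill": "python", "project": "recsys"},
    {"required_skill": "go", "project": "gateway"},
]

chains = [[e] for e in employees]                              # start of the query
chains = join(chains, expertise, "employee_id", "subject")     # employees → expertise
chains = join(chains, projects, "skill", "required_skill")     # expertise → projects
```

Each hop only sees the last row of a chain, which is why the order of the joins matters.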
The Bridge Pattern
A powerful architecture for connecting heterogeneous data:
flowchart LR
D["Documents<br/>(fuzzy text)"] <-->|semantic| K["Knowledge Graph<br/>(entities)"] <-->|identity| T["Tables<br/>(exact data)"]
The Knowledge Graph acts as an index: semantic search finds the relevant entities, and those entities then join by identity to exact structured records.
Example: Find budget info for projects mentioned in emails:
hb.intersect("emails.content", "projects.description", relation="semantic")
hb.intersect("projects.project_id", "budgets.project_id", relation="identity")
results = (
    hb.query("emails")
    .search("Q2 budget concerns")
    .join("projects")   # semantic: email text → project
    .join("budgets")    # identity: project ID → budget
)
Match Quality
Each joined result has a status:
| Status | Meaning |
|---|---|
| MATCHED | Confident match found |
| NULL | Ambiguous (multiple close candidates) |
| NO_MATCH | No match above threshold |
for r in results:
    if r.is_matched:
        # Safe to use r.target
        print(f"{r.source['name']} → {r.target['skill']}")
    elif r.is_no_match:
        print(f"{r.source['name']} has no matching expertise")
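How the three statuses might be derived from candidate scores can be sketched as follows. The decision rule, the `classify` name, and the threshold and margin values are assumptions chosen to match the table's wording; the guide does not specify how HyperBinder actually computes a status.

```python
def classify(scores, threshold=0.75, margin=0.05):
    """Map candidate similarity scores to a match status (assumed rule)."""
    above = sorted((s for s in scores if s >= threshold), reverse=True)
    if not above:
        return "NO_MATCH"  # nothing clears the threshold
    if len(above) > 1 and above[0] - above[1] < margin:
        return "NULL"      # several close candidates: ambiguous
    return "MATCHED"       # one clear winner


clear_winner = classify([0.90, 0.40])
ambiguous = classify([0.90, 0.88])
nothing = classify([0.30])
```

Under this rule a NULL result is neither a match nor a miss, so code that branches only on matched and no-match rows, like the loop above, silently skips ambiguous ones.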
When to Use Intersections
Use intersections when:
- Data naturally lives in separate collections (different schemas)
- You need to answer questions spanning multiple data types
- You want to connect fuzzy (semantic) and exact (symbolic) data
Don't use intersections when:
- All data fits in one collection
- Relationships are within the same collection (use multihop instead)
Next Steps
- Intersections API - Full reference
- Intersections Tutorial - Step-by-step guide
- Enterprise Knowledge Example - Complex multi-collection patterns