Intersections Tutorial¶
A step-by-step guide to understanding how cross-collection joins work in HyperBinder.
What you'll learn:
- Declaring intersections between collections
- How the join mechanics work internally
- Working with JoinedResult and match status
- Output formats and filtering
The Problem¶
You have two collections:
- employees: {employee_id, name, dept}
- expertise: {subject, skill, level}
They're connected: employees.employee_id links to expertise.subject.
How do you query employees and get their skills in one operation?
Step 1: Declare the Intersection¶
An intersection declares the relationship between two collections:
from hybi import HyperBinder
hb = HyperBinder("http://localhost:8000")
# Declare: employees.employee_id links to expertise.subject
intersection = hb.intersect("employees.employee_id", "expertise.subject")
This tells HyperBinder:
"When I query 'employees' and join to 'expertise', match rows where
employees.employee_id = expertise.subject"
The intersection is stored in a registry and can be reused.
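One way to picture the registry is a lookup table keyed by the pair of collection names. The sketch below is a conceptual model only, not HyperBinder's implementation; the names IntersectionSpec and IntersectionRegistry are hypothetical and exist just for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntersectionSpec:
    """Hypothetical record of one declared relationship."""
    source_collection: str
    source_field: str
    target_collection: str
    target_field: str


class IntersectionRegistry:
    """Hypothetical registry keyed by (source collection, target collection)."""

    def __init__(self):
        self._specs = {}

    def declare(self, source: str, target: str) -> IntersectionSpec:
        # "employees.employee_id" -> ("employees", "employee_id")
        src_coll, src_field = source.split(".", 1)
        tgt_coll, tgt_field = target.split(".", 1)
        spec = IntersectionSpec(src_coll, src_field, tgt_coll, tgt_field)
        self._specs[(src_coll, tgt_coll)] = spec
        return spec

    def lookup(self, source_collection: str, target_collection: str) -> IntersectionSpec:
        # Raises KeyError when no intersection was declared for this pair.
        return self._specs[(source_collection, target_collection)]


registry = IntersectionRegistry()
registry.declare("employees.employee_id", "expertise.subject")
spec = registry.lookup("employees", "expertise")
print(spec.source_field, "=", spec.target_field)  # employee_id = subject
```

Declaring once and looking up by collection pair is what would let a later .join("expertise") find the right fields without repeating them.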
Step 2: Sample Data¶
# Employees
employees = [
    {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"},
    {"employee_id": "EMP002", "name": "Bob", "dept": "Engineering"},
    {"employee_id": "EMP003", "name": "Charlie", "dept": "Sales"},
]
# Expertise (skills held by employees)
expertise = [
    {"subject": "EMP001", "skill": "Python", "level": "Expert"},
    {"subject": "EMP001", "skill": "Rust", "level": "Intermediate"},
    {"subject": "EMP002", "skill": "JavaScript", "level": "Expert"},
    {"subject": "EMP002", "skill": "Python", "level": "Beginner"},
    # Note: EMP003 has no expertise records
]
Step 3: Understanding the Join Mechanics¶
When you use .join(), HyperBinder matches rows based on the declared intersection. With the sample data from Step 2, this is what happens:
| Employee | Matching Expertise | Status |
|---|---|---|
| Alice (EMP001) | Python (Expert), Rust (Intermediate) | MATCHED |
| Bob (EMP002) | JavaScript (Expert), Python (Beginner) | MATCHED |
| Charlie (EMP003) | (none) | NO_MATCH |
Alice and Bob each match multiple expertise rows, so they appear multiple times in the results.
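The fan-out in the table can be reproduced with a plain-Python equality join over the sample data. This is a minimal sketch of the idea under the assumption of exact value equality; the function equality_join is hypothetical and is not how HyperBinder performs the match internally.

```python
from collections import defaultdict

employees = [
    {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"},
    {"employee_id": "EMP002", "name": "Bob", "dept": "Engineering"},
    {"employee_id": "EMP003", "name": "Charlie", "dept": "Sales"},
]
expertise = [
    {"subject": "EMP001", "skill": "Python", "level": "Expert"},
    {"subject": "EMP001", "skill": "Rust", "level": "Intermediate"},
    {"subject": "EMP002", "skill": "JavaScript", "level": "Expert"},
    {"subject": "EMP002", "skill": "Python", "level": "Beginner"},
]


def equality_join(sources, targets, source_field, target_field):
    """Return (source, target, status) tuples, one per matching target row."""
    # Index target rows by their join value so each source lookup is O(1).
    index = defaultdict(list)
    for row in targets:
        index[row[target_field]].append(row)

    joined = []
    for src in sources:
        matches = index.get(src[source_field], [])
        if not matches:
            # Keep the source row so the caller can see it had no match.
            joined.append((src, None, "NO_MATCH"))
        for tgt in matches:
            joined.append((src, tgt, "MATCHED"))
    return joined


rows = equality_join(employees, expertise, "employee_id", "subject")
for src, tgt, status in rows:
    print(status, src["name"], tgt["skill"] if tgt else "-")
```

Three source rows produce five joined rows: two each for Alice and Bob, and one NO_MATCH row for Charlie.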
Step 4: Working with JoinedResult¶
Each result has helpful properties:
for result in joined_results:
    # Check match status
    if result.is_matched:
        # Confident match - safe to access target
        print(f"MATCHED: {result.source['name']} knows {result.target['skill']}")
    elif result.is_null:
        # Ambiguous match (multiple close candidates with similar scores)
        print(f"AMBIGUOUS: {result.source['name']} - unclear match")
    elif result.is_no_match:
        # No matching row found
        print(f"NO MATCH: {result.source['name']} - no expertise on file")
Output:
MATCHED: Alice knows Python
MATCHED: Alice knows Rust
MATCHED: Bob knows JavaScript
MATCHED: Bob knows Python
NO MATCH: Charlie - no expertise on file
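To make the three status properties concrete, here is a small stand-in object with the same property names. JoinedRow is a hypothetical class written for this illustration, and representing an unmatched target as None is an assumption; the real JoinedResult may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinedRow:
    """Hypothetical stand-in for JoinedResult; not the library's class."""
    source: dict
    target: Optional[dict]  # assumed to be None when nothing matched
    status: str  # "MATCHED", "NULL", or "NO_MATCH"

    @property
    def is_matched(self) -> bool:
        return self.status == "MATCHED"

    @property
    def is_null(self) -> bool:
        return self.status == "NULL"

    @property
    def is_no_match(self) -> bool:
        return self.status == "NO_MATCH"


rows = [
    JoinedRow({"name": "Alice"}, {"skill": "Python"}, "MATCHED"),
    JoinedRow({"name": "Charlie"}, None, "NO_MATCH"),
]
lines = []
for row in rows:
    if row.is_matched:
        lines.append(f"MATCHED: {row.source['name']} knows {row.target['skill']}")
    elif row.is_no_match:
        lines.append(f"NO MATCH: {row.source['name']} - no expertise on file")
print("\n".join(lines))
```

Checking the status before touching the target avoids indexing into a target that is not there.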
Match Status Values¶
| Status | Meaning | When it happens |
|---|---|---|
| MATCHED | Confident match found | Clear best match above threshold |
| NULL | Ambiguous match | Multiple candidates with similar scores |
| NO_MATCH | No match found | No candidates above threshold |
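The table's three rules can be sketched as a small classifier over candidate scores. The threshold and margin values below are hypothetical, as is the classify function; the sketch covers similarity-scored matching where a single best candidate is expected, not the exact-equality fan-out shown in Step 3.

```python
def classify(scores, threshold=0.8, margin=0.05):
    """Map candidate scores to a status. threshold and margin are illustrative."""
    # Keep only candidates that clear the threshold, best first.
    above = sorted((s for s in scores if s >= threshold), reverse=True)
    if not above:
        return "NO_MATCH"  # no candidates above threshold
    if len(above) > 1 and above[0] - above[1] < margin:
        return "NULL"  # top candidates are too close to call
    return "MATCHED"  # clear best match above threshold


print(classify([0.95, 0.40]))  # MATCHED
print(classify([0.91, 0.90]))  # NULL
print(classify([0.30, 0.10]))  # NO_MATCH
```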
Step 5: Filtering Results¶
Wrap results in JoinedResultSet for filtering utilities:
from hybi.compose.intersections import JoinedResultSet
result_set = JoinedResultSet(
    results=joined_results,
    intersection=intersection,
    source_count=len(employees),
    target_count=len(expertise),
)
# Get only matched results
matched = result_set.filter_matched()
print(f"Matched: {len(matched)} of {len(result_set)}")
# Statistics
print(f"Matched count: {result_set.matched_count}")
print(f"Null count: {result_set.null_count}")
print(f"No match count: {result_set.no_match_count}")
print(f"Expansion ratio: {result_set.expansion_ratio:.2f}x")
Output:
The expansion ratio shows fan-out: 3 employees became 5 results (some matched multiple expertise rows).
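The same statistics can be worked out by hand from the Step 4 statuses. This sketch assumes the expansion ratio is the number of joined results divided by the number of source rows, which is consistent with "3 employees became 5 results" but is not confirmed as the library's exact definition.

```python
from collections import Counter

# Statuses from the Step 4 output: four matches and one row with no match.
statuses = ["MATCHED", "MATCHED", "MATCHED", "MATCHED", "NO_MATCH"]
source_count = 3  # Alice, Bob, Charlie

counts = Counter(statuses)
# Assumed definition: joined results per source row.
expansion_ratio = len(statuses) / source_count

print(f"Matched count: {counts['MATCHED']}")      # 4
print(f"Null count: {counts['NULL']}")            # 0
print(f"No match count: {counts['NO_MATCH']}")    # 1
print(f"Expansion ratio: {expansion_ratio:.2f}x")  # 1.67x
```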
Step 6: Output Formats¶
JoinedResult supports multiple access patterns:
Direct Access (Recommended)¶
Read fields straight from the result: result.source["field"] for the source row and result.target["field"] for the matched row, as in Step 4.
Flat Dictionary¶
Keys are prefixed with collection names:
result.to_flat()
# {
#     'employees.employee_id': 'EMP001',
#     'employees.name': 'Alice',
#     'employees.dept': 'Engineering',
#     'expertise.subject': 'EMP001',
#     'expertise.skill': 'Python',
#     'expertise.level': 'Expert',
#     '_score': 1.0,
#     '_status': 'MATCHED'
# }
Nested Dictionary¶
Grouped by collection:
result.to_nested()
# {
#     'employees': {'employee_id': 'EMP001', 'name': 'Alice', 'dept': 'Engineering'},
#     'expertise': {'subject': 'EMP001', 'skill': 'Python', 'level': 'Expert'}
# }
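Both shapes can be derived from the same source and target rows. The helper functions below are hypothetical and work on plain dicts; they mirror the key layout shown above but are not the library's to_flat() and to_nested() methods.

```python
def to_flat(source_name, source, target_name, target, score, status):
    """Prefix every key with its collection name and append score and status."""
    flat = {f"{source_name}.{key}": value for key, value in source.items()}
    if target is not None:
        flat.update({f"{target_name}.{key}": value for key, value in target.items()})
    flat["_score"] = score
    flat["_status"] = status
    return flat


def to_nested(source_name, source, target_name, target):
    """Group fields under their collection name."""
    return {
        source_name: dict(source),
        target_name: dict(target) if target is not None else None,
    }


source = {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"}
target = {"subject": "EMP001", "skill": "Python", "level": "Expert"}

flat = to_flat("employees", source, "expertise", target, 1.0, "MATCHED")
nested = to_nested("employees", source, "expertise", target)
print(flat["expertise.skill"], nested["expertise"]["skill"])  # Python Python
```

The flat form suits tabular export, since the prefixes keep same-named fields from colliding; the nested form keeps each row intact.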
Step 7: Using .join() in Practice¶
With the intersection declared, use .join() in queries:
# Query employees, join to expertise
results = (
    hb.query("employees", schema=employee_schema)
    .search("engineering")
    .join("expertise")
)
for r in results:
    if r.is_matched:
        print(f"{r.source['name']} knows {r.target['skill']}")
Chaining Joins¶
Join through multiple collections:
results = (
    hb.query("employees")
    .search("senior engineer")
    .join("expertise")  # employees → expertise
    .join("projects")   # expertise → projects
    .join("budgets")    # projects → budgets
)
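Conceptually, each chained join takes the rows produced so far and extends them through the next intersection. The sketch below models that with plain lists; the projects data, the required_skill field, and the join_hop function are all hypothetical, and for brevity it drops unmatched rows instead of reporting NO_MATCH.

```python
employees = [{"employee_id": "EMP001", "name": "Alice"}]
expertise = [
    {"subject": "EMP001", "skill": "Python"},
    {"subject": "EMP001", "skill": "Rust"},
]
# Hypothetical third collection, linked to expertise by skill.
projects = [{"required_skill": "Python", "project": "Data Pipeline"}]


def join_hop(paths, targets, source_field, target_field):
    """Extend each path with every target row matching the path's last row."""
    extended = []
    for path in paths:
        last = path[-1]
        for tgt in targets:
            if tgt[target_field] == last[source_field]:
                extended.append(path + (tgt,))
    return extended


paths = [(emp,) for emp in employees]
paths = join_hop(paths, expertise, "employee_id", "subject")  # employees → expertise
paths = join_hop(paths, projects, "skill", "required_skill")  # expertise → projects

for emp, exp, proj in paths:
    print(emp["name"], "->", exp["skill"], "->", proj["project"])
```

Each hop joins from the previous target, which is why the order of the .join() calls matters.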
Error Handling¶
No Intersection Declared¶
from hybi.compose.intersections import NoIntersectionError
try:
    results = hb.query("employees").search("...").join("unknown_collection")
except NoIntersectionError as e:
    print(f"No intersection defined between employees and unknown_collection: {e}")
Circular Joins¶
from hybi.compose.intersections import CircularJoinError
try:
    results = query.join("A").join("B").join("A")  # Cycle!
except CircularJoinError as e:
    print(f"Detected cycle: {e.path}")
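A cycle check of this kind can be sketched by tracking which collections a join chain has already visited. CycleError and check_join_path are hypothetical names for this illustration; the library's CircularJoinError may detect cycles differently.

```python
class CycleError(Exception):
    """Hypothetical error carrying the join path that closed a cycle."""

    def __init__(self, path):
        super().__init__(" -> ".join(path))
        self.path = path


def check_join_path(start, joins):
    """Return the visited path, or raise CycleError on a repeated collection."""
    visited = [start]
    for target in joins:
        if target in visited:
            raise CycleError(visited + [target])
        visited.append(target)
    return visited


detected_path = None
try:
    check_join_path("employees", ["A", "B", "A"])
except CycleError as e:
    detected_path = e.path
    print("Detected cycle:", e.path)
```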
Complete Example¶
#!/usr/bin/env python3
"""Intersections Tutorial: Complete Example"""
from hybi import HyperBinder
from hybi.compose import Triple, Field, Encoding
# Connect
hb = HyperBinder("http://localhost:8000")
# Define schemas
employee_schema = Triple(
    subject=Field("employee_id", encoding=Encoding.EXACT),
    predicate=Field("role"),
    object=Field("department"),
)
expertise_schema = Triple(
    subject=Field("employee_id", encoding=Encoding.EXACT),
    predicate=Field("skill"),
    object=Field("level"),
)
# Ingest data (employees_df and expertise_df are assumed to be DataFrames loaded beforehand)
hb.ingest(employees_df, collection="employees", schema=employee_schema)
hb.ingest(expertise_df, collection="expertise", schema=expertise_schema)
# Declare intersection
hb.intersect("employees.employee_id", "expertise.employee_id")
# Query with join
results = (
    hb.query("employees", schema=employee_schema)
    .find(department="Engineering")
    .join("expertise")
)
# Process results
for r in results.filter_matched():
    print(f"{r.source['employee_id']}: {r.target['skill']} ({r.target['level']})")
Step 8: Cross-Encoding Joins (Flexible Mode)¶
What if your fields have different encoding types? For example:
- employees.employee_id uses EXACT encoding
- expertise.topic uses SEMANTIC encoding
By default, these can't intersect—their encodings are incompatible. Flexible mode solves this with explicit link bindings.
The Problem¶
# This won't work in strict mode:
# EXACT (employee_id) ↔ SEMANTIC (topic) = Encoding mismatch!
hb.intersect("employees.employee_id", "expertise.topic") # Error!
The Solution: Flexible Intersections¶
import pandas as pd

# 1. Declare a FLEXIBLE intersection
ix = hb.intersect_flexible("employees.employee_id", "expertise.topic")
# 2. Provide explicit link mappings
links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP002", "EMP003"],
    "topic": ["machine learning", "databases", "cloud computing"]
})
hb.populate_links(ix, links_df, "emp_id", "topic")
# 3. Now the join works!
results = (
    hb.query("employees")
    .filter(employee_id="EMP001")
    .join("expertise")
)
for r in results:
    if r.is_matched:
        print(f"{r.source['employee_id']} → {r.target['topic']}")
# EMP001 → machine learning
How Links Work¶
Links are bidirectional value mappings:
| Source (employee_id) | Target (topic) |
|---|---|
| EMP001 | machine learning |
| EMP002 | databases |
| EMP003 | cloud computing |
The join uses these mappings instead of encoding-based matching:
- Query employees, get source values (EMP001, EMP002)
- Look up link mappings → ["machine learning", "databases"]
- Match against target results
- Return joined rows
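The four steps above can be sketched in plain Python. This is an illustration of the lookup idea under the assumption that links are simple (source value, target value) pairs; join_via_links is a hypothetical function, not HyperBinder's implementation.

```python
employees = [{"employee_id": "EMP001"}, {"employee_id": "EMP002"}]
expertise = [
    {"topic": "machine learning"},
    {"topic": "databases"},
    {"topic": "cloud computing"},
]
links = [
    ("EMP001", "machine learning"),
    ("EMP002", "databases"),
    ("EMP003", "cloud computing"),
]


def join_via_links(sources, targets, links, source_field, target_field):
    """Join sources to targets through explicit value-to-value links."""
    # Build the lookup: source value -> set of linked target values.
    forward = {}
    for source_value, target_value in links:
        forward.setdefault(source_value, set()).add(target_value)

    joined = []
    for src in sources:
        wanted = forward.get(src[source_field], set())
        for tgt in targets:
            if tgt[target_field] in wanted:
                joined.append((src, tgt))
    return joined


pairs = join_via_links(employees, expertise, links, "employee_id", "topic")
for src, tgt in pairs:
    print(src["employee_id"], "→", tgt["topic"])
```

Only EMP001 and EMP002 are in the queried source rows, so the EMP003 link is never consulted.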
One-to-Many Links¶
A single source can link to multiple targets:
links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP001", "EMP002"],  # EMP001 appears twice!
    "topic": ["machine learning", "AI", "databases"]
})
hb.populate_links(ix, links_df, "emp_id", "topic")
# EMP001 now matches BOTH "machine learning" AND "AI"
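A one-to-many, bidirectional mapping can be modelled as two indexes built from the same link rows. This is a sketch under that assumption; how HyperBinder actually stores links is not specified here.

```python
from collections import defaultdict

link_rows = [
    ("EMP001", "machine learning"),
    ("EMP001", "AI"),
    ("EMP002", "databases"),
]

forward = defaultdict(list)  # emp_id -> topics
reverse = defaultdict(list)  # topic -> emp_ids
for emp_id, topic in link_rows:
    forward[emp_id].append(topic)
    reverse[topic].append(emp_id)

print(forward["EMP001"])     # ['machine learning', 'AI']
print(reverse["databases"])  # ['EMP002']
```

Keeping both directions means a join starting from either collection can resolve its counterpart values with a single lookup.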
When to Use Flexible Mode¶
| Use Flexible Mode | Use Strict Mode |
|---|---|
| EXACT↔SEMANTIC fields | Same encoding types |
| Explicit value mappings needed | Natural equality works |
| Foreign-key-like relationships | Self-joining collections |
Key Takeaways¶
- DECLARE: hb.intersect("source.field", "target.field")
  Tells HyperBinder how collections relate
- JOIN: hb.query("source").search("...").join("target")
  Executes the cross-collection query
- ACCESS: result.source["field"], result.target["field"]
  Direct access to matched data
- CHECK: result.is_matched, result.is_null, result.is_no_match
  Know the quality of each match
- FILTER: result_set.filter_matched()
  Get only confident matches
- FLEXIBLE MODE: hb.intersect_flexible() + hb.populate_links()
  Enable cross-encoding joins with explicit mappings
Next Steps¶
- Intersections API Reference - Full API documentation
- Enterprise Knowledge Example - Complex multi-collection queries
- The Compose System - Understanding the full architecture