
Intersections Tutorial

A step-by-step guide to understanding how cross-collection joins work in HyperBinder.

What you'll learn:

  • Declaring intersections between collections
  • How the join mechanics work internally
  • Working with JoinedResult and match status
  • Output formats and filtering

The Problem

You have two collections:

  • employees: {employee_id, name, dept}
  • expertise: {subject, skill, level}

They're connected: employees.employee_id links to expertise.subject.

How do you query employees and get their skills in one operation?


Step 1: Declare the Intersection

An intersection declares the relationship between two collections:

from hybi import HyperBinder

hb = HyperBinder("http://localhost:8000")

# Declare: employees.employee_id links to expertise.subject
intersection = hb.intersect("employees.employee_id", "expertise.subject")

This tells HyperBinder:

"When I query 'employees' and join to 'expertise', match rows where employees.employee_id = expertise.subject"

The intersection is stored in a registry and can be reused.
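
To make the registry idea concrete, here is a minimal sketch in plain Python of how such a lookup could behave. It is an illustration under assumptions, not HyperBinder's implementation: the `Intersection` dataclass, the `IntersectionRegistry` class, and its `declare` and `lookup` methods are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intersection:
    """Hypothetical record of one declared link between two fields."""
    source_collection: str
    source_field: str
    target_collection: str
    target_field: str


class IntersectionRegistry:
    """Toy registry keyed by (source collection, target collection)."""

    def __init__(self):
        self._by_pair = {}

    def declare(self, source: str, target: str) -> Intersection:
        # "employees.employee_id" -> ("employees", "employee_id")
        src_coll, src_field = source.split(".", 1)
        tgt_coll, tgt_field = target.split(".", 1)
        ix = Intersection(src_coll, src_field, tgt_coll, tgt_field)
        self._by_pair[(src_coll, tgt_coll)] = ix
        return ix

    def lookup(self, source_collection: str, target_collection: str) -> Intersection:
        # Raises KeyError when no intersection was declared for the pair.
        return self._by_pair[(source_collection, target_collection)]


registry = IntersectionRegistry()
declared = registry.declare("employees.employee_id", "expertise.subject")
found = registry.lookup("employees", "expertise")
print(found.source_field, "=", found.target_field)
```

Keying by the collection pair is what would let a later `.join("expertise")` find the right fields without restating them.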


Step 2: Sample Data

# Employees
employees = [
    {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"},
    {"employee_id": "EMP002", "name": "Bob", "dept": "Engineering"},
    {"employee_id": "EMP003", "name": "Charlie", "dept": "Sales"},
]

# Expertise (skills held by employees)
expertise = [
    {"subject": "EMP001", "skill": "Python", "level": "Expert"},
    {"subject": "EMP001", "skill": "Rust", "level": "Intermediate"},
    {"subject": "EMP002", "skill": "JavaScript", "level": "Expert"},
    {"subject": "EMP002", "skill": "Python", "level": "Beginner"},
    # Note: EMP003 has no expertise records
]

Step 3: Understanding the Join Mechanics

When you use .join(), HyperBinder matches rows based on the declared intersection:

What happens:

| Employee | Matching Expertise | Status |
|---|---|---|
| Alice (EMP001) | Python (Expert), Rust (Intermediate) | MATCHED |
| Bob (EMP002) | JavaScript (Expert), Python (Beginner) | MATCHED |
| Charlie (EMP003) | (none) | NO_MATCH |

Alice and Bob each match multiple expertise rows, so they appear multiple times in the results.
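
The table above can be reproduced with an ordinary equality join. The sketch below is plain Python over the sample data from Step 2. It only illustrates the fan-out and the NO_MATCH case and says nothing about how HyperBinder scores matches internally. The `equality_join` helper is a hypothetical name.

```python
employees = [
    {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"},
    {"employee_id": "EMP002", "name": "Bob", "dept": "Engineering"},
    {"employee_id": "EMP003", "name": "Charlie", "dept": "Sales"},
]
expertise = [
    {"subject": "EMP001", "skill": "Python", "level": "Expert"},
    {"subject": "EMP001", "skill": "Rust", "level": "Intermediate"},
    {"subject": "EMP002", "skill": "JavaScript", "level": "Expert"},
    {"subject": "EMP002", "skill": "Python", "level": "Beginner"},
]


def equality_join(sources, targets, source_field, target_field):
    """Return (source, target, status) tuples, one per matching pair."""
    # Index target rows by join key so each source row is a single lookup.
    by_key = {}
    for row in targets:
        by_key.setdefault(row[target_field], []).append(row)

    joined = []
    for src in sources:
        matches = by_key.get(src[source_field], [])
        if not matches:
            # Keep the source row, with no target attached.
            joined.append((src, None, "NO_MATCH"))
        for tgt in matches:
            joined.append((src, tgt, "MATCHED"))
    return joined


rows = equality_join(employees, expertise, "employee_id", "subject")
for src, tgt, status in rows:
    skill = tgt["skill"] if tgt else "(none)"
    print(src["name"], skill, status)
```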


Step 4: Working with JoinedResult

Each result has helpful properties:

# joined_results holds the JoinedResult objects returned by a .join() query
for result in joined_results:
    # Check match status
    if result.is_matched:
        # Confident match - safe to access target
        print(f"MATCHED: {result.source['name']} knows {result.target['skill']}")

    elif result.is_null:
        # Ambiguous match (multiple close candidates with similar scores)
        print(f"AMBIGUOUS: {result.source['name']} - unclear match")

    elif result.is_no_match:
        # No matching row found
        print(f"NO MATCH: {result.source['name']} - no expertise on file")

Output:

MATCHED: Alice knows Python
MATCHED: Alice knows Rust
MATCHED: Bob knows JavaScript
MATCHED: Bob knows Python
NO MATCH: Charlie - no expertise on file

Match Status Values

| Status | Meaning | When it happens |
|---|---|---|
| MATCHED | Confident match found | Clear best match above threshold |
| NULL | Ambiguous match | Multiple candidates with similar scores |
| NO_MATCH | No match found | No candidates above threshold |
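
One way to read this table is as a decision over scored candidates. The sketch below shows that logic in plain Python. The threshold of 0.8 and the ambiguity margin of 0.05 are made-up values chosen for illustration, and `classify` is a hypothetical helper; this tutorial does not state the actual values or rule HyperBinder uses.

```python
def classify(scores, threshold=0.8, margin=0.05):
    """Map candidate similarity scores to a match status (illustrative rule)."""
    above = sorted((s for s in scores if s >= threshold), reverse=True)
    if not above:
        return "NO_MATCH"   # no candidate above the threshold
    if len(above) > 1 and above[0] - above[1] < margin:
        return "NULL"       # top candidates too close to call
    return "MATCHED"        # clear best match


print(classify([0.95, 0.40]))
print(classify([0.91, 0.90]))
print(classify([0.30, 0.10]))
```

This rule is only a reading of the fuzzy-scoring case. In the Step 3 example, Alice's two exact-key matches are both reported as MATCHED rather than NULL, so the real rule must treat exact fan-out differently from this sketch.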

Step 5: Filtering Results

Wrap results in JoinedResultSet for filtering utilities:

from hybi.compose.intersections import JoinedResultSet

result_set = JoinedResultSet(
    results=joined_results,
    intersection=intersection,
    source_count=len(employees),
    target_count=len(expertise),
)

# Get only matched results
matched = result_set.filter_matched()
print(f"Matched: {len(matched)} of {len(result_set)}")

# Statistics
print(f"Matched count: {result_set.matched_count}")
print(f"Null count: {result_set.null_count}")
print(f"No match count: {result_set.no_match_count}")
print(f"Expansion ratio: {result_set.expansion_ratio:.2f}x")

Output:

Matched: 4 of 5
Matched count: 4
Null count: 0
No match count: 1
Expansion ratio: 1.67x

The expansion ratio shows fan-out: 3 employees became 5 results (some matched multiple expertise rows).
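
The figures above follow directly from the status counts. As a quick check in plain Python (not the JoinedResultSet API), this assumes the expansion ratio is the number of results divided by the number of source rows, which is what the 3-to-5 example implies:

```python
from collections import Counter

# One status per joined result, as in the Step 4 output.
statuses = ["MATCHED", "MATCHED", "MATCHED", "MATCHED", "NO_MATCH"]
source_count = 3  # Alice, Bob, Charlie

counts = Counter(statuses)
expansion_ratio = len(statuses) / source_count

print(f"Matched: {counts['MATCHED']} of {len(statuses)}")
print(f"No match count: {counts['NO_MATCH']}")
print(f"Expansion ratio: {expansion_ratio:.2f}x")
```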


Step 6: Output Formats

JoinedResult supports multiple access patterns. The simplest is direct access to the source and target rows:

result.source['name']     # → "Alice"
result.target['skill']    # → "Python"

Flat Dictionary

Keys are prefixed with collection names:

result.to_flat()
# {
#   'employees.employee_id': 'EMP001',
#   'employees.name': 'Alice',
#   'employees.dept': 'Engineering',
#   'expertise.subject': 'EMP001',
#   'expertise.skill': 'Python',
#   'expertise.level': 'Expert',
#   '_score': 1.0,
#   '_status': 'MATCHED'
# }

Nested Dictionary

Grouped by collection:

result.to_nested()
# {
#   'employees': {'employee_id': 'EMP001', 'name': 'Alice', 'dept': 'Engineering'},
#   'expertise': {'subject': 'EMP001', 'skill': 'Python', 'level': 'Expert'}
# }
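
For intuition, both shapes can be built from a source row and a target row in a few lines of plain Python. This is a sketch of the data layout only: `to_flat` and `to_nested` here are standalone hypothetical functions, not the methods on JoinedResult.

```python
def to_flat(source_name, source, target_name, target, score, status):
    """Prefix every key with its collection name and add match metadata."""
    flat = {f"{source_name}.{key}": value for key, value in source.items()}
    flat.update({f"{target_name}.{key}": value for key, value in target.items()})
    flat["_score"] = score
    flat["_status"] = status
    return flat


def to_nested(source_name, source, target_name, target):
    """Group each row's fields under its collection name."""
    return {source_name: dict(source), target_name: dict(target)}


employee = {"employee_id": "EMP001", "name": "Alice", "dept": "Engineering"}
skill = {"subject": "EMP001", "skill": "Python", "level": "Expert"}

flat = to_flat("employees", employee, "expertise", skill, 1.0, "MATCHED")
nested = to_nested("employees", employee, "expertise", skill)
print(flat["employees.name"], flat["expertise.skill"])
print(nested["expertise"]["level"])
```

The flat form suits tabular export, since the prefixes keep same-named fields from colliding. The nested form keeps each row intact.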

Step 7: Using .join() in Practice

With the intersection declared, use .join() in queries:

# Query employees, join to expertise
# (employee_schema is defined in the Complete Example below)
results = (
    hb.query("employees", schema=employee_schema)
    .search("engineering")
    .join("expertise")
)

for r in results:
    if r.is_matched:
        print(f"{r.source['name']} knows {r.target['skill']}")

Chaining Joins

Join through multiple collections:

results = (
    hb.query("employees")
    .search("senior engineer")
    .join("expertise")       # employees → expertise
    .join("projects")        # expertise → projects
    .join("budgets")         # projects → budgets
)

Error Handling

No Intersection Declared

from hybi.compose.intersections import NoIntersectionError

try:
    results = hb.query("employees").search("...").join("unknown_collection")
except NoIntersectionError as e:
    print(f"No intersection defined between employees and unknown_collection: {e}")

Circular Joins

from hybi.compose.intersections import CircularJoinError

try:
    results = query.join("A").join("B").join("A")  # Cycle!
except CircularJoinError as e:
    print(f"Detected cycle: {e.path}")
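
A cycle check of this kind amounts to tracking which collections a join path has already visited. The sketch below shows the idea in plain Python. `JoinCycleError` and `extend_path` are hypothetical stand-ins, and HyperBinder's own CircularJoinError may be raised under different conditions.

```python
class JoinCycleError(Exception):
    """Raised when a join path would revisit a collection."""

    def __init__(self, path):
        super().__init__(" -> ".join(path))
        self.path = path


def extend_path(path, target):
    """Return path + [target], refusing to revisit a collection."""
    if target in path:
        raise JoinCycleError(path + [target])
    return path + [target]


path = ["employees"]
path = extend_path(path, "A")
path = extend_path(path, "B")

cycle = None
try:
    extend_path(path, "A")  # "A" is already on the path
except JoinCycleError as e:
    cycle = e.path
    print("Detected cycle:", " -> ".join(cycle))
```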

Complete Example

#!/usr/bin/env python3
"""Intersections Tutorial: Complete Example"""

from hybi import HyperBinder
from hybi.compose import Triple, Field, Encoding

# Connect
hb = HyperBinder("http://localhost:8000")

# Define schemas
employee_schema = Triple(
    subject=Field("employee_id", encoding=Encoding.EXACT),
    predicate=Field("role"),
    object=Field("department"),
)

expertise_schema = Triple(
    subject=Field("employee_id", encoding=Encoding.EXACT),
    predicate=Field("skill"),
    object=Field("level"),
)

# Ingest data
# (employees_df and expertise_df are pandas DataFrames whose columns match the schemas above)
hb.ingest(employees_df, collection="employees", schema=employee_schema)
hb.ingest(expertise_df, collection="expertise", schema=expertise_schema)

# Declare intersection
hb.intersect("employees.employee_id", "expertise.employee_id")

# Query with join
results = (
    hb.query("employees", schema=employee_schema)
    .find(department="Engineering")
    .join("expertise")
)

# Process results
for r in results.filter_matched():
    print(f"{r.source['employee_id']}: {r.target['skill']} ({r.target['level']})")

Step 8: Cross-Encoding Joins (Flexible Mode)

What if your fields have different encoding types? For example:

  • employees.employee_id uses EXACT encoding
  • expertise.topic uses SEMANTIC encoding

By default, these fields can't intersect because their encodings are incompatible. Flexible mode solves this with explicit link bindings.

The Problem

# This won't work in strict mode:
# EXACT (employee_id) ↔ SEMANTIC (topic) = Encoding mismatch!
hb.intersect("employees.employee_id", "expertise.topic")  # Error!

The Solution: Flexible Intersections

import pandas as pd

# 1. Declare a FLEXIBLE intersection
ix = hb.intersect_flexible("employees.employee_id", "expertise.topic")

# 2. Provide explicit link mappings
links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP002", "EMP003"],
    "topic": ["machine learning", "databases", "cloud computing"]
})
hb.populate_links(ix, links_df, "emp_id", "topic")

# 3. Now the join works!
results = (
    hb.query("employees")
    .filter(employee_id="EMP001")
    .join("expertise")
)

for r in results:
    if r.is_matched:
        print(f"{r.source['employee_id']} → {r.target['topic']}")
        # EMP001 → machine learning

Links are bidirectional value mappings:

| Source (employee_id) | Target (topic) |
|---|---|
| EMP001 | machine learning |
| EMP002 | databases |
| EMP003 | cloud computing |

The join uses these mappings instead of encoding-based matching:

  1. Query employees, get source values (EMP001, EMP002)
  2. Look up link mappings → ["machine learning", "databases"]
  3. Match against target results
  4. Return joined rows
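
The four steps above can be sketched in plain Python. This is an illustration of the lookup flow only, assuming the links are held as a simple dictionary. The variable names are hypothetical, and the real join runs inside HyperBinder.

```python
# Link mappings from the table above: source value -> list of target values.
links = {
    "EMP001": ["machine learning"],
    "EMP002": ["databases"],
    "EMP003": ["cloud computing"],
}
source_rows = [{"employee_id": "EMP001"}, {"employee_id": "EMP002"}]
target_rows = [
    {"topic": "machine learning"},
    {"topic": "databases"},
    {"topic": "cloud computing"},
]

# 1. Query employees, get source values.
source_values = [row["employee_id"] for row in source_rows]

# 2. Look up link mappings.
linked_topics = [topic for value in source_values for topic in links.get(value, [])]

# 3. Match against target results.
targets_by_topic = {row["topic"]: row for row in target_rows}

# 4. Return joined rows as (source, target) pairs.
joined = [
    (src, targets_by_topic[topic])
    for src in source_rows
    for topic in links.get(src["employee_id"], [])
    if topic in targets_by_topic
]

print(linked_topics)
for src, tgt in joined:
    print(src["employee_id"], "->", tgt["topic"])
```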

A single source can link to multiple targets:

links_df = pd.DataFrame({
    "emp_id": ["EMP001", "EMP001", "EMP002"],  # EMP001 appears twice!
    "topic": ["machine learning", "AI", "databases"]
})
hb.populate_links(ix, links_df, "emp_id", "topic")

# EMP001 now matches BOTH "machine learning" AND "AI"
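
To see what a bidirectional, one-to-many mapping looks like as data, the sketch below builds a forward and a reverse index from the same link pairs in plain Python. Reading "bidirectional" as "can be looked up from either side" is an assumption, and the `forward` and `reverse` names are hypothetical.

```python
from collections import defaultdict

# (emp_id, topic) pairs, matching the DataFrame rows above.
pairs = [
    ("EMP001", "machine learning"),
    ("EMP001", "AI"),
    ("EMP002", "databases"),
]

forward = defaultdict(list)  # employee_id -> topics
reverse = defaultdict(list)  # topic -> employee_ids
for emp_id, topic in pairs:
    forward[emp_id].append(topic)
    reverse[topic].append(emp_id)

print(forward["EMP001"])
print(reverse["databases"])
```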

When to Use Flexible Mode

| Use Flexible Mode | Use Strict Mode |
|---|---|
| EXACT↔SEMANTIC fields | Same encoding types |
| Explicit value mappings needed | Natural equality works |
| Foreign-key-like relationships | Self-joining collections |

Key Takeaways

  1. DECLARE: hb.intersect("source.field", "target.field")
     Tells HyperBinder how collections relate

  2. JOIN: hb.query("source").search("...").join("target")
     Executes the cross-collection query

  3. ACCESS: result.source["field"], result.target["field"]
     Direct access to matched data

  4. CHECK: result.is_matched, result.is_null, result.is_no_match
     Know the quality of each match

  5. FILTER: result_set.filter_matched()
     Get only confident matches

  6. FLEXIBLE MODE: hb.intersect_flexible() + hb.populate_links()
     Enable cross-encoding joins with explicit mappings

Next Steps