Fields & Encoding¶
Configure how individual fields are encoded into hypervectors.
Field¶
The Field class configures a single column's encoding behavior.
from hybi.compose import Field, Encoding
Field(
    name="description",           # Column name (optional, inferred from slot)
    encoding=Encoding.SEMANTIC,   # How to encode values
    weight=1.5,                   # Importance in similarity (default 1.0)
    similar_within=0.1,           # Scale for NUMERIC/TEMPORAL encoding
    searchable=True,              # Include in search (default True)
    mode="rank",                  # 'filter' or 'rank' (default 'rank')
    threshold=0.0,                # Minimum similarity for filter mode (0-1)
)
hybi.compose.Field
dataclass
¶
Configuration for a single field in a Compose schema.
Field.name specifies which DataFrame COLUMN to use for this slot. The slot name (e.g., "subject" in Triple) is separate from the column name.
Column Name Resolution
- If Field.name is provided, use that as the column name
- If Field.name is None, use the slot name as the column name
Example
Column name matches slot name (most common)¶
Triple(
    subject=Field("subject"),  # Uses column "subject"
    predicate=Field("predicate"),
    object=Field("object"),
)
Column name differs from slot name¶
Triple(
    subject=Field("entity_name"),  # Uses column "entity_name" for subject slot
    predicate=Field("relation_type"),
    object=Field("target_entity"),
)
Field with custom encoding¶
Field("category", encoding=Encoding.EXACT)
Field with weight boost¶
Field("description", weight=2.0)
Numeric field with scale preset (recommended)¶
Field("price", encoding=Encoding.NUMERIC, similar_within=NumericScale.DOLLARS)
Numeric field with custom scale¶
Field("score", encoding=Encoding.NUMERIC, similar_within=25)
name = None
class-attribute
instance-attribute
¶
DataFrame column name to use for this slot.
If None, the slot name is used as the column name. Example: Triple(subject=Field()) uses the "subject" column.
encoding = Encoding.SEMANTIC
class-attribute
instance-attribute
¶
How values are encoded into hypervectors.
weight = 1.0
class-attribute
instance-attribute
¶
Importance weight in similarity calculations.
Higher weights make this field more influential in search. The final score is a weighted average: Σ(similarity × weight) / Σ(weight).
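The weighted average above can be sketched in plain Python. This is only an illustration of the formula Σ(similarity × weight) / Σ(weight); the function name `weighted_score` and the per-field similarity values are hypothetical and not part of hybi.

```python
def weighted_score(similarities, weights):
    """Combine per-field similarities as sum(sim * w) / sum(w)."""
    total_weight = sum(weights[name] for name in similarities)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(similarities[name] * weights[name] for name in similarities)
    return weighted_sum / total_weight


# Hypothetical per-field similarities for a single candidate record.
sims = {"title": 0.9, "description": 0.5, "category": 1.0}
weights = {"title": 2.0, "description": 1.0, "category": 0.5}

score = weighted_score(sims, weights)
print(round(score, 3))  # (1.8 + 0.5 + 0.5) / 3.5 = 0.8
```

Doubling the weight of `title` pulls the score toward the title similarity without changing the 0-1 range of the result.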
similar_within = 0.1
class-attribute
instance-attribute
¶
Scale for NUMERIC encoding: the distance at which values are "similar".
Values exactly this far apart have ~60% similarity; closer values score higher. Values at 2× this distance have ~14% similarity, and values at 3× this distance have ~1% similarity.
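The quoted figures (~60%, ~14%, ~1% at 1×, 2×, and 3× the scale) are what a Gaussian kernel produces. The sketch below assumes that kernel purely for illustration; the source does not state which function hybi uses internally, and `gaussian_similarity` is a hypothetical name.

```python
import math


def gaussian_similarity(a, b, similar_within):
    """Gaussian kernel: exp(-0.5 * (distance / scale) ** 2). Assumed, not confirmed."""
    distance = abs(a - b)
    return math.exp(-0.5 * (distance / similar_within) ** 2)


scale = 50  # same number as the NumericScale.DOLLARS preset
for multiple in (1, 2, 3):
    sim = gaussian_similarity(100, 100 + multiple * scale, scale)
    print(f"{multiple}x scale apart: {sim:.3f}")
# 1x scale apart: 0.607
# 2x scale apart: 0.135
# 3x scale apart: 0.011
```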
Use NumericScale presets for common data types
similar_within=NumericScale.DOLLARS     # $50 difference = similar
similar_within=NumericScale.RATING_5    # 0.5 star difference = similar
similar_within=NumericScale.PERCENTAGE  # 5 points difference = similar
Or use a custom number
similar_within=100 # 100 units difference = similar
For NUMERIC: controls search scoring distance (default 0.1). For TEMPORAL: controls encoding resolution in seconds (use TemporalResolution presets).
searchable = True
class-attribute
instance-attribute
¶
Whether to include this field in search queries.
required = False
class-attribute
instance-attribute
¶
Whether the field must be present (not null) in data.
mode = MODE_RANK
class-attribute
instance-attribute
¶
Query mode: 'filter' or 'rank'.
'filter': candidates with similarity < threshold on this field are eliminated. Multiple filter fields are conjunctive (AND). Filter fields do NOT contribute to the final ranking score.
'rank': candidates are scored by weighted similarity on this field. The final score is weighted average of rank-field similarities among candidates that survived all filters.
Default: 'rank' (existing behavior).
threshold = DEFAULT_THRESHOLD
class-attribute
instance-attribute
¶
Minimum similarity for filter mode.
Only used when mode='filter'. Candidates with similarity below this threshold on this field are eliminated before ranking.
For exact encoding, 0.5 is a good threshold (exact matches score ~1.0, mismatches score ~0.0). For semantic encoding, use lower thresholds (0.3-0.5) to allow fuzzy matching.
Default: 0.0 (no filtering in rank mode).
__init__(name=None, encoding=Encoding.SEMANTIC, weight=1.0, similar_within=0.1, mode=MODE_RANK, threshold=DEFAULT_THRESHOLD, searchable=True, required=False, phase_dim=None)
¶
Encoding¶
The Encoding enum specifies how values become vectors.
| Encoding | Behavior | Use When |
|---|---|---|
| SEMANTIC | Similar meaning → similar vectors | Text, names, descriptions |
| EXACT | Each unique value → distinct vector | IDs, categories, types |
| NUMERIC | Nearby numbers → similar vectors | Prices, counts, ratings |
| TEMPORAL | Nearby dates → similar vectors | Timestamps, dates, event times |
| HIERARCHICAL | Parent-child similarity encoding | Taxonomies, org structures |
from hybi.compose import Field, Encoding
# Semantic: "apple" is similar to "fruit"
Field("description", encoding=Encoding.SEMANTIC)
# Exact: "category_a" is NOT similar to "category_b"
Field("type", encoding=Encoding.EXACT)
# Numeric: 100 is similar to 105, dissimilar to 1000
Field("price", encoding=Encoding.NUMERIC, similar_within=50)
# Temporal: dates days apart are similar, months apart are not
Field("created", encoding=Encoding.TEMPORAL)
hybi.compose.Encoding
¶
Bases: Enum
How a field value is encoded into hypervectors.
Different encoding types are suited for different data types and query patterns.
SEMANTIC = auto()
class-attribute
instance-attribute
¶
Similar values produce similar vectors (default).
Best for: text, embeddings, semantic content. Enables: similarity search, semantic matching.
EXACT = auto()
class-attribute
instance-attribute
¶
Each unique value gets a random orthogonal vector.
Best for: categorical values, IDs, enums. Enables: exact match queries, slot-based unbinding.
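To see why independent random vectors behave as distinct codes, the sketch below draws random ±1 vectors and compares them with cosine similarity. It is a generic illustration of random high-dimensional codes, not hybi's actual vector type or dimension; `random_bipolar`, `cosine`, and the dimension of 10,000 are assumptions made for the example.

```python
import random


def random_bipolar(dim, rng):
    """Draw a random vector of +1/-1 entries."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v)


rng = random.Random(0)
dim = 10_000  # hypothetical dimension, chosen only for the demonstration

# One independent random vector per unique value.
codebook = {value: random_bipolar(dim, rng) for value in ("category_a", "category_b")}

same = cosine(codebook["category_a"], codebook["category_a"])
different = cosine(codebook["category_a"], codebook["category_b"])
print(f"same value: {same:.3f}, different values: {different:.3f}")
```

A value compared with itself scores 1.0, while two different values score close to 0, which is the behavior the ~1.0 versus ~0.0 figures for exact matches and mismatches describe.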
NUMERIC = auto()
class-attribute
instance-attribute
¶
Proximity-preserving encoding for continuous values.
Nearby values produce similar representations; distant values are dissimilar. Search scoring uses numeric distance (controlled by similar_within).
Best for: prices, measurements, quantities, counts. Enables: range queries, numeric similarity, proximity-aware compounds.
TEMPORAL = auto()
class-attribute
instance-attribute
¶
Proximity-preserving encoding for dates and timestamps.
Nearby dates produce similar representations; distant dates are dissimilar. Default resolution is day-level. Finer resolution (hour, minute) is planned via the TemporalResolution presets — currently day-level only.
Search scoring uses recency-weighted temporal distance.
- Storage: Unix epoch seconds (i64 precision).
- Timezone: Naive datetime/strings use local time; timezone-aware respected.
- NULLs: Field omitted from storage; queries treat as "no match".
Best for: dates, event times, filing dates, transaction dates. Enables: temporal queries, time-range filtering, time-aware compounds.
HIERARCHICAL = auto()
class-attribute
instance-attribute
¶
Parent-child similarity encoding.
Best for: categories, taxonomies, org structures. Enables: hierarchical queries, level-aware matching.
Numeric Encoding¶
NUMERIC encoding produces proximity-preserving vectors: nearby numbers are similar, distant numbers are dissimilar. This means NUMERIC fields contribute meaningful signal to compound similarity — a record with price=100 is more similar to price=105 than to price=10000.
The similar_within parameter controls search scoring scale (the distance at which values have ~60% similarity). Use NumericScale presets for common data types:
from hybi.compose import Field, Encoding, NumericScale
# Prices in dollars
Field("price", encoding=Encoding.NUMERIC, similar_within=NumericScale.DOLLARS)
# 5-star ratings
Field("rating", encoding=Encoding.NUMERIC, similar_within=NumericScale.RATING_5)
# Percentages
Field("score", encoding=Encoding.NUMERIC, similar_within=NumericScale.PERCENTAGE)
# Custom scale
Field("temperature", encoding=Encoding.NUMERIC, similar_within=2.0)
hybi.compose.NumericScale
¶
Preset scales for NUMERIC field encoding.
These presets define "similar_within" values for common data types. The value represents the distance at which two numbers have ~60% similarity.
Usage
Field("price", encoding=Encoding.NUMERIC, similar_within=NumericScale.DOLLARS)
Field("rating", encoding=Encoding.NUMERIC, similar_within=NumericScale.RATING_5)
How it works
- similar_within=50 means values within 50 units have high similarity (~60%+).
- Values at exactly 50 apart have ~60% similarity.
- Values at 100 apart (2x) have ~14% similarity.
- Values at 150 apart (3x) have ~1% similarity.
Custom values
You can use any positive number: similar_within=100 for custom scales.
CENTS = 10
class-attribute
instance-attribute
¶
Price in cents: $0.10 difference = high similarity. For micro-transactions.
DOLLARS = 50
class-attribute
instance-attribute
¶
Price in dollars: $50 difference = high similarity. For consumer goods.
DOLLARS_LUXURY = 500
class-attribute
instance-attribute
¶
Price in dollars: $500 difference = high similarity. For luxury items.
RATING_5 = 0.5
class-attribute
instance-attribute
¶
5-star rating: 0.5 star difference = high similarity.
RATING_10 = 1.0
class-attribute
instance-attribute
¶
10-point rating: 1 point difference = high similarity.
RATING_100 = 10
class-attribute
instance-attribute
¶
100-point rating (percentages): 10 point difference = high similarity.
PERCENTAGE = 5
class-attribute
instance-attribute
¶
Percentage (0-100): 5 percentage points = high similarity.
FRACTION = 0.05
class-attribute
instance-attribute
¶
Fraction (0.0-1.0): 0.05 difference = high similarity.
TEMPERATURE_C = 2
class-attribute
instance-attribute
¶
Temperature in Celsius: 2°C difference = high similarity.
TEMPERATURE_F = 4
class-attribute
instance-attribute
¶
Temperature in Fahrenheit: 4°F difference = high similarity.
SMALL_COUNT = 1
class-attribute
instance-attribute
¶
Small counts (0-10): 1 unit difference = high similarity.
MEDIUM_COUNT = 10
class-attribute
instance-attribute
¶
Medium counts (0-100): 10 unit difference = high similarity.
LARGE_COUNT = 100
class-attribute
instance-attribute
¶
Large counts (0-1000): 100 unit difference = high similarity.
SECONDS = 5
class-attribute
instance-attribute
¶
Duration in seconds: 5 seconds = high similarity.
MINUTES = 1
class-attribute
instance-attribute
¶
Duration in minutes: 1 minute = high similarity.
HOURS = 0.5
class-attribute
instance-attribute
¶
Duration in hours: 30 minutes = high similarity.
DAYS = 1
class-attribute
instance-attribute
¶
Duration in days: 1 day = high similarity.
Temporal Encoding¶
TEMPORAL encoding produces proximity-preserving vectors for dates and timestamps: nearby dates are similar, distant dates are dissimilar.
The encoding resolution determines what "nearby" means. By default, resolution is day-level — events on the same day are nearly identical, events a week apart are dissimilar.
For finer or coarser resolution, set similar_within using TemporalResolution presets:
from hybi.compose import Field, Encoding, TemporalResolution
# Day resolution (default) — for filing dates, event dates
Field("created", encoding=Encoding.TEMPORAL)
# Hour resolution — for appointment times, shift logs
Field("scheduled", encoding=Encoding.TEMPORAL, similar_within=TemporalResolution.HOUR)
# Minute resolution — for transaction timestamps, sensor readings
Field("logged_at", encoding=Encoding.TEMPORAL, similar_within=TemporalResolution.MINUTE)
# Week resolution — for long-term trends, quarterly data
Field("quarter_start", encoding=Encoding.TEMPORAL, similar_within=TemporalResolution.WEEK)
At each resolution level, events one unit apart have high similarity (roughly 0.8 to 0.9, depending on the resolution), and events five units apart are dissimilar:
| Resolution | 1 unit apart | 5 units apart | Example use case |
|---|---|---|---|
| MINUTE | ~0.78 | dissimilar | Transaction logs |
| HOUR | ~0.92 | dissimilar | Appointment scheduling |
| DAY (default) | ~0.90 | dissimilar | Event tracking |
| WEEK | ~0.90 | dissimilar | Trend analysis |
Accepted date formats: YYYY-MM-DD, YYYY-MM-DDTHH:MM:SS, YYYY-MM-DD HH:MM:SS, RFC 3339 with timezone, Unix epoch seconds as string.
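As a rough guide to what these formats look like in practice, the sketch below normalizes each one to Unix epoch seconds using only the Python standard library. It is not hybi's parser; `parse_to_epoch_seconds` is a hypothetical helper, and treating naive values as local time follows the timezone behavior described for TEMPORAL fields.

```python
from datetime import datetime


def parse_to_epoch_seconds(text):
    """Convert one of the listed date formats to Unix epoch seconds (illustrative)."""
    text = text.strip()
    # Unix epoch seconds passed as a string, e.g. "1705314600".
    if text.lstrip("-").isdigit():
        return int(text)
    # Older Python versions reject a trailing "Z" in fromisoformat().
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    # Naive datetimes are interpreted as local time by timestamp();
    # timezone-aware datetimes keep their own offset.
    return int(parsed.timestamp())


for sample in (
    "2024-01-15",
    "2024-01-15T10:30:00",
    "2024-01-15 10:30:00",
    "2024-01-15T10:30:00+00:00",
    "1705314600",
):
    print(sample, "->", parse_to_epoch_seconds(sample))
```

The first three results depend on the machine's local timezone, so no fixed output is shown for them.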
hybi.compose.TemporalResolution
¶
Preset resolutions for TEMPORAL field encoding.
Controls the time scale at which dates are distinguishable. The value is the number of seconds per unit — the encoding divides epoch seconds by this to set the resolution.
Usage
Field("created", encoding=Encoding.TEMPORAL)  # default: day
Field("logged", encoding=Encoding.TEMPORAL, similar_within=TemporalResolution.HOUR)
Field("traded", encoding=Encoding.TEMPORAL, similar_within=TemporalResolution.MINUTE)
MINUTE = 60
class-attribute
instance-attribute
¶
Minute resolution: events minutes apart are distinguishable.
HOUR = 3600
class-attribute
instance-attribute
¶
Hour resolution: events hours apart are distinguishable.
DAY = 86400
class-attribute
instance-attribute
¶
Day resolution (default): events days apart are distinguishable.
WEEK = 604800
class-attribute
instance-attribute
¶
Week resolution: events weeks apart are distinguishable.
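Because each preset is a number of seconds per unit, the same pair of timestamps is "close" or "far" depending on the resolution chosen. The sketch below only performs the division described above; `units_apart` is a hypothetical helper, and how hybi turns the resulting distance into a similarity is not shown here.

```python
# Seconds per unit, matching the TemporalResolution values listed above.
MINUTE, HOUR, DAY, WEEK = 60, 3600, 86400, 604800


def units_apart(epoch_a, epoch_b, resolution_seconds):
    """Distance between two epoch timestamps, measured in resolution units."""
    return abs(epoch_a - epoch_b) / resolution_seconds


a = 1_700_000_000          # an arbitrary epoch timestamp
b = a + 2 * HOUR           # two hours later

print(units_apart(a, b, MINUTE))  # 120.0 units: far apart at minute resolution
print(units_apart(a, b, HOUR))    # 2.0 units
print(units_apart(a, b, DAY))     # well under one unit: nearly the same at day resolution
```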
Field Weights¶
Weights control relative importance in similarity calculations:
from hybi.compose import Bundle, Field, Encoding
schema = Bundle(
    fields={
        "title": Field(encoding=Encoding.SEMANTIC, weight=2.0),  # 2x importance
        "description": Field(encoding=Encoding.SEMANTIC, weight=1.0),
        "category": Field(encoding=Encoding.EXACT, weight=0.5),  # Half weight
    }
)
When searching, higher-weighted fields contribute more to the similarity score.
Filter Mode¶
By default, all fields use mode="rank" — their similarity contributes to the final weighted score. Setting mode="filter" changes a field to act as a hard gate: candidates below the threshold are eliminated before ranking.
from hybi.compose import Bundle, Field, Encoding
schema = Bundle(
    fields={
        # Filter: only exact-match categories survive
        "category": Field(encoding=Encoding.EXACT, mode="filter", threshold=0.5),
        # Rank: survivors scored by description similarity
        "description": Field(encoding=Encoding.SEMANTIC, weight=1.0),
    }
)
Filter fields do not contribute to the ranking score — they only gate candidates. Multiple filter fields are conjunctive (AND): a candidate must pass all filters to be ranked.
| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | str | "rank" | "filter" eliminates candidates below threshold; "rank" contributes to score |
| threshold | float | 0.0 | Minimum similarity for filter mode (0.0 to 1.0) |
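The filter-then-rank behavior can be sketched as a two-stage pass over candidates. This is a plain-Python illustration of the rules stated above (filters are ANDed, filter fields do not contribute to the score, rank fields are averaged by weight); `filter_then_rank`, the dictionary layout, and the similarity values are all hypothetical.

```python
def filter_then_rank(candidates, fields):
    """candidates: {id: {field: similarity}}; fields: {field: {"mode", "threshold", "weight"}}."""
    filters = {name: cfg for name, cfg in fields.items() if cfg["mode"] == "filter"}
    rankers = {name: cfg for name, cfg in fields.items() if cfg["mode"] == "rank"}
    total_weight = sum(cfg["weight"] for cfg in rankers.values())

    scored = []
    for cand_id, sims in candidates.items():
        # A candidate must pass every filter field (conjunctive AND).
        if any(sims[name] < cfg["threshold"] for name, cfg in filters.items()):
            continue
        # Only rank fields contribute to the final weighted average.
        weighted = sum(sims[name] * cfg["weight"] for name, cfg in rankers.items())
        score = weighted / total_weight if total_weight else 0.0
        scored.append((cand_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


fields = {
    "category": {"mode": "filter", "threshold": 0.5, "weight": 1.0},
    "description": {"mode": "rank", "threshold": 0.0, "weight": 1.0},
}
candidates = {
    "doc1": {"category": 1.0, "description": 0.4},
    "doc2": {"category": 0.0, "description": 0.9},  # fails the category filter
    "doc3": {"category": 1.0, "description": 0.7},
}

ranked = filter_then_rank(candidates, fields)
print(ranked)  # [('doc3', 0.7), ('doc1', 0.4)]
```

Note that `doc2` has the best description match but never reaches the ranking stage, which is the point of using a filter rather than a heavily weighted rank field.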
Filter mode can be set at schema level (applies to all queries) or overridden per query in search_slots():
# Schema-level default: category always filters
results = q.search_slots({
"category": "physics", # Uses schema mode="filter"
"description": "quantum entanglement",
})
# Query-level override: rank on category instead
results = q.search_slots({
"category": {"query": "physics", "mode": "rank", "weight": 2.0},
"description": "quantum entanglement",
})
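One way to picture the per-query override is as a merge of the query value over the schema-level settings. The sketch below assumes that keys not given in the query fall back to the schema defaults; the source does not spell out this merge rule, and `resolve_query_spec` is a hypothetical helper rather than a hybi function.

```python
def resolve_query_spec(schema_defaults, query_value):
    """Merge a per-query slot value over schema-level field settings (assumed behavior)."""
    spec = dict(schema_defaults)
    if isinstance(query_value, dict):
        # Dict form: explicit keys such as "mode" or "weight" override the schema.
        spec.update(query_value)
    else:
        # Plain value: only the query text is supplied; schema settings apply.
        spec["query"] = query_value
    return spec


schema_category = {"mode": "filter", "threshold": 0.5, "weight": 1.0}

plain = resolve_query_spec(schema_category, "physics")
overridden = resolve_query_spec(
    schema_category, {"query": "physics", "mode": "rank", "weight": 2.0}
)

print(plain["mode"], overridden["mode"])  # filter rank
```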
Encoding Selection Guide¶
flowchart LR
Q1{Categorical/ID?} -->|Yes| EXACT
Q1 -->|No| Q2{Numeric?}
Q2 -->|Yes| NUMERIC
Q2 -->|No| Q3{Date/Time?}
Q3 -->|Yes| TEMPORAL
Q3 -->|No| SEMANTIC
Examples¶
| Field Type | Example Values | Encoding | Notes |
|---|---|---|---|
| Product name | "MacBook Pro", "iPhone" | SEMANTIC | |
| Category | "electronics", "clothing" | EXACT | |
| Price | 999.99, 49.99 | NUMERIC | similar_within=NumericScale.DOLLARS |
| Description | "Lightweight laptop..." | SEMANTIC | |
| User ID | "user_12345" | EXACT | |
| Rating | 4.5, 3.0 | NUMERIC | similar_within=NumericScale.RATING_5 |
| Created date | "2024-01-15" | TEMPORAL | Day resolution by default |
| Logged time | "2024-01-15T10:30:00" | TEMPORAL | similar_within=TemporalResolution.MINUTE |
| Relationship type | "works_at", "knows" | EXACT | |
FieldPath¶
The FieldPath dataclass represents a path to a field within a nested schema. It's returned by schema.resolve_field() and used internally for field resolution.
from hybi.compose.fields import FieldPath
# FieldPath for a nested field
path = schema.resolve_field("subject_type")
print(path.parts) # ("subject", "left")
print(path.column_name) # "subject_type"
print(path.dot_notation) # "subject.left"
print(path.root_slot) # "subject"
print(path.is_nested) # True
Properties¶
| Property | Type | Description |
|---|---|---|
| parts | Tuple[str, ...] | Tuple of slot names from root to field |
| column_name | str | The DataFrame column name |
| field | Field | The Field configuration |
| dot_notation | str | Path as dot-separated string |
| root_slot | str | The top-level slot name |
| is_nested | bool | True if path has multiple parts |
hybi.compose.fields.FieldPath
dataclass
¶
Represents a path to a field within a nested schema.
For simple schemas: FieldPath(("subject",), "entity", field)
For nested schemas: FieldPath(("subject", "left"), "subject_type", field)
This enables queries using column names (what users see in DataFrames) rather than requiring knowledge of the internal slot structure.
Attributes:
| Name | Type | Description |
|---|---|---|
| parts | Tuple[str, ...] | Tuple of slot names forming the path from root to leaf |
| column_name | str | The DataFrame column name this path maps to |
| field | Field | The Field configuration at this path |
Example
schema = Triple(
    subject=Pair(
        left=Field("subject_type"),
        right=Field("subject_name"),
    ),
    predicate=Field("relation"),
    object=Field("target"),
)
field_map = schema.get_field_map()
field_map["subject_type"]
# FieldPath(parts=("subject", "left"), column_name="subject_type", field=Field(...))
parts
instance-attribute
¶
Tuple of slot names from root to this field.
column_name
instance-attribute
¶
The DataFrame column name this path maps to.
field
instance-attribute
¶
The Field configuration at this path.
dot_notation
property
¶
Return the path as dot-separated string: 'subject.left'.
root_slot
property
¶
Return the top-level slot name.
is_nested
property
¶
Whether this is a nested path (more than one part).
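The three derived properties follow directly from `parts`. The stand-in below mirrors the documented attributes and properties so the relationships can be tried without a schema; `FieldPathSketch` is a hypothetical class written for illustration and is not the library's implementation.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class FieldPathSketch:
    """Illustrative stand-in mirroring the documented FieldPath attributes."""

    parts: Tuple[str, ...]
    column_name: str
    field: Any = None  # the real class holds a Field configuration here

    @property
    def dot_notation(self) -> str:
        # ("subject", "left") -> "subject.left"
        return ".".join(self.parts)

    @property
    def root_slot(self) -> str:
        # The first part is the top-level slot.
        return self.parts[0]

    @property
    def is_nested(self) -> bool:
        # Nested means more than one part in the path.
        return len(self.parts) > 1


nested = FieldPathSketch(("subject", "left"), "subject_type")
simple = FieldPathSketch(("subject",), "entity")

print(nested.dot_notation, nested.root_slot, nested.is_nested)  # subject.left subject True
print(simple.dot_notation, simple.root_slot, simple.is_nested)  # subject subject False
```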
SchemaEvolution¶
The SchemaEvolution enum controls how schema changes are handled during subsequent ingest operations on a collection.
When using ADAPTIVE mode, you can optionally suppress evolution warnings for bulk/backfill jobs.
from hybi.compose import SchemaEvolution
hb.ingest(df, collection="users", schema=schema, evolution=SchemaEvolution.STRICT)
# ADAPTIVE mode with warnings suppressed
hb.ingest(
    df,
    collection="users",
    schema=schema,
    evolution=SchemaEvolution.ADAPTIVE,
    warn_schema_evolution=False,
)
| Mode | Behavior |
|---|---|
| ADAPTIVE | Default. Additive changes allowed with warnings. New fields auto-added. |
| STRICT | No changes allowed without explicit migration. Raises error on schema change. |
| LOCKED | Immutable schema. Even additive changes are blocked after first ingest. |
| Parameter | Type | Default | Description |
|---|---|---|---|
| warn_schema_evolution | bool | True | In ADAPTIVE mode, emit schema evolution warnings when additive changes are applied. Set to False to suppress warning noise in controlled pipelines. |
hybi.compose.SchemaEvolution
¶
Bases: Enum
Controls how schema changes are handled for a collection.
Schema evolution mode determines whether changes to the schema (adding fields, changing encodings, etc.) are allowed during subsequent ingest operations.
Example
hb.ingest(df, collection="users", schema=schema,
          evolution=SchemaEvolution.STRICT)
ADAPTIVE = 'adaptive'
class-attribute
instance-attribute
¶
Default mode: additive changes allowed with warnings.
- New fields in data are automatically added to schema
- Breaking changes (removing fields) require explicit flag
- Encoding changes require explicit migration
- Best for: development, exploration, evolving data
STRICT = 'strict'
class-attribute
instance-attribute
¶
No changes allowed without explicit migration.
- Any schema change raises an error
- Requires explicit migration for changes
- Best for: production, regulated environments
LOCKED = 'locked'
class-attribute
instance-attribute
¶
Immutable schema - no changes allowed at all.
- Even additive changes are blocked
- Schema is frozen after first ingest
- Best for: audit logs, compliance, immutable data
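The three modes differ mainly in how they react when incoming data carries a field the collection has not seen. The sketch below models only that additive case, using the enum's string values; `apply_evolution`, the `ValueError`, and the warning text are hypothetical and do not reflect the errors or messages hybi actually emits.

```python
import warnings


def apply_evolution(mode, known_fields, incoming_fields, warn_schema_evolution=True):
    """Return the field list after ingest, or raise if the mode forbids the change."""
    added = [name for name in incoming_fields if name not in known_fields]
    if not added:
        return list(known_fields)
    if mode in ("strict", "locked"):
        # STRICT and LOCKED both reject additive changes in this simplified model.
        raise ValueError(f"schema change rejected in {mode} mode: new fields {added}")
    # ADAPTIVE: new fields are added, optionally with a warning.
    if warn_schema_evolution:
        warnings.warn(f"schema evolved: added fields {added}")
    return list(known_fields) + added


known = ["id", "name"]
incoming = ["id", "name", "email"]

evolved = apply_evolution("adaptive", known, incoming, warn_schema_evolution=False)
print(evolved)  # ['id', 'name', 'email']

try:
    apply_evolution("strict", known, incoming)
except ValueError as exc:
    print(exc)
```

Removing fields and changing encodings are left out of the sketch because the source only says they need an explicit flag or migration without describing the mechanism.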