# Validation

Schema validation utilities for ensuring data matches your compose schemas.

## Overview

The validation module provides tools to:

- Verify DataFrame columns match schema fields
- Check encoding-specific type constraints
- Infer schemas from DataFrames automatically
- Validate queries against schema capabilities

## Quick Start
```python
from hybi.compose import Triple, Field, validate_schema_against_dataframe
import pandas as pd

# Define schema
schema = Triple(
    subject=Field("person"),
    predicate=Field("relation"),
    object=Field("company"),
)

# Check if the DataFrame matches
df = pd.DataFrame({
    "person": ["Alice", "Bob"],
    "relation": ["works_at", "works_at"],
    "company": ["Acme", "Globex"],
})

errors = validate_schema_against_dataframe(schema, df)
if errors:
    for e in errors:
        print(f"{e.severity}: {e.field} - {e.message}")
else:
    print("Validation passed!")
```
## ValidationResult

Result object returned by validation functions.

### `hybi.compose.ValidationResult` *(dataclass)*
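The full field list of `ValidationResult` is not reproduced here, but the examples on this page read `severity`, `field`, and `message`, and test the `is_error` / `is_warning` predicates. A minimal stand-in with that shape (an illustrative sketch, not the library's actual definition):

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    severity: str  # "error" or "warning"
    field: str     # the schema field / DataFrame column involved
    message: str   # human-readable description of the problem

    @property
    def is_error(self) -> bool:
        return self.severity == "error"

    @property
    def is_warning(self) -> bool:
        return self.severity == "warning"

r = ValidationResult("error", "person", "column missing from DataFrame")
print(f"{r.severity}: {r.field} - {r.message}")
```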
## ValidationStrategy

Control how thoroughly validation checks data.

| Strategy | Description | Speed |
|---|---|---|
| `FULL` | Check every row | Slowest, most thorough |
| `SAMPLED` | Check a random sample (default) | Good balance |
| `SCHEMA_ONLY` | Only check column existence | Fastest |
### `hybi.compose.ValidationStrategy`

Bases: `Enum`

Strategy for validating a schema against a DataFrame. Different strategies trade off thoroughness against performance.

- `FULL = 'full'`: check every row. Most thorough but slowest.
- `SAMPLED = 'sampled'`: check a random sample of rows. Good balance of speed and coverage.
- `SCHEMA_ONLY = 'schema_only'`: only check column existence. Fastest but least thorough.
## ValidationConfig

Configuration for validation behavior.

```python
from hybi.compose import ValidationConfig, ValidationStrategy

# Fast validation for large datasets
config = ValidationConfig(
    strategy=ValidationStrategy.SAMPLED,
    sample_size=1000,
)

# Thorough validation for critical data
config = ValidationConfig(
    strategy=ValidationStrategy.FULL,
)
```
### `hybi.compose.ValidationConfig` *(dataclass)*

Configuration for schema validation.

Example:

```python
>>> config = ValidationConfig(
...     strategy=ValidationStrategy.SAMPLED,
...     sample_size=1000,
... )
>>> errors = validate_schema_against_dataframe(schema, df, config=config)
```

Attributes:

- `strategy = ValidationStrategy.SAMPLED`: validation strategy to use. Default: `SAMPLED` for a good balance.
- `sample_size = 1000`: number of rows to sample for the `SAMPLED` strategy. Default: 1000.
- `sample_seed = 42`: random seed for reproducible sampling. Default: 42.
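The seed matters because `SAMPLED` validation inspects a random subset of rows: with a fixed seed the same rows are checked on every run, so results are reproducible. The effect can be seen with plain pandas sampling (shown for illustration; this page does not specify the library's internal sampling mechanism):

```python
import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# Same seed -> same sample -> identical validation inputs across runs
a = df.sample(n=10, random_state=42)
b = df.sample(n=10, random_state=42)
print(a.index.equals(b.index))  # True
```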
## Validation Functions

### validate_schema_against_dataframe

Main validation function that checks schema compatibility.

```python
from hybi.compose import (
    Triple, Field, Encoding,
    validate_schema_against_dataframe,
    ValidationConfig,
    ValidationStrategy,
)

schema = Triple(
    subject=Field("entity", encoding=Encoding.SEMANTIC),
    predicate=Field("relation", encoding=Encoding.EXACT),
    object=Field("target", encoding=Encoding.SEMANTIC),
)

# Basic validation (uses the SAMPLED strategy by default)
errors = validate_schema_against_dataframe(schema, df)

# Fast validation for large datasets
errors = validate_schema_against_dataframe(
    schema, df,
    config=ValidationConfig(strategy=ValidationStrategy.SCHEMA_ONLY),
)

# Thorough validation
errors = validate_schema_against_dataframe(
    schema, df,
    config=ValidationConfig(strategy=ValidationStrategy.FULL),
)
```

Checks performed:

- Column existence: all schema field names exist as DataFrame columns
- Null values: required fields have no null values (`SAMPLED`/`FULL` only)
- Type constraints: `NUMERIC` fields are numeric, `TEMPORAL` fields are datetime (`SAMPLED`/`FULL` only)
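These three checks can be approximated in plain pandas. The following is an illustrative re-implementation of the documented behavior, not the library's code; in the real function the column-to-encoding mapping comes from the schema rather than a dict:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_datetime64_any_dtype

def sketch_validate(df: pd.DataFrame, fields: dict) -> list:
    """fields maps column name -> encoding name ('SEMANTIC', 'EXACT', 'NUMERIC', 'TEMPORAL')."""
    errors = []
    for col, encoding in fields.items():
        # 1. Column existence
        if col not in df.columns:
            errors.append(f"error: {col} - missing column")
            continue
        # 2. Null values in required fields
        if df[col].isna().any():
            errors.append(f"error: {col} - contains null values")
        # 3. Encoding-specific type constraints
        if encoding == "NUMERIC" and not is_numeric_dtype(df[col]):
            errors.append(f"error: {col} - expected numeric dtype")
        if encoding == "TEMPORAL" and not is_datetime64_any_dtype(df[col]):
            errors.append(f"error: {col} - expected datetime dtype")
    return errors

df = pd.DataFrame({"entity": ["Alice", None], "score": ["high", "low"]})
print(sketch_validate(df, {"entity": "SEMANTIC", "score": "NUMERIC", "target": "SEMANTIC"}))
```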
### `hybi.compose.validate_schema_against_dataframe(schema, df, strict=True, config=None)`

Validate that a schema matches a DataFrame.

Checks:

- All schema field names exist as DataFrame columns
- Required fields have no null values (if the strategy allows)
- Encoding-specific type constraints (if the strategy allows)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `BaseMolecule` | The molecule schema to validate | *required* |
| `df` | `DataFrame` | DataFrame to validate against | *required* |
| `strict` | `bool` | If True (default), all errors are blocking. If False, collect all errors without stopping. | `True` |
| `config` | `Optional[ValidationConfig]` | Validation configuration. Default uses the SAMPLED strategy with 1000 rows for a good balance of speed and coverage. | `None` |

Returns:

| Type | Description |
|---|---|
| `List[ValidationResult]` | List of ValidationResult objects (empty if valid) |

Example:

```python
>>> schema = Triple(
...     subject=Field("person"),
...     predicate=Field("relation"),
...     object=Field("company"),
... )
>>> errors = validate_schema_against_dataframe(schema, df)
>>> if errors:
...     for e in errors:
...         print(f"{e.severity}: {e.message}")
```
### validate_query_for_schema

Validate that a query is valid for a given schema type.

```python
from hybi.compose import Triple, Field, validate_query_for_schema

schema = Triple(
    subject=Field("entity"),
    predicate=Field("relation"),
    object=Field("target"),
)

# Valid queries
validate_query_for_schema(schema, "find", entity="Alice")  # OK
validate_query_for_schema(schema, "search", top_k=10)      # OK

# Invalid query method
validate_query_for_schema(schema, "at", position=5)
# Raises SchemaError: 'at' not supported by Triple
```
### `hybi.compose.validate_query_for_schema(schema, query_method, **kwargs)`

Validate that a query is valid for a schema.

Checks:

- The query method is supported by the schema type
- Field names in kwargs are valid for the schema (including nested fields)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `BaseMolecule` | The molecule schema | *required* |
| `query_method` | `str` | Name of the query method (e.g., 'find', 'traverse') | *required* |
| `**kwargs` | `Any` | Query parameters (field values, options) | `{}` |

Raises:

| Type | Description |
|---|---|
| `SchemaError` | If the query method is not supported by the schema |
| `SlotError` | If the query uses invalid field names |
| `AmbiguousFieldError` | If a field name appears in multiple nested paths |

Example:

```python
>>> schema = Triple(subject=Field("entity"), predicate=Field("relation"), object=Field("target"))
>>> validate_query_for_schema(schema, "find", entity="Alice")   # OK - by column name
>>> validate_query_for_schema(schema, "find", subject="Alice")  # OK - by slot name
>>> validate_query_for_schema(schema, "at", position=5)         # Raises SchemaError
```

For nested schemas:

```python
>>> schema = Triple(
...     subject=Pair(left=Field("subject_type"), right=Field("subject_name")),
...     predicate=Field("relation"),
...     object=Field("target"),
... )
>>> validate_query_for_schema(schema, "find", subject_type="Person")  # OK - nested field
```
### infer_bundle_schema

Automatically infer a Bundle schema from a DataFrame.

```python
from hybi.compose import infer_bundle_schema
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "department": ["Engineering", "Engineering", "Design"],
    "hire_date": pd.to_datetime(["2020-01-15", "2021-03-20", "2019-08-10"]),
})

schema = infer_bundle_schema(df)
# Inferred encodings:
# - name: SEMANTIC (high-cardinality string)
# - age: NUMERIC (numeric type)
# - department: EXACT (low cardinality - 2 unique values)
# - hire_date: TEMPORAL (datetime type)
```

Inference rules:

| Column Type | Cardinality | Inferred Encoding |
|---|---|---|
| Numeric | - | NUMERIC |
| Datetime | - | TEMPORAL |
| Categorical | - | EXACT |
| Object (string) | < 50 unique, < 10% of rows | EXACT |
| Object (string) | Otherwise | SEMANTIC |
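The rules in the table can be expressed directly in pandas. A sketch using the thresholds shown above (the library's exact cutoffs and tie-breaking may differ):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_datetime64_any_dtype

def infer_encoding(s: pd.Series) -> str:
    """Map a column to an encoding name per the inference-rules table."""
    if is_numeric_dtype(s):
        return "NUMERIC"
    if is_datetime64_any_dtype(s):
        return "TEMPORAL"
    if isinstance(s.dtype, pd.CategoricalDtype):
        return "EXACT"
    # Object/string columns: low cardinality -> EXACT, otherwise SEMANTIC
    n_unique = s.nunique()
    if n_unique < 50 and n_unique < 0.1 * len(s):
        return "EXACT"
    return "SEMANTIC"

df = pd.DataFrame({
    "age": [30, 25, 35],
    "hire_date": pd.to_datetime(["2020-01-15", "2021-03-20", "2019-08-10"]),
    "name": ["Alice", "Bob", "Carol"],
})
print({col: infer_encoding(df[col]) for col in df.columns})
```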
### `hybi.compose.infer_bundle_schema(df)`

Infer a Bundle schema from a DataFrame.

Creates a Bundle schema by inferring appropriate encodings based on column types.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | DataFrame to infer the schema from | *required* |

Returns:

| Type | Description |
|---|---|
| `Bundle` | Bundle schema with inferred field configurations |

Example:

```python
>>> schema = infer_bundle_schema(df)
>>> print(schema.slots())  # Column names
```
### collect_validation_errors

Convenience function that returns None if valid, or a combined error message.

```python
from hybi.compose import collect_validation_errors

error_message = collect_validation_errors(schema, df)
if error_message:
    raise ValueError(f"Schema validation failed: {error_message}")
```

### `hybi.compose.collect_validation_errors(schema, df)`

Validate a schema and return an error message if invalid.

Convenience function that returns None if valid, or a combined error message if validation fails.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `schema` | `BaseMolecule` | Schema to validate | *required* |
| `df` | `DataFrame` | DataFrame to validate against | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[str]` | None if valid, an error message string if invalid |
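The combining step is straightforward to sketch: join the individual results into one string, or return None when there is nothing to report. An illustrative version (not the library's implementation), with results represented as plain `(severity, field, message)` tuples:

```python
from typing import List, Optional, Tuple

def combine_errors(errors: List[Tuple[str, str, str]]) -> Optional[str]:
    """Return None when valid, otherwise one combined message."""
    if not errors:
        return None
    return "; ".join(f"{sev}: {field} - {msg}" for sev, field, msg in errors)

print(combine_errors([]))
print(combine_errors([
    ("error", "person", "missing column"),
    ("warning", "age", "nulls found"),
]))
```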
## Error Handling

Validation functions return ValidationResult objects rather than raising exceptions, allowing you to collect all errors before deciding how to handle them.

```python
errors = validate_schema_against_dataframe(schema, df)

# Separate errors and warnings
critical = [e for e in errors if e.is_error]
warnings = [e for e in errors if e.is_warning]

if critical:
    # Handle blocking errors
    for e in critical:
        print(f"ERROR: {e.field} - {e.message}")
    raise ValueError("Validation failed")

if warnings:
    # Log warnings but proceed
    for w in warnings:
        print(f"WARNING: {w.field} - {w.message}")
```

For validation errors raised as exceptions, see:

- `SchemaError`: query operation not supported
- `SlotError`: invalid field name
- `FieldValidationError`: field validation failed
## Best Practices

### 1. Validate Early

Validate schemas against sample data before ingesting large datasets:

```python
# Validate with a small sample first
sample = df.head(1000)
errors = validate_schema_against_dataframe(schema, sample)
if errors:
    raise ValueError(f"Schema mismatch: {errors}")

# Then ingest the full dataset
hb.ingest(df, collection="data", schema=schema)
```

### 2. Choose an Appropriate Strategy

```python
# Development: use FULL for thoroughness
config = ValidationConfig(strategy=ValidationStrategy.FULL)

# Production with large data: use SAMPLED
config = ValidationConfig(
    strategy=ValidationStrategy.SAMPLED,
    sample_size=10000,  # Larger sample for production
)

# Quick checks: use SCHEMA_ONLY
config = ValidationConfig(strategy=ValidationStrategy.SCHEMA_ONLY)
```