Validation

Schema validation utilities for ensuring data matches your compose schemas.

Overview

The validation module provides tools to:

  • Verify DataFrame columns match schema fields
  • Check encoding-specific type constraints
  • Infer schemas from DataFrames automatically
  • Validate queries against schema capabilities

Quick Start

from hybi.compose import Triple, Field, validate_schema_against_dataframe
import pandas as pd

# Define schema
schema = Triple(
    subject=Field("person"),
    predicate=Field("relation"),
    object=Field("company"),
)

# Check if DataFrame matches
df = pd.DataFrame({
    "person": ["Alice", "Bob"],
    "relation": ["works_at", "works_at"],
    "company": ["Acme", "Globex"],
})

errors = validate_schema_against_dataframe(schema, df)
if errors:
    for e in errors:
        print(f"{e.severity}: {e.field} - {e.message}")
else:
    print("Validation passed!")

ValidationResult

Result object returned by validation functions.

from hybi.compose import ValidationResult

hybi.compose.ValidationResult dataclass

Result of schema validation.

field instance-attribute

Field name that was validated.

message instance-attribute

Description of the issue or success.

severity = 'error' class-attribute instance-attribute

Severity level: 'error' or 'warning'.

is_error property

True if severity is 'error'.

is_warning property

True if severity is 'warning'.
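
A minimal sketch of the helper properties, constructing results by hand purely for illustration (normally you receive them from the validators):

r = ValidationResult(field="person", message="column missing from DataFrame")
print(r.is_error)    # True - severity defaults to 'error'

w = ValidationResult(field="age", message="nulls found", severity="warning")
print(w.is_warning)  # True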


ValidationStrategy

Control how thoroughly validation checks data.

from hybi.compose import ValidationStrategy

Strategy      Description                      Speed
FULL          Check every row                  Slowest, most thorough
SAMPLED       Check random sample (default)    Good balance
SCHEMA_ONLY   Only check column existence      Fastest

hybi.compose.ValidationStrategy

Bases: Enum

Strategy for validating schema against DataFrame.

Different strategies trade off thoroughness vs performance.

FULL = 'full' class-attribute instance-attribute

Check every row. Most thorough but slowest.

SAMPLED = 'sampled' class-attribute instance-attribute

Check a random sample of rows. Good balance of speed and coverage.

SCHEMA_ONLY = 'schema_only' class-attribute instance-attribute

Only check column existence. Fastest but least thorough.
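
Since ValidationStrategy is a standard Enum with string values, a strategy can also be looked up from a plain configuration string (a small sketch assuming ordinary Enum semantics):

strategy = ValidationStrategy("schema_only")  # lookup by value
assert strategy is ValidationStrategy.SCHEMA_ONLY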


ValidationConfig

Configuration for validation behavior.

from hybi.compose import ValidationConfig, ValidationStrategy

# Fast validation for large datasets
config = ValidationConfig(
    strategy=ValidationStrategy.SAMPLED,
    sample_size=1000,
)

# Thorough validation for critical data
config = ValidationConfig(
    strategy=ValidationStrategy.FULL,
)

hybi.compose.ValidationConfig dataclass

Configuration for schema validation.

Example

>>> config = ValidationConfig(
...     strategy=ValidationStrategy.SAMPLED,
...     sample_size=1000,
... )
>>> errors = validate_schema_against_dataframe(schema, df, config=config)

strategy = ValidationStrategy.SAMPLED class-attribute instance-attribute

Validation strategy to use. Default: SAMPLED for good balance.

sample_size = 1000 class-attribute instance-attribute

Number of rows to sample for SAMPLED strategy. Default: 1000.

sample_seed = 42 class-attribute instance-attribute

Random seed for reproducible sampling. Default: 42.
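
Because sampling is seeded, repeated runs over the same DataFrame check the same rows. A short sketch (the parameter values here are arbitrary):

config = ValidationConfig(
    strategy=ValidationStrategy.SAMPLED,
    sample_size=500,
    sample_seed=7,
)

# Two runs with the same config validate the identical sample
errors_a = validate_schema_against_dataframe(schema, df, config=config)
errors_b = validate_schema_against_dataframe(schema, df, config=config)
assert errors_a == errors_b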


Validation Functions

validate_schema_against_dataframe

Main validation function that checks schema compatibility.

from hybi.compose import (
    Triple, Field, Encoding,
    validate_schema_against_dataframe,
    ValidationConfig,
    ValidationStrategy,
)

schema = Triple(
    subject=Field("entity", encoding=Encoding.SEMANTIC),
    predicate=Field("relation", encoding=Encoding.EXACT),
    object=Field("target", encoding=Encoding.SEMANTIC),
)

# Basic validation (uses SAMPLED strategy by default)
errors = validate_schema_against_dataframe(schema, df)

# Fast validation for large datasets
errors = validate_schema_against_dataframe(
    schema, df,
    config=ValidationConfig(strategy=ValidationStrategy.SCHEMA_ONLY),
)

# Thorough validation
errors = validate_schema_against_dataframe(
    schema, df,
    config=ValidationConfig(strategy=ValidationStrategy.FULL),
)

Checks performed:

  1. Column existence: All schema field names exist as DataFrame columns
  2. Null values: Required fields have no null values (SAMPLED/FULL)
  3. Type constraints: NUMERIC fields are numeric, TEMPORAL fields are datetime (SAMPLED/FULL)
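
To see these checks fire, here is a sketch with a deliberately broken DataFrame (the exact messages are illustrative):

bad_df = pd.DataFrame({
    "entity": ["Alice", None],            # null in a required field
    "relation": ["works_at", "works_at"],
    # "target" column is missing entirely
})

for e in validate_schema_against_dataframe(schema, bad_df):
    print(f"{e.severity}: {e.field} - {e.message}")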

hybi.compose.validate_schema_against_dataframe(schema, df, strict=True, config=None)

Validate that a schema matches a DataFrame.

Checks:

  • All schema field names exist as DataFrame columns
  • Required fields have no null values (if strategy allows)
  • Encoding-specific type constraints (if strategy allows)

Parameters:

  schema (BaseMolecule, required)
      The molecule schema to validate.
  df (DataFrame, required)
      DataFrame to validate against.
  strict (bool, default True)
      If True (default), all errors are blocking. If False, collect all
      errors without stopping.
  config (Optional[ValidationConfig], default None)
      Validation configuration. Default uses SAMPLED strategy with 1000
      rows for good balance of speed and coverage.

Returns:

  List[ValidationResult]
      List of ValidationResult objects (empty if valid).

Example

>>> schema = Triple(
...     subject=Field("person"),
...     predicate=Field("relation"),
...     object=Field("company"),
... )
>>> errors = validate_schema_against_dataframe(schema, df)
>>> if errors:
...     for e in errors:
...         print(f"{e.severity}: {e.message}")


validate_query_for_schema

Validate that a query is valid for a given schema type.

from hybi.compose import Triple, Field, validate_query_for_schema

schema = Triple(
    subject=Field("entity"),
    predicate=Field("relation"),
    object=Field("target"),
)

# Valid queries
validate_query_for_schema(schema, "find", entity="Alice")  # OK
validate_query_for_schema(schema, "search", top_k=10)       # OK

# Invalid query method
validate_query_for_schema(schema, "at", position=5)
# Raises SchemaError: 'at' not supported by Triple

hybi.compose.validate_query_for_schema(schema, query_method, **kwargs)

Validate that a query is valid for a schema.

Checks:

  • Query method is supported by the schema type
  • Field names in kwargs are valid for the schema (including nested fields)

Parameters:

  schema (BaseMolecule, required)
      The molecule schema.
  query_method (str, required)
      Name of the query method (e.g., 'find', 'traverse').
  **kwargs (Any, default {})
      Query parameters (field values, options).

Raises:

  SchemaError
      If query method is not supported by schema.
  SlotError
      If query uses invalid field names.
  AmbiguousFieldError
      If field name appears in multiple nested paths.

Example

>>> schema = Triple(subject=Field("entity"), predicate=Field("relation"), object=Field("target"))
>>> validate_query_for_schema(schema, "find", entity="Alice")   # OK - by column name
>>> validate_query_for_schema(schema, "find", subject="Alice")  # OK - by slot name
>>> validate_query_for_schema(schema, "at", position=5)         # Raises SchemaError

For nested schemas

>>> schema = Triple(
...     subject=Pair(left=Field("subject_type"), right=Field("subject_name")),
...     predicate=Field("relation"),
...     object=Field("target"),
... )
>>> validate_query_for_schema(schema, "find", subject_type="Person")  # OK - nested field
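
Because this validator raises rather than returning results, guard call sites with try/except. A sketch, assuming SchemaError is importable from hybi.compose like the other names on this page:

from hybi.compose import SchemaError

try:
    validate_query_for_schema(schema, "at", position=5)
except SchemaError as err:
    print(f"Unsupported query method: {err}")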


infer_bundle_schema

Automatically infer a Bundle schema from a DataFrame.

from hybi.compose import infer_bundle_schema
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "department": ["Engineering", "Engineering", "Design"],
    "hire_date": pd.to_datetime(["2020-01-15", "2021-03-20", "2019-08-10"]),
})

schema = infer_bundle_schema(df)

# Inferred encodings:
# - name: SEMANTIC (high cardinality string)
# - age: NUMERIC (numeric type)
# - department: EXACT (low cardinality - 2 unique values)
# - hire_date: TEMPORAL (datetime type)

Inference rules:

Column Type      Cardinality                  Inferred Encoding
Numeric          -                            NUMERIC
Datetime         -                            TEMPORAL
Categorical      -                            EXACT
Object (string)  < 50 unique, < 10% of rows   EXACT
Object (string)  Otherwise                    SEMANTIC
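
A sketch of the string-cardinality rule (thresholds per the table above):

codes = pd.DataFrame({"status": ["open", "closed"] * 500})             # 2 unique in 1000 rows
names = pd.DataFrame({"title": [f"Report {i}" for i in range(1000)]})  # all unique

# infer_bundle_schema(codes) -> 'status' inferred as EXACT
# infer_bundle_schema(names) -> 'title' inferred as SEMANTIC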

hybi.compose.infer_bundle_schema(df)

Infer a Bundle schema from a DataFrame.

Creates a Bundle schema by inferring appropriate encodings based on column types.

Parameters:

  df (DataFrame, required)
      DataFrame to infer schema from.

Returns:

  Bundle
      Bundle schema with inferred field configurations.

Example

>>> schema = infer_bundle_schema(df)
>>> print(schema.slots())  # Column names


collect_validation_errors

Convenience function that returns None if valid, or a combined error message if validation fails.

from hybi.compose import collect_validation_errors

error_message = collect_validation_errors(schema, df)
if error_message:
    raise ValueError(f"Schema validation failed: {error_message}")

hybi.compose.collect_validation_errors(schema, df)

Validate schema and return error message if invalid.

Convenience function that returns None if valid, or a combined error message if validation fails.

Parameters:

  schema (BaseMolecule, required)
      Schema to validate.
  df (DataFrame, required)
      DataFrame to validate against.

Returns:

  Optional[str]
      None if valid, error message string if invalid.


Error Handling

Validation functions return ValidationResult objects rather than raising exceptions, allowing you to collect all errors before deciding how to handle them.

errors = validate_schema_against_dataframe(schema, df)

# Separate errors and warnings
critical = [e for e in errors if e.is_error]
warnings = [e for e in errors if e.is_warning]

if critical:
    # Handle blocking errors
    for e in critical:
        print(f"ERROR: {e.field} - {e.message}")
    raise ValueError("Validation failed")

if warnings:
    # Log warnings but proceed
    for w in warnings:
        print(f"WARNING: {w.field} - {w.message}")

For validation errors raised as exceptions (SchemaError, SlotError, AmbiguousFieldError), see validate_query_for_schema above.


Best Practices

1. Validate Early

Validate schemas against sample data before ingesting large datasets:

# Validate with small sample first
sample = df.head(1000)
errors = validate_schema_against_dataframe(schema, sample)
if errors:
    raise ValueError(f"Schema mismatch: {errors}")

# Then ingest full dataset
hb.ingest(df, collection="data", schema=schema)

2. Choose Appropriate Strategy

# Development: Use FULL for thoroughness
config = ValidationConfig(strategy=ValidationStrategy.FULL)

# Production with large data: Use SAMPLED
config = ValidationConfig(
    strategy=ValidationStrategy.SAMPLED,
    sample_size=10000,  # Larger sample for production
)

# Quick checks: Use SCHEMA_ONLY
config = ValidationConfig(strategy=ValidationStrategy.SCHEMA_ONLY)

3. Use Inferred Schemas as Starting Point

# Start with inferred schema
schema = infer_bundle_schema(df)

# Then customize as needed
# (convert to explicit schema definition for production)
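
The inferred schema makes a good audit trail. A sketch of that workflow (only slots(), shown in the example above, is assumed here):

inferred = infer_bundle_schema(df)
print(inferred.slots())  # column names, e.g. ['name', 'age', 'department', 'hire_date']

# Review each inferred encoding, then write the schema out explicitly with
# Field(..., encoding=...) so behavior doesn't drift as the data changes.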