Chunkers¶

A Chunker decomposes a source document into a stream of Chunk objects. The Document compound uses its configured chunker at ingest time to expand a document-level DataFrame into per-chunk Rows.

Protocol¶

Any object with the following method conforms — subclassing is not required:

def chunk(
    document_id: str,
    source: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Iterable[Chunk]: ...

Implementations must yield chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id should be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.
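As a rough illustration of what conformance might look like, here is a minimal sketch. It defines a stand-in Chunk dataclass and derive_chunk_id locally so that it runs without hybi installed; in real use both would come from hybi.compose.chunkers, and the real Chunk constructor may differ. The LineChunker name, its /lN path scheme, and the empty root content are hypothetical choices made for this example only.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class Chunk:
    """Stand-in mirroring the documented Chunk fields (assumption)."""
    chunk_id: str
    document_id: str
    path: str
    parent_id: Optional[str]
    sibling_index: int
    depth: int
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def derive_chunk_id(document_id: str, path: str) -> str:
    """Stand-in for the documented sha256(document_id:path)[:16] scheme.

    UTF-8 encoding and hex output are assumptions.
    """
    key = f"{document_id}:{path}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


class LineChunker:
    """Hypothetical chunker: one root plus one leaf per non-empty line."""

    def chunk(
        self,
        document_id: str,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> Iterable[Chunk]:
        base = dict(metadata or {})
        root_id = derive_chunk_id(document_id, "/")
        # Root first, then leaves in source order, so sibling_index is stable.
        yield Chunk(root_id, document_id, "/", None, 0, 0, "", dict(base))
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            path = f"/l{i}"
            yield Chunk(
                derive_chunk_id(document_id, path),
                document_id, path, root_id, i, 1, line, dict(base),
            )


chunks = list(LineChunker().chunk("doc-1", "first line\n\nsecond line"))
```

Because the protocol is structural, LineChunker does not inherit from anything; it only needs a chunk method with a compatible signature.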

Built-in Chunkers¶

FlatChunker¶

A depth-1 chunker: one root plus N paragraph leaves. It splits the source on blank lines (\n\n). If window is set, paragraphs longer than window characters are further split into fixed-width slices so that no leaf exceeds the window.

from hybi.compose.chunkers import FlatChunker

chunker = FlatChunker()                 # pure paragraph split
chunker = FlatChunker(window=2000)      # cap each leaf at 2000 chars

Suitable for transcripts, OCR, or any source without meaningful hierarchical structure.
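The splitting rule can be sketched as a standalone function. This is not the library's implementation, only an approximation of the behavior described above; the split_flat name is hypothetical, and stripping whitespace and skipping empty pieces are assumptions.

```python
from typing import List, Optional


def split_flat(source: str, window: Optional[int] = None) -> List[str]:
    """Split on blank lines, then slice long paragraphs into fixed-width pieces."""
    if window is not None and window <= 0:
        raise ValueError("window must be a positive integer")
    leaves: List[str] = []
    for paragraph in source.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty pieces produced by runs of blank lines
        if window is None or len(paragraph) <= window:
            leaves.append(paragraph)
        else:
            # Fixed-width slices; the last slice may be shorter than window.
            leaves.extend(
                paragraph[i:i + window]
                for i in range(0, len(paragraph), window)
            )
    return leaves


# A 9-character paragraph fits the window; a 25-character one is cut into 10 + 10 + 5.
leaves = split_flat("short one\n\n" + "x" * 25, window=10)
```

Fixed-width slicing can cut through the middle of a word or sentence, which is usually acceptable for the unstructured sources this chunker targets.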

MarkdownChunker¶

A heading-depth chunker: ATX headings (# h1, ## h2, ...) become interior nodes, and the paragraphs that follow a heading become its leaf children.

from hybi.compose.chunkers import MarkdownChunker

chunker = MarkdownChunker()

Path scheme: slugified heading text joined by /. Paragraphs under a heading get /pN suffixes; top-level paragraphs (before any heading) attach to the root as /pN.

Example — # Intro followed by a paragraph produces:

/intro            (heading, depth 1)
/intro/p0         (paragraph, depth 2)
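To make the path scheme concrete, the following sketch derives paths and depths from a Markdown string. It is an approximation under stated assumptions, not the MarkdownChunker implementation: the slugify rule (lowercase, runs of non-alphanumerics replaced by a hyphen), the markdown_paths name, and the per-parent paragraph counter are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def slugify(text: str) -> str:
    """Assumed slug rule: lowercase, runs of non-alphanumerics become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def markdown_paths(source: str) -> List[Tuple[str, str, int]]:
    """Return (path, kind, depth) for each heading and paragraph, in order."""
    out: List[Tuple[str, str, int]] = []
    stack: List[Tuple[int, str]] = []      # (heading level, slug) of open headings
    para_counts: Dict[str, int] = {"": 0}  # paragraphs seen under each parent path
    in_paragraph = False

    def heading_path() -> str:
        return "".join(f"/{slug}" for _, slug in stack)

    for line in source.splitlines():
        match = re.match(r"(#{1,6})\s+(.+)", line)
        if match:
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, slugify(match.group(2))))
            para_counts.setdefault(heading_path(), 0)
            out.append((heading_path(), "heading", len(stack)))
            in_paragraph = False
        elif not line.strip():
            in_paragraph = False  # a blank line ends the current paragraph
        elif not in_paragraph:
            parent = heading_path()  # "" when no heading is open (root)
            index = para_counts[parent]
            para_counts[parent] = index + 1
            out.append((f"{parent}/p{index}", "paragraph", len(stack) + 1))
            in_paragraph = True
    return out


paths = markdown_paths("# Intro\n\nSome text.")
```

Note that this sketch does nothing about two headings that slugify to the same text under one parent; how the real chunker disambiguates such collisions is not covered here.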

Chunk¶

One decomposed piece of a document. The structural fields below become Row columns on ingest.

| Field | Meaning |
| --- | --- |
| `chunk_id` | Stable primary key (`sha256(document_id:path)[:16]`). |
| `document_id` | Identifier of the source document. |
| `path` | Hierarchical address, e.g. `/` for root, `/ch1/sec2/p3`. |
| `parent_id` | `chunk_id` of the parent, or `None` for the root. |
| `sibling_index` | Position among siblings under the same parent (0-based). |
| `depth` | Distance from the root (root is 0). |
| `content` | The chunk's text payload; hit by semantic search. |
| `metadata` | Free-form per-chunk metadata (e.g. `section_title`). |
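The structural fields are related: in the examples on this page, depth equals the number of segments in path, and the parent's path is the path with its last segment removed. The sketch below relies on that observation, which the documentation does not state as a guarantee; the path_info helper is hypothetical.

```python
from typing import Optional, Tuple


def path_info(path: str) -> Tuple[int, Optional[str]]:
    """Return (depth, parent_path) for a chunk path; the root '/' has no parent."""
    segments = [s for s in path.split("/") if s]
    depth = len(segments)
    if depth == 0:
        return 0, None
    # Dropping the last segment gives the parent; an empty remainder is the root.
    parent = "/" + "/".join(segments[:-1])
    return depth, parent


root = path_info("/")
leaf = path_info("/ch1/sec2/p3")
top = path_info("/intro")
```

Under the same assumption, parent_id could be obtained by passing the parent path to derive_chunk_id.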

`hybi.compose.chunkers` ¶

Chunk dataclass, chunk_id derivation, and the Chunker strategy protocol.

`Chunk` `dataclass` ¶

One decomposed piece of a document.

- chunk_id: stable primary key (derive_chunk_id(document_id, path)).
- document_id: identifier of the source document the chunk belongs to.
- path: hierarchical address, e.g. "/" for root, "/ch1/sec2/p3".
- parent_id: chunk_id of the parent chunk, or None for the root.
- sibling_index: position among siblings under the same parent (0-based).
- depth: distance from the root (root is 0).
- content: the chunk's text payload; hit by semantic search.
- metadata: free-form per-chunk metadata (e.g. section_title).

`Chunker` ¶

Bases: Protocol

Strategy for decomposing a source document into chunks.

Implementations must yield Chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id SHOULD be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.

`FlatChunker` ¶

Depth-1 chunker: one root + N paragraph leaves.

Splits on blank lines (\n\n). If window is set, paragraphs longer than window characters are further split into fixed-width slices so no leaf exceeds the window. Suitable for transcripts, OCR, or any source without meaningful hierarchical structure.

`MarkdownChunker` ¶

Heading-depth chunker: headings become interior nodes; the paragraphs that follow a heading become its leaf children.

Path scheme: slugified heading text joined by '/'. Paragraphs under a heading get '/pN' suffixes; top-level paragraphs (before any heading) attach to the root as '/pN'.

Example

# Intro followed by a paragraph produces:

/intro            (heading, depth 1)
/intro/p0         (paragraph, depth 2)

`derive_chunk_id(document_id, path)` ¶

Deterministic chunk_id: sha256(document_id:path)[:16].

Because the ID depends only on document_id and path, re-ingesting the same document with the same chunker produces identical chunk_ids. This makes ingestion idempotent and enables dedup and overwrite-by-document semantics.
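A minimal sketch of the documented scheme, assuming UTF-8 encoding of the document_id:path string and a hexadecimal digest truncated to 16 characters; the library's actual encoding may differ, and the derive_chunk_id_sketch name is hypothetical.

```python
import hashlib


def derive_chunk_id_sketch(document_id: str, path: str) -> str:
    """Approximate sha256(document_id:path)[:16] with assumed UTF-8 and hex output."""
    key = f"{document_id}:{path}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


# The same inputs always give the same ID; a different path gives a different ID.
first = derive_chunk_id_sketch("doc-1", "/intro/p0")
second = derive_chunk_id_sketch("doc-1", "/intro/p0")
other = derive_chunk_id_sketch("doc-1", "/intro/p1")
```

Since the ID is a pure function of its two inputs, a store keyed on chunk_id can overwrite rows in place when a document is ingested again.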
