Chunkers¶
A Chunker decomposes a source document into a stream of Chunk objects. The Document compound uses its configured chunker at ingest time to expand a document-level DataFrame into per-chunk Rows.
Protocol¶
Any object with the following method conforms — subclassing is not required:
def chunk(
    document_id: str,
    source: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Iterable[Chunk]: ...
Implementations must yield chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id should be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.
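As an illustration of structural conformance, here is a minimal sketch of a custom chunker that never subclasses anything. `LineChunker` is hypothetical, and `Chunk` and `derive_chunk_id` are local stand-ins that mirror the documented fields so the snippet runs on its own. The empty root content, the `/lN` path naming, and the hex-digest hashing are assumptions, not confirmed library behavior.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class Chunk:
    """Local stand-in mirroring the documented Chunk fields."""
    chunk_id: str
    document_id: str
    path: str
    parent_id: Optional[str]
    sibling_index: int
    depth: int
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def derive_chunk_id(document_id: str, path: str) -> str:
    """Stand-in: assumes UTF-8 encoding and a truncated hex digest."""
    key = f"{document_id}:{path}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


class LineChunker:
    """Hypothetical chunker: one root plus one leaf per non-empty line."""

    def chunk(
        self,
        document_id: str,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> Iterable[Chunk]:
        root_id = derive_chunk_id(document_id, "/")
        # Root first, then leaves in source order, so sibling_index is stable.
        yield Chunk(root_id, document_id, "/", None, 0, 0, "", dict(metadata or {}))
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            path = f"/l{i}"  # assumed naming, chosen for this sketch only
            yield Chunk(
                derive_chunk_id(document_id, path),
                document_id, path, root_id, i, 1, line,
            )


chunks = list(LineChunker().chunk("doc-1", "first line\n\nsecond line\n"))
for c in chunks:
    print(c.path, c.depth, c.sibling_index, repr(c.content))
```

Because conformance is structural, any object exposing a compatible `chunk` method should be usable wherever a chunker is expected.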
Built-in Chunkers¶
FlatChunker¶
Depth-1: one root + N paragraph leaves. Splits on blank lines (\n\n). If window is set, paragraphs longer than window characters are further split into fixed-width slices.
from hybi.compose.chunkers import FlatChunker
chunker = FlatChunker() # pure paragraph split
chunker = FlatChunker(window=2000) # cap each leaf at 2000 chars
Suitable for transcripts, OCR, or any source without meaningful hierarchical structure.
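The splitting rule described above can be sketched in a few lines. This is not the FlatChunker source: `split_flat` is a hypothetical helper, and details such as whitespace stripping and dropping empty paragraphs are assumptions.

```python
from typing import List, Optional


def split_flat(source: str, window: Optional[int] = None) -> List[str]:
    """Return leaf texts: paragraphs, optionally sliced to at most `window` chars."""
    if window is not None and window <= 0:
        raise ValueError("window must be a positive integer")
    # Paragraphs are separated by blank lines; empty pieces are dropped.
    paragraphs = [p.strip() for p in source.split("\n\n") if p.strip()]
    if window is None:
        return paragraphs
    leaves: List[str] = []
    for para in paragraphs:
        # Fixed-width slices; the last slice of a paragraph may be shorter.
        leaves.extend(para[i:i + window] for i in range(0, len(para), window))
    return leaves


text = "short one\n\n" + "x" * 25
print(len(split_flat(text)))                          # 2
print([len(s) for s in split_flat(text, window=10)])  # [9, 10, 10, 5]
```

Fixed-width slicing can cut words in half, which is usually acceptable for OCR or transcript text but is worth keeping in mind when picking a window size.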
MarkdownChunker¶
Heading-depth: ATX headings (# h1, ## h2, ...) become interior nodes; the paragraphs that follow a heading become its leaf children.
Path scheme: slugified heading text joined by /. Paragraphs under a heading get /pN suffixes; top-level paragraphs (before any heading) attach to the root as /pN.
Example: # Intro followed by a paragraph produces:
/intro (heading, depth 1)
/intro/p0 (paragraph, depth 2)
Chunk¶
One decomposed piece of a document. The structural fields below become Row columns on ingest.
| Field | Meaning |
|---|---|
| chunk_id | Stable primary key (sha256(document_id:path)[:16]). |
| document_id | Identifier of the source document. |
| path | Hierarchical address, e.g. / for root, /ch1/sec2/p3. |
| parent_id | chunk_id of the parent, or None for the root. |
| sibling_index | Position among siblings under the same parent (0-based). |
| depth | Distance from the root (root is 0). |
| content | The chunk's text payload; hit by semantic search. |
| metadata | Free-form per-chunk metadata (e.g. section_title). |
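The structural fields are related to each other. As a rough sketch, and assuming paths always look like the examples in the table, the parent path and depth can be read off a path directly. `parent_path` and `depth_of` are hypothetical helpers, not part of the library.

```python
import posixpath
from typing import Optional


def parent_path(path: str) -> Optional[str]:
    """Path of the parent chunk, or None for the root."""
    if path == "/":
        return None
    return posixpath.dirname(path)


def depth_of(path: str) -> int:
    """Distance from the root: the number of path segments."""
    return 0 if path == "/" else path.count("/")


for p in ["/", "/p0", "/ch1/sec2/p3"]:
    print(p, parent_path(p), depth_of(p))
```

The parent's chunk_id would then follow from hashing the parent path with the same document_id.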
hybi.compose.chunkers¶
Chunk dataclass, chunk_id derivation, and the Chunker strategy protocol.
Chunk dataclass¶
One decomposed piece of a document.
chunk_id: stable primary key (derive_chunk_id(document_id, path)).
document_id: identifier of the source document the chunk belongs to.
path: hierarchical address, e.g. "/" for root, "/ch1/sec2/p3".
parent_id: chunk_id of the parent chunk, or None for the root.
sibling_index: position among siblings under the same parent (0-based).
depth: distance from the root (root is 0).
content: the chunk's text payload; hit by semantic search.
metadata: free-form per-chunk metadata (e.g. section_title).
Chunker¶
Bases: Protocol
Strategy for decomposing a source document into chunks.
Implementations must yield Chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id SHOULD be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.
FlatChunker¶
Depth-1 chunker: one root + N paragraph leaves.
Splits on blank lines (\n\n). If window is set, paragraphs
longer than window characters are further split into fixed-width
slices so no leaf exceeds the window. Suitable for transcripts, OCR,
or any source without meaningful hierarchical structure.
MarkdownChunker¶
Heading-depth chunker: headings become interior nodes; the paragraphs that follow a heading become its leaf children.
Path scheme: slugified heading text joined by '/'. Paragraphs under a heading get '/pN' suffixes; top-level paragraphs (before any heading) attach to the root as '/pN'.
Example
# Intro followed by a paragraph produces:
/intro (heading, depth 1)
/intro/p0 (paragraph, depth 2)
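To make the path scheme concrete, the following sketch walks a Markdown string and emits (path, depth) pairs. It is an approximation under stated assumptions, not the MarkdownChunker implementation: `slugify` and `markdown_paths` are hypothetical names, the slug rule (lowercase, hyphen-separated) is a guess, and fenced code blocks are not handled.

```python
import re
from typing import Dict, List, Tuple


def slugify(text: str) -> str:
    # Lowercase and collapse runs of non-alphanumerics into single hyphens.
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def markdown_paths(source: str) -> List[Tuple[str, int]]:
    """Return (path, depth) for each heading and paragraph, in document order."""
    stack: List[str] = []              # slugs of the currently open headings
    para_counts: Dict[str, int] = {}   # parent path -> next paragraph index
    out: List[Tuple[str, int]] = []
    in_paragraph = False
    for line in source.splitlines():
        heading = re.match(r"^(#{1,6})\s+(.+)$", line)
        if heading:
            level = len(heading.group(1))
            del stack[level - 1:]      # close headings at this level or deeper
            stack.append(slugify(heading.group(2)))
            out.append(("/" + "/".join(stack), len(stack)))
            in_paragraph = False
        elif not line.strip():
            in_paragraph = False       # a blank line ends the current paragraph
        elif not in_paragraph:
            parent = "/" + "/".join(stack) if stack else ""
            n = para_counts.get(parent, 0)
            para_counts[parent] = n + 1
            out.append((f"{parent}/p{n}", len(stack) + 1))
            in_paragraph = True
    return out


print(markdown_paths("# Intro\n\nHello world."))
# [('/intro', 1), ('/intro/p0', 2)]
```

Two headings with the same text under the same parent would collide under this naive scheme; how the real chunker disambiguates them is not covered here.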
derive_chunk_id(document_id, path)¶
Deterministic chunk_id: sha256(document_id:path)[:16].
Idempotent — re-ingesting the same document with the same chunker produces identical chunk_ids, enabling dedup and overwrite-by-document semantics.
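A minimal sketch of the derivation, assuming the key is the UTF-8 string document_id:path and that the result is the first 16 characters of the hex digest. Neither detail is spelled out above, so treat this as a local stand-in rather than the library function.

```python
import hashlib


def derive_chunk_id(document_id: str, path: str) -> str:
    """Stand-in: first 16 hex chars of sha256 over 'document_id:path'."""
    key = f"{document_id}:{path}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


first = derive_chunk_id("doc-1", "/intro/p0")
second = derive_chunk_id("doc-1", "/intro/p0")

print(first == second)                                # True: same inputs, same id
print(first == derive_chunk_id("doc-2", "/intro/p0"))  # False: different document
```

Since the id depends only on the document_id and the path, re-running the same chunker over unchanged input reproduces the same keys, which is what makes overwrite-by-document possible.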