Chunkers

A Chunker decomposes a source document into a stream of Chunk objects. The Document compound uses its configured chunker at ingest time to expand a document-level DataFrame into per-chunk Rows.

Protocol

Any object with the following method conforms — subclassing is not required:

def chunk(
    self,
    document_id: str,
    source: str,
    metadata: Optional[Dict[str, Any]] = None,
) -> Iterable[Chunk]: ...

Implementations must yield chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id should be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.
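As an illustration of the protocol, here is a minimal sketch of a conforming chunker. It assumes the Chunk field names listed on this page. The Chunk dataclass and derive_chunk_id below are local stand-ins so the example runs without hybi installed; LineChunker is a hypothetical name, and both the hex encoding of the id and the empty root content are assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class Chunk:
    # Local stand-in mirroring the documented Chunk fields (assumption).
    chunk_id: str
    document_id: str
    path: str
    parent_id: Optional[str]
    sibling_index: int
    depth: int
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def derive_chunk_id(document_id: str, path: str) -> str:
    # Stand-in for the library helper; the hex encoding is an assumption.
    return hashlib.sha256(f"{document_id}:{path}".encode("utf-8")).hexdigest()[:16]


class LineChunker:
    """Hypothetical chunker: one root plus one leaf per non-empty line."""

    def chunk(
        self,
        document_id: str,
        source: str,
        metadata: Optional[Dict[str, Any]] = None,
    ) -> Iterable[Chunk]:
        base = dict(metadata or {})
        root_id = derive_chunk_id(document_id, "/")
        # Root first, then leaves in source order, so sibling_index is stable.
        yield Chunk(root_id, document_id, "/", None, 0, 0, "", base)
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            path = f"/l{i}"
            yield Chunk(
                chunk_id=derive_chunk_id(document_id, path),
                document_id=document_id,
                path=path,
                parent_id=root_id,
                sibling_index=i,
                depth=1,
                content=line,
                metadata=dict(base),
            )


chunks = list(LineChunker().chunk("doc-1", "first line\n\nsecond line"))
```

No base class is involved: the object conforms only because it exposes a chunk method with the expected shape.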

Built-in Chunkers

FlatChunker

Depth-1: one root + N paragraph leaves. Splits on blank lines (\n\n). If window is set, paragraphs longer than window characters are further split into fixed-width slices.

from hybi.compose.chunkers import FlatChunker

chunker = FlatChunker()                 # pure paragraph split
chunker = FlatChunker(window=2000)      # cap each leaf at 2000 chars

Suitable for transcripts, OCR, or any source without meaningful hierarchical structure.
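The splitting rule described above can be sketched as a standalone function. This is not FlatChunker's actual code; split_paragraphs is a hypothetical helper, and it assumes that empty paragraphs are dropped and that surrounding whitespace is trimmed.

```python
from typing import List, Optional


def split_paragraphs(source: str, window: Optional[int] = None) -> List[str]:
    """Split on blank lines, then slice any paragraph longer than `window`."""
    paragraphs = [p.strip() for p in source.split("\n\n") if p.strip()]
    if window is None:
        return paragraphs
    if window <= 0:
        raise ValueError("window must be a positive number of characters")
    leaves: List[str] = []
    for para in paragraphs:
        # Fixed-width slices; a paragraph within the window yields one slice.
        leaves.extend(para[i:i + window] for i in range(0, len(para), window))
    return leaves


text = "short paragraph\n\n" + "x" * 25
unwindowed = split_paragraphs(text)            # 2 leaves
windowed = split_paragraphs(text, window=10)   # 5 leaves, none longer than 10
```

Fixed-width slicing can cut mid-word, which is usually acceptable for transcripts and OCR output where paragraph boundaries are already unreliable.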

MarkdownChunker

Heading-depth: ATX headings (# h1, ## h2, ...) become interior nodes; the paragraphs that follow a heading become its leaf children.

from hybi.compose.chunkers import MarkdownChunker

chunker = MarkdownChunker()

Path scheme: slugified heading text joined by /. Paragraphs under a heading get /pN suffixes; top-level paragraphs (before any heading) attach to the root as /pN.

Example — # Intro followed by a paragraph produces:

/intro            (heading, depth 1)
/intro/p0         (paragraph, depth 2)

Chunk

One decomposed piece of a document. The structural fields below become Row columns on ingest.

Field          Meaning
-------------  -------------------------------------------------------
chunk_id       Stable primary key (sha256(document_id:path)[:16]).
document_id    Identifier of the source document.
path           Hierarchical address, e.g. / for root, /ch1/sec2/p3.
parent_id      chunk_id of the parent, or None for the root.
sibling_index  Position among siblings under the same parent (0-based).
depth          Distance from the root (root is 0).
content        The chunk's text payload; hit by semantic search.
metadata       Free-form per-chunk metadata (e.g. section_title).

hybi.compose.chunkers

Chunk dataclass, chunk_id derivation, and the Chunker strategy protocol.

Chunk dataclass

One decomposed piece of a document.

chunk_id: stable primary key (derive_chunk_id(document_id, path)).
document_id: identifier of the source document the chunk belongs to.
path: hierarchical address, e.g. "/" for root, "/ch1/sec2/p3".
parent_id: chunk_id of the parent chunk, or None for the root.
sibling_index: position among siblings under the same parent (0-based).
depth: distance from the root (root is 0).
content: the chunk's text payload; hit by semantic search.
metadata: free-form per-chunk metadata (e.g. section_title).

Chunker

Bases: Protocol

Strategy for decomposing a source document into chunks.

Implementations must yield Chunks in a consistent traversal order so that sibling_index values are meaningful. chunk_id SHOULD be derived via derive_chunk_id(document_id, path) unless the chunker has a compelling reason to override.

FlatChunker

Depth-1 chunker: one root + N paragraph leaves.

Splits on blank lines (\n\n). If window is set, paragraphs longer than window characters are further split into fixed-width slices so no leaf exceeds the window. Suitable for transcripts, OCR, or any source without meaningful hierarchical structure.

MarkdownChunker

Heading-depth chunker: headings become interior nodes; the paragraphs that follow a heading become its leaf children.

Path scheme: slugified heading text joined by '/'. Paragraphs under a heading get '/pN' suffixes; top-level paragraphs (before any heading) attach to the root as '/pN'.

Example

# Intro followed by a paragraph produces:

/intro            (heading, depth 1)
/intro/p0         (paragraph, depth 2)

derive_chunk_id(document_id, path)

Deterministic chunk_id: sha256(document_id:path)[:16].

Idempotent — re-ingesting the same document with the same chunker produces identical chunk_ids, enabling dedup and overwrite-by-document semantics.
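The overwrite-by-document behaviour can be illustrated with a small sketch. The derive_chunk_id below is a local stand-in that assumes UTF-8 input and a hex digest; the dict-backed store and the ingest function are hypothetical and only model a table keyed by chunk_id.

```python
import hashlib
from typing import Dict, List, Tuple


def derive_chunk_id(document_id: str, path: str) -> str:
    # Assumed encoding: UTF-8 input, hex digest, first 16 characters.
    return hashlib.sha256(f"{document_id}:{path}".encode("utf-8")).hexdigest()[:16]


def ingest(store: Dict[str, str], document_id: str,
           chunks: List[Tuple[str, str]]) -> None:
    """Upsert (path, content) pairs into a dict keyed by chunk_id."""
    for path, content in chunks:
        store[derive_chunk_id(document_id, path)] = content


store: Dict[str, str] = {}
ingest(store, "doc-1", [("/", ""), ("/p0", "old text")])
# Re-ingesting the same document yields the same ids, so rows are overwritten.
ingest(store, "doc-1", [("/", ""), ("/p0", "new text")])
print(len(store))  # 2
```

Because the id depends only on document_id and path, editing a chunk's content keeps its key, while a change in the chunker's path scheme produces new keys.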