Chunking API

AI4RAGChunk

AI4RAGChunk dataclass

AI4RAGChunk(text: str, metadata: dict[str, Any] = dict())

Framework-agnostic chunk representation used across the ai4rag pipeline.

Parameters:

  • text (str) –

    The textual content of the chunk.

  • metadata (dict[str, Any], default: dict() ) –

    Chunk metadata. Expected keys include document_id and sequence_number; additional keys (headings, provenance) are chunker-dependent.
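
The signature above shows `dict()` as the metadata default. A minimal sketch of an equivalent dataclass follows; `ChunkSketch` is a hypothetical stand-in, and the use of `field(default_factory=dict)` is an assumption about how the real class produces that default, not something the reference confirms.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkSketch:
    """Hypothetical stand-in mirroring the documented AI4RAGChunk fields."""

    text: str
    # Assumption: a default_factory gives every instance its own empty dict,
    # so metadata written to one chunk never leaks into another.
    metadata: dict[str, Any] = field(default_factory=dict)


first = ChunkSketch(text="Introduction to chunking.")
second = ChunkSketch(
    text="Tables are kept intact.",
    metadata={"document_id": "report", "sequence_number": 2},
)

# Mutating one chunk's metadata leaves freshly created chunks untouched.
first.metadata["document_id"] = "report"
fresh = ChunkSketch(text="No metadata supplied.")
```

The two keys used here, document_id and sequence_number, are the ones the parameter description lists as expected.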

BaseChunker

Bases: ABC

Abstract base class for the document-splitting step of the RAG application.

All chunkers accept DoclingDocument as input and produce AI4RAGChunk as output.

Functions

split_documents abstractmethod

split_documents(documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]

Split a series of documents into smaller parts based on the provided chunker settings.

Parameters:

  • documents (Sequence[DoclingDocument]) –

    Parsed docling documents to chunk.

Returns:

  • list[AI4RAGChunk]

    List of chunks produced from the input documents.

Source code in ai4rag/rag/chunking/base_chunker.py
@abstractmethod
def split_documents(self, documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]:
    """
    Split a series of documents into smaller parts based on
    the provided chunker settings.

    Parameters
    ----------
    documents : Sequence[DoclingDocument]
        Parsed docling documents to chunk.

    Returns
    -------
    list[AI4RAGChunk]
        List of chunks produced from the input documents.
    """

to_dict abstractmethod

to_dict() -> dict[str, Any]

Return a dictionary that can be used to recreate an instance of the BaseChunker.

Source code in ai4rag/rag/chunking/base_chunker.py
@abstractmethod
def to_dict(self) -> dict[str, Any]:
    """Return dictionary that can be used to recreate an instance of the BaseChunker."""

from_dict abstractmethod classmethod

from_dict(d: dict[str, Any]) -> BaseChunker

Create an instance from the dictionary.

Source code in ai4rag/rag/chunking/base_chunker.py
@classmethod
@abstractmethod
def from_dict(cls, d: dict[str, Any]) -> "BaseChunker":
    """Create an instance from the dictionary."""

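The three abstract methods above define the contract a concrete chunker has to satisfy. The sketch below illustrates that contract with hypothetical stand-ins (`ChunkSketch`, `ChunkerSketch`, `ParagraphChunker`); it does not import ai4rag, and the toy chunker takes plain strings rather than DoclingDocument objects to stay self-contained.

```python
from abc import ABC, abstractmethod
from collections.abc import Sequence
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkSketch:
    """Hypothetical stand-in for AI4RAGChunk."""

    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


class ChunkerSketch(ABC):
    """Hypothetical stand-in for the BaseChunker interface."""

    @abstractmethod
    def split_documents(self, documents: Sequence[Any]) -> list[ChunkSketch]: ...

    @abstractmethod
    def to_dict(self) -> dict[str, Any]: ...

    @classmethod
    @abstractmethod
    def from_dict(cls, d: dict[str, Any]) -> "ChunkerSketch": ...


class ParagraphChunker(ChunkerSketch):
    """Toy chunker that splits plain strings on a separator."""

    def __init__(self, separator: str = "\n\n") -> None:
        self.separator = separator

    def split_documents(self, documents: Sequence[str]) -> list[ChunkSketch]:
        chunks: list[ChunkSketch] = []
        for doc_index, doc in enumerate(documents):
            parts = [p for p in doc.split(self.separator) if p.strip()]
            # Sequence numbers start at 1 and restart for every document.
            for seq_num, part in enumerate(parts, start=1):
                metadata = {"document_id": f"doc-{doc_index}", "sequence_number": seq_num}
                chunks.append(ChunkSketch(text=part, metadata=metadata))
        return chunks

    def to_dict(self) -> dict[str, Any]:
        return {"separator": self.separator}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "ParagraphChunker":
        return cls(**d)


# Round-trip the settings, then chunk with the recreated instance.
chunker = ParagraphChunker.from_dict(ParagraphChunker().to_dict())
chunks = chunker.split_documents(["First paragraph.\n\nSecond paragraph."])
```

Keeping `to_dict` limited to constructor arguments is what lets `from_dict` be a plain `cls(**d)`, the same pattern the concrete chunkers on this page use.
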
DoclingChunker

DoclingChunker(
    max_tokens: int = 8192,
    contextualize: bool = True,
    tokenizer: BaseTokenizer | None = None,
    merge_peers: bool = True
)

Bases: BaseChunker

Structure-aware, token-aware chunker wrapping docling's HybridChunker.

Operates directly on DoclingDocument objects, preserving document hierarchy (headings, tables, figures) during chunking. Chunks are bounded by a token limit aligned to the embedding model.

Parameters:

  • max_tokens (int, default: 8192 ) –

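    Maximum number of tokens per chunk. The constructor passes this value only to the default tokenizer it builds; when a custom tokenizer is supplied, max_tokens is stored but not applied to that tokenizer.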

  • contextualize (bool, default: True ) –

    When True, each chunk's text is enriched with its heading hierarchy via HybridChunker.contextualize. This improves embedding quality at the cost of increased token usage.

  • tokenizer (BaseTokenizer | None, default: None ) –

    Tokenizer for token counting and split-point decisions. When None, an OpenAITokenizer is built from a tiktoken encoding (cl100k_base), which requires no model download.

  • merge_peers (bool, default: True ) –

    Merge adjacent undersized chunks that share the same heading and caption context.

Source code in ai4rag/rag/chunking/docling_chunker.py
def __init__(
    self,
    max_tokens: int = 8192,
    contextualize: bool = True,
    tokenizer: BaseTokenizer | None = None,
    merge_peers: bool = True,
) -> None:
    self.max_tokens = max_tokens
    self.contextualize = contextualize
    self.merge_peers = merge_peers

    if tokenizer is None:
        encoding = tiktoken.encoding_for_model(_DEFAULT_TIKTOKEN_MODEL)
        tokenizer = OpenAITokenizer(tokenizer=encoding, max_tokens=max_tokens)

    self._tokenizer = tokenizer
    self._chunker = HybridChunker(
        tokenizer=tokenizer,
        merge_peers=merge_peers,
    )

Functions

split_documents

split_documents(documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]

Split docling documents into token-bounded chunks.

Parameters:

  • documents (Sequence[DoclingDocument]) –

    Parsed documents to chunk.

Returns:

  • list[AI4RAGChunk]

    Chunks with document_id and sequence_number metadata, plus headings when the source chunk has any.

Source code in ai4rag/rag/chunking/docling_chunker.py
def split_documents(self, documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]:
    """
    Split docling documents into token-bounded chunks.

    Parameters
    ----------
    documents : Sequence[DoclingDocument]
        Parsed documents to chunk.

    Returns
    -------
    list[AI4RAGChunk]
        Chunks with ``document_id``, ``sequence_number``, and
        optional heading/provenance metadata.
    """
    all_chunks: list[AI4RAGChunk] = []

    for doc in documents:
        doc_id = doc.name or hashlib.sha256(str(doc).encode()).hexdigest()[:16]
        seq_num = 0

        for chunk in self._chunker.chunk(doc):
            seq_num += 1

            text = self._chunker.contextualize(chunk) if self.contextualize else chunk.text

            metadata: dict[str, Any] = {
                "document_id": doc_id,
                "sequence_number": seq_num,
            }

            if chunk.meta.headings:
                metadata["headings"] = chunk.meta.headings

            all_chunks.append(AI4RAGChunk(text=text, metadata=metadata))

    return all_chunks
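
The loop above derives document_id from the document name and falls back to a truncated SHA-256 digest when the name is empty; sequence numbers restart at 1 for every document. A minimal sketch of just that bookkeeping follows. `DocSketch`, `derive_document_id`, and `number_chunks` are hypothetical names, and the pre-split text pieces stand in for the output of the real chunker.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DocSketch:
    """Hypothetical stand-in exposing only the name attribute used above."""

    name: str
    body: str


def derive_document_id(doc: DocSketch) -> str:
    # Same expression as in split_documents: prefer the name, otherwise use
    # the first 16 hex characters of the SHA-256 of the document's string form.
    return doc.name or hashlib.sha256(str(doc).encode()).hexdigest()[:16]


def number_chunks(docs_with_pieces: list[tuple[DocSketch, list[str]]]) -> list[dict]:
    rows = []
    for doc, pieces in docs_with_pieces:
        doc_id = derive_document_id(doc)
        # enumerate(start=1) matches the seq_num counter, which resets per document.
        for seq_num, piece in enumerate(pieces, start=1):
            rows.append({"document_id": doc_id, "sequence_number": seq_num, "text": piece})
    return rows


named = DocSketch(name="annual-report", body="...")
unnamed = DocSketch(name="", body="...")
rows = number_chunks([(named, ["intro", "results"]), (unnamed, ["appendix"])])
```

Because the fallback hashes `str(doc)`, two unnamed documents with identical string forms receive the same identifier.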

to_dict

to_dict() -> dict[str, Any]

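Return a dictionary that can be used to recreate an instance. Only max_tokens, contextualize, and merge_peers are included; the tokenizer is not serialized, so an instance recreated with from_dict uses the default tokenizer.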

Source code in ai4rag/rag/chunking/docling_chunker.py
def to_dict(self) -> dict[str, Any]:
    """Return dictionary that can be used to recreate an instance."""
    return {
        "max_tokens": self.max_tokens,
        "contextualize": self.contextualize,
        "merge_peers": self.merge_peers,
    }

from_dict classmethod

from_dict(d: dict[str, Any]) -> DoclingChunker

Create an instance from the dictionary.

Source code in ai4rag/rag/chunking/docling_chunker.py
@classmethod
def from_dict(cls, d: dict[str, Any]) -> "DoclingChunker":
    """Create an instance from the dictionary."""
    return cls(**d)

LangChainChunker

LangChainChunker(
    method: Literal["recursive", "character", "token"] = "recursive",
    chunk_size: int = 2048,
    chunk_overlap: int = 256,
    **kwargs: Any
)

Bases: BaseChunker

Wrapper for LangChain TextSplitter operating on DoclingDocument input.

Converts each DoclingDocument to markdown internally, applies token-based splitting via tiktoken, and returns AI4RAGChunk objects.

Parameters:

  • method (Literal['recursive', 'character', 'token'], default: "recursive" ) –

    Type of LangChain TextSplitter used to perform the chunking.

  • chunk_size (int, default: 2048 ) –

    Maximum number of tokens per chunk.

  • chunk_overlap (int, default: 256 ) –

    Overlap in tokens between chunks.

Other Parameters:

  • separators (list[str]) –

    Separators at which text is split. Defaults to ["\n\n", r"(?<=\. )", "\n", " ", ""].

Source code in ai4rag/rag/chunking/langchain_chunker.py
def __init__(
    self,
    method: Literal["recursive", "character", "token"] = "recursive",
    chunk_size: int = 2048,
    chunk_overlap: int = 256,
    **kwargs: Any,
) -> None:
    self.method = method
    self.chunk_size = chunk_size
    self.chunk_overlap = chunk_overlap
    self.separators = kwargs.pop("separators", ["\n\n", r"(?<=\. )", "\n", " ", ""])
    self._encoding = tiktoken.encoding_for_model(_DEFAULT_TIKTOKEN_MODEL)
    self._text_splitter = self._get_text_splitter()
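
The second default separator, `(?<=\. )`, is a zero-width lookbehind rather than a literal string. The sketch below shows what that pattern matches using only the standard `re` module. It assumes the underlying splitter is configured to treat separators as regular expressions, which `_get_text_splitter` (not shown on this page) would have to arrange.

```python
import re

# Default separators copied from the constructor above.
DEFAULT_SEPARATORS = ["\n\n", r"(?<=\. )", "\n", " ", ""]

text = "Chunking splits text. Each chunk is embedded. Retrieval follows"

# The lookbehind matches the empty position right after ". ", so the period
# and the trailing space stay attached to the sentence they end.
# Splitting on zero-width matches requires Python 3.7 or later.
sentences = re.split(DEFAULT_SEPARATORS[1], text)

# Nothing is consumed by the split, so the pieces rejoin to the original text.
rejoined = "".join(sentences)
```

Here `sentences` holds three pieces, the first two ending in a period followed by a space.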

Functions

to_dict

to_dict() -> dict[str, Any]

Return a dictionary that can be used to recreate an instance of the LangChainChunker. Only method, chunk_size, and chunk_overlap are included; custom separators are not serialized.

Source code in ai4rag/rag/chunking/langchain_chunker.py
def to_dict(self) -> dict[str, Any]:
    """
    Return dictionary that can be used to recreate an instance of the LangChainChunker.
    """
    params = (
        "method",
        "chunk_size",
        "chunk_overlap",
    )

    ret = {k: v for k, v in self.__dict__.items() if k in params}

    return ret

from_dict classmethod

from_dict(d: dict[str, Any]) -> LangChainChunker

Create an instance from the dictionary.

Source code in ai4rag/rag/chunking/langchain_chunker.py
@classmethod
def from_dict(cls, d: dict[str, Any]) -> "LangChainChunker":
    """Create an instance from the dictionary."""

    return cls(**d)
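
One consequence of the `to_dict` filter shown above is that a round trip through `from_dict` resets separators to the default. The sketch below reproduces the pattern on a hypothetical stand-in class (`ConfigSketch`) with a shortened default separator list; it does not construct a real text splitter.

```python
from typing import Any

FALLBACK_SEPARATORS = ["\n\n", "\n", " ", ""]  # shortened list, for illustration only


class ConfigSketch:
    """Hypothetical stand-in reproducing the to_dict/from_dict pattern."""

    def __init__(
        self,
        method: str = "recursive",
        chunk_size: int = 2048,
        chunk_overlap: int = 256,
        **kwargs: Any,
    ) -> None:
        self.method = method
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = kwargs.pop("separators", list(FALLBACK_SEPARATORS))

    def to_dict(self) -> dict[str, Any]:
        # Only the three named attributes survive serialization.
        params = ("method", "chunk_size", "chunk_overlap")
        return {k: v for k, v in self.__dict__.items() if k in params}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "ConfigSketch":
        return cls(**d)


original = ConfigSketch(method="token", chunk_size=512, chunk_overlap=64, separators=["\n"])
restored = ConfigSketch.from_dict(original.to_dict())
```

Callers that rely on custom separators would need to pass them again after recreating the instance, for example by adding a separators key to the dictionary before calling `from_dict`.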

split_documents

split_documents(documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]

Split docling documents into smaller chunks using token-based splitting.

Each DoclingDocument is first exported to markdown, then split using the configured TextSplitter. Results are returned as AI4RAGChunk.

Parameters:

  • documents (Sequence[DoclingDocument]) –

    Parsed docling documents to chunk.

Returns:

  • list[AI4RAGChunk]

    Chunks with document_id, sequence_number, and start_index metadata.

Source code in ai4rag/rag/chunking/langchain_chunker.py
def split_documents(self, documents: Sequence[DoclingDocument]) -> list[AI4RAGChunk]:
    """
    Split docling documents into smaller chunks using token-based splitting.

    Each ``DoclingDocument`` is first exported to markdown, then split using
    the configured ``TextSplitter``. Results are returned as ``AI4RAGChunk``.

    Parameters
    ----------
    documents : Sequence[DoclingDocument]
        Parsed docling documents to chunk.

    Returns
    -------
    list[AI4RAGChunk]
        Chunks with ``document_id``, ``sequence_number``, and ``start_index`` metadata.
    """
    lc_docs = self._docling_to_langchain(documents)
    self._set_document_id_in_metadata_if_missing(lc_docs)
    chunks = self._text_splitter.split_documents(lc_docs)
    sorted_chunks = self._set_sequence_number_in_metadata(chunks)
    return [AI4RAGChunk(text=chunk.page_content, metadata=chunk.metadata) for chunk in sorted_chunks]
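
To make the interaction of chunk_size and chunk_overlap concrete, here is a simplified sliding-window sketch over a list of token ids. It is an illustration only: `window_tokens` is a hypothetical helper, integers stand in for tiktoken output, and the "recursive" and "character" methods split on separators first, so their chunk boundaries will generally differ from these fixed windows.

```python
def window_tokens(tokens: list[int], chunk_size: int, chunk_overlap: int) -> list[list[int]]:
    """Cut tokens into windows of at most chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    windows: list[list[int]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return windows


# Ten tokens, windows of four, one token shared between neighbours.
windows = window_tokens(list(range(10)), chunk_size=4, chunk_overlap=1)
# windows == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the documented defaults (chunk_size=2048, chunk_overlap=256), each window in this scheme would advance by 1792 tokens.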