Chunking API
AI4RAGChunk dataclass
Framework-agnostic chunk representation used across the ai4rag pipeline.
Parameters:
- text (str) – The textual content of the chunk.
- metadata (dict[str, Any], default: dict()) – Chunk metadata. Expected keys include document_id and sequence_number; additional keys (headings, provenance) are chunker-dependent.
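To make the shape of a chunk concrete, here is a minimal sketch of a dataclass with the same two fields. `ChunkSketch` is a hypothetical stand-in, not the library's class, and the example values are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ChunkSketch:
    """Hypothetical stand-in mirroring the documented AI4RAGChunk fields."""

    text: str
    # default_factory gives every instance its own dict instead of a shared one.
    metadata: dict[str, Any] = field(default_factory=dict)


# A chunk carrying the two metadata keys the documentation says to expect.
chunk = ChunkSketch(
    text="Retrieval-augmented generation combines search with generation.",
    metadata={"document_id": "doc-001", "sequence_number": 0},
)

# Omitting metadata falls back to an empty dict.
bare = ChunkSketch(text="No metadata supplied.")
```

Consumers that need heading or provenance keys should read them with `metadata.get(...)`, since those keys depend on the chunker that produced the chunk.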
BaseChunker
Bases: ABC
Abstract interface for splitting documents into chunks in the RAG application.
All chunkers accept DoclingDocument as input and produce AI4RAGChunk as output.
Functions
split_documents abstractmethod
Split a sequence of documents into smaller parts based on the chunker's settings.
Parameters:
- documents (Sequence[DoclingDocument]) – Parsed docling documents to chunk.
Returns:
- list[AI4RAGChunk] – List of chunks produced from the input documents.
Source code in ai4rag/rag/chunking/base_chunker.py
to_dict abstractmethod
from_dict abstractmethod classmethod
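The three abstract methods above form the contract every chunker implements. The sketch below reproduces that contract with hypothetical names (`ChunkerSketch`, `ParagraphChunker`) and uses plain strings in place of DoclingDocument and AI4RAGChunk, so it runs without the library installed. It illustrates the pattern only and is not the library's code.

```python
from abc import ABC, abstractmethod
from typing import Any, Sequence


class ChunkerSketch(ABC):
    """Hypothetical stand-in for the documented BaseChunker contract."""

    @abstractmethod
    def split_documents(self, documents: Sequence[str]) -> list[str]:
        """Split documents into chunks (strings stand in for both types)."""

    @abstractmethod
    def to_dict(self) -> dict[str, Any]:
        """Return the settings needed to recreate the instance."""

    @classmethod
    @abstractmethod
    def from_dict(cls, d: dict[str, Any]) -> "ChunkerSketch":
        """Recreate an instance from a settings dictionary."""


class ParagraphChunker(ChunkerSketch):
    """Toy subclass: one chunk per blank-line-separated paragraph."""

    def __init__(self, min_chars: int = 1) -> None:
        self.min_chars = min_chars

    def split_documents(self, documents: Sequence[str]) -> list[str]:
        chunks = []
        for doc in documents:
            for para in doc.split("\n\n"):
                para = para.strip()
                # Drop paragraphs shorter than the configured minimum.
                if len(para) >= self.min_chars:
                    chunks.append(para)
        return chunks

    def to_dict(self) -> dict[str, Any]:
        return {"min_chars": self.min_chars}

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "ParagraphChunker":
        return cls(**d)


chunker = ParagraphChunker(min_chars=5)
parts = chunker.split_documents(["First paragraph.\n\nok\n\nSecond paragraph."])
```

Because all three methods are abstract, instantiating `ChunkerSketch` directly raises `TypeError`; only subclasses that implement every method can be created.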
DoclingChunker
DoclingChunker(
max_tokens: int = 8192,
contextualize: bool = True,
tokenizer: BaseTokenizer | None = None,
merge_peers: bool = True
)
Bases: BaseChunker
Structure-aware, token-aware chunker wrapping docling's HybridChunker.
Operates directly on DoclingDocument objects, preserving document hierarchy (headings, tables, figures) during chunking. Chunks are bounded by a token limit aligned to the embedding model.
Parameters:
- max_tokens (int, default: 8192) – Maximum number of tokens per chunk.
- contextualize (bool, default: True) – When True, each chunk's text is enriched with its heading hierarchy via HybridChunker.contextualize. This improves embedding quality at the cost of increased token usage.
- tokenizer (BaseTokenizer | None, default: None) – Tokenizer for token counting and split-point decisions. When None, defaults to OpenAI tiktoken (cl100k_base, zero model downloads).
- merge_peers (bool, default: True) – Merge adjacent undersized chunks that share the same heading and caption context.
Source code in ai4rag/rag/chunking/docling_chunker.py
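The effect of `merge_peers` and `contextualize` can be illustrated with a toy model. The sketch below is not docling's implementation: `Piece`, `n_tokens`, and both helper functions are hypothetical, whitespace-separated words stand in for real tokens, and caption context is ignored. It only shows the idea that neighbours under the same headings are merged while they fit the token budget, and that prefixing headings makes each chunk longer.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical pre-chunk: text plus the heading path it sits under."""

    headings: tuple[str, ...]
    text: str


def n_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokens.
    return len(text.split())


def merge_peers(pieces: list[Piece], max_tokens: int) -> list[Piece]:
    """Merge neighbours with identical headings while the result still fits."""
    merged: list[Piece] = []
    for piece in pieces:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.headings == piece.headings
            and n_tokens(last.text) + n_tokens(piece.text) <= max_tokens
        ):
            merged[-1] = Piece(last.headings, last.text + " " + piece.text)
        else:
            merged.append(piece)
    return merged


def contextualize(piece: Piece) -> str:
    """Prefix the chunk text with its heading hierarchy, one heading per line."""
    return "\n".join([*piece.headings, piece.text])


pieces = [
    Piece(("Guide", "Setup"), "Install the package."),
    Piece(("Guide", "Setup"), "Then set the API key."),
    Piece(("Guide", "Usage"), "Call the client."),
]
merged = merge_peers(pieces, max_tokens=10)
texts = [contextualize(p) for p in merged]
```

In this toy run the two "Setup" pieces merge into one chunk, and the contextualized text of that chunk carries two extra heading tokens. That is the token cost the `contextualize` description refers to.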
Functions
split_documents
Split docling documents into token-bounded chunks.
Parameters:
- documents (Sequence[DoclingDocument]) – Parsed documents to chunk.
Returns:
- list[AI4RAGChunk] – Chunks with document_id, sequence_number, and optional heading/provenance metadata.
Source code in ai4rag/rag/chunking/docling_chunker.py
to_dict
Return a dictionary that can be used to recreate an instance.
from_dict classmethod
LangChainChunker
LangChainChunker(
method: Literal["recursive", "character", "token"] = "recursive",
chunk_size: int = 2048,
chunk_overlap: int = 256,
**kwargs: Any
)
Bases: BaseChunker
Wrapper for LangChain TextSplitter operating on DoclingDocument input.
Converts each DoclingDocument to markdown internally, applies token-based splitting via tiktoken, and returns AI4RAGChunk objects.
Parameters:
- method (Literal['recursive', 'character', 'token'], default: "recursive") – Selects which type of LangChain TextSplitter performs the chunking.
- chunk_size (int, default: 2048) – Maximum number of tokens per chunk.
- chunk_overlap (int, default: 256) – Overlap in tokens between consecutive chunks.
Other Parameters:
- separators (list[str]) – Separator strings on which the text is split.
Source code in ai4rag/rag/chunking/langchain_chunker.py
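How `chunk_size` and `chunk_overlap` interact is easiest to see on a plain sliding window. The sketch below is a simplified illustration, not the LangChain or ai4rag implementation: `window_tokens` is a hypothetical helper, and single letters stand in for tokens. The separator-aware methods additionally try to cut at separator boundaries, which this sketch does not model.

```python
def window_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


tokens = "a b c d e f g h i j".split()
windows = window_tokens(tokens, chunk_size=4, chunk_overlap=1)
# Each window repeats the final token of the previous one:
# [a b c d], [d e f g], [g h i j]
```

With the documented defaults (2048 and 256), each new chunk would start 1792 tokens after the previous one under this simplified model.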
Functions
to_dict
Return a dictionary that can be used to recreate an instance of the LangChainChunker.
Source code in ai4rag/rag/chunking/langchain_chunker.py
from_dict classmethod
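A `to_dict` / `from_dict` pair is typically used to persist a chunker's settings and rebuild it later. The round trip below is a sketch under assumptions: `SplitterConfigSketch` is a hypothetical class that only borrows the documented constructor parameters, and the exact keys the real `to_dict` returns are not stated in this reference.

```python
from typing import Any


class SplitterConfigSketch:
    """Hypothetical stand-in using the documented LangChainChunker parameters."""

    def __init__(
        self,
        method: str = "recursive",
        chunk_size: int = 2048,
        chunk_overlap: int = 256,
        **kwargs: Any,
    ) -> None:
        self.method = method
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.kwargs = kwargs  # extra options such as separators

    def to_dict(self) -> dict[str, Any]:
        # Flatten the extra keyword arguments so from_dict can pass them back.
        return {
            "method": self.method,
            "chunk_size": self.chunk_size,
            "chunk_overlap": self.chunk_overlap,
            **self.kwargs,
        }

    @classmethod
    def from_dict(cls, d: dict[str, Any]) -> "SplitterConfigSketch":
        return cls(**d)


original = SplitterConfigSketch(
    method="token", chunk_size=512, chunk_overlap=64, separators=["\n\n", "\n"]
)
restored = SplitterConfigSketch.from_dict(original.to_dict())
```

The property worth checking in any such pair is that the restored instance serializes to the same dictionary as the original.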
split_documents
Split docling documents into smaller chunks using token-based splitting.
Each DoclingDocument is first exported to markdown, then split using the configured TextSplitter. Results are returned as AI4RAGChunk.
Parameters:
- documents (Sequence[DoclingDocument]) – Parsed docling documents to chunk.
Returns:
- list[AI4RAGChunk] – Chunks with document_id, sequence_number, and start_index metadata.
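The three metadata keys can be pictured as follows. This is a sketch under stated assumptions: `chunks_with_metadata` is a hypothetical helper, `start_index` is taken to be the character offset of the chunk within the exported markdown (the convention of LangChain's `add_start_index` option), and `sequence_number` is assumed to count from 0 within a document. The reference above does not confirm either detail.

```python
from typing import Any


def chunks_with_metadata(
    document_id: str, markdown: str, pieces: list[str]
) -> list[dict[str, Any]]:
    """Attach document_id, sequence_number and start_index to each piece."""
    records = []
    search_from = 0
    for sequence_number, piece in enumerate(pieces):
        # Search from the previous match so repeated text maps to later offsets.
        start_index = markdown.find(piece, search_from)
        records.append(
            {
                "text": piece,
                "metadata": {
                    "document_id": document_id,
                    "sequence_number": sequence_number,
                    "start_index": start_index,
                },
            }
        )
        if start_index != -1:
            search_from = start_index + 1
    return records


markdown = "# Title\n\nAlpha beta.\n\nGamma delta."
records = chunks_with_metadata(
    "doc-001", markdown, ["# Title", "Alpha beta.", "Gamma delta."]
)
```

Under these assumptions, `start_index` lets a retrieved chunk be located again in the markdown export, for example to show surrounding context.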