Chunking Strategies¶
Overview¶
Chunking is the process of intelligently splitting documents into optimal pieces for LLM processing. Docling Graph uses structure-aware chunking that preserves document semantics, tables, and hierarchies.
In this guide:
- Why chunking matters
- Structure-aware vs naive chunking
- Real tokenizer integration
- Token management with safety margins
- Schema-aware chunking
- Provider-specific optimization
- Performance tuning
New: Real Tokenizer Integration
Docling Graph now uses real tokenizers for accurate token counting instead of character-based heuristics. This prevents context window overflows and enables more efficient chunk packing with a 20% safety margin.
Why Chunking Matters¶
The Context Window Problem¶
LLMs have limited context windows:
| Provider | Model | Context Limit |
|---|---|---|
| OpenAI | GPT-4 Turbo | 128K tokens |
| Mistral | Mistral Large | 32K tokens |
| Ollama | Llama 3.1 8B | 8K tokens |
| IBM | Granite 4.0 | 8K tokens |
Problem: Most documents exceed these limits.
Solution: Intelligent chunking.
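You can verify the problem for a given document by counting its tokens against a model's limit. A minimal sketch using tiktoken (the file name and the 8K limit are illustrative assumptions):
import tiktoken
# Count tokens with a real tokenizer; cl100k_base approximates GPT-4-family models
document_text = open("document.md").read()
encoding = tiktoken.get_encoding("cl100k_base")
token_count = len(encoding.encode(document_text))
context_limit = 8000  # e.g. a typical local model setup
if token_count > context_limit:
    print(f"{token_count} tokens exceed the {context_limit}-token limit: chunking required")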
Chunking Approaches¶
❌ Naive Chunking¶
# ❌ Bad - Breaks tables and structure
def naive_chunk(text, max_chars=1000):
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
Problems:
- Breaks tables mid-row
- Splits lists
- Ignores semantic boundaries
- Loses context
✅ Structure-Aware Chunking¶
# ✅ Good - Preserves structure
from docling_graph.core.extractors import DocumentChunker
chunker = DocumentChunker(
provider="mistral",
max_tokens=4096
)
chunks = chunker.chunk_document(document)
Benefits:
- Preserves tables
- Keeps lists intact
- Respects sections
- Maintains context
DocumentChunker¶
Basic Usage¶
from docling_graph.core.extractors import DocumentChunker, DocumentProcessor
# Initialize processor
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
# Initialize chunker
chunker = DocumentChunker(
provider="mistral",
max_tokens=4096
)
# Chunk document
chunks = chunker.chunk_document(document)
print(f"Created {len(chunks)} chunks")
Configuration Options¶
By Provider¶
# Automatic configuration for provider
chunker = DocumentChunker(
provider="mistral", # Auto-configures for Mistral
merge_peers=True
)
Supported providers:
- mistral - Mistral AI models
- openai - OpenAI models
- ollama - Ollama local models
- watsonx - IBM watsonx models
- google - Google Gemini models
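Conceptually, the provider name resolves to a tokenizer and a default token budget. The sketch below illustrates the idea using the tokenizers and chunk sizes documented later in this guide; it is an assumption for illustration, not DocumentChunker's actual internals:
# Illustrative provider -> (tokenizer, default max_tokens) mapping.
# Not the library's real table; entries are assumptions.
PROVIDER_DEFAULTS = {
    "mistral": ("mistralai/Mistral-7B-Instruct-v0.2", 4096),
    "openai": ("tiktoken (GPT-4)", 8000),
    "ollama": (None, 3500),  # model-specific; may fall back to a heuristic
    "watsonx": ("Granite-specific", 3500),
    "google": ("Gemini-specific", 6000),
}

def defaults_for(provider: str) -> tuple:
    """Return the (tokenizer, max_tokens) defaults for a provider."""
    return PROVIDER_DEFAULTS[provider]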
Custom Tokenizer¶
# Use specific tokenizer
chunker = DocumentChunker(
tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
max_tokens=4096,
merge_peers=True
)
Custom Max Tokens¶
# Override max tokens
chunker = DocumentChunker(
provider="mistral",
max_tokens=8000, # Custom limit
merge_peers=True
)
Structure Preservation¶
What Gets Preserved?¶
The HybridChunker preserves:
- Tables - Never split across chunks
- Lists - Kept intact
- Sections - With headers
- Hierarchies - Parent-child relationships
- Semantic boundaries - Natural breaks
Example: Table Preservation¶
Input document:
# Sales Report
| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| A | 10 | 15 | 20 | 25 |
| B | 5 | 10 | 15 | 20 |
Chunking result:
# ✅ Table stays together in one chunk
chunks = [
"# Sales Report\n\n| Product | Q1 | Q2 | Q3 | Q4 |\n..."
]
Context Enrichment¶
What is Context Enrichment?¶
Chunks are contextualized with metadata:
- Section headers
- Parent sections
- Document structure
- Page numbers
Example¶
Original text (illustrative):
Payment is due within 30 days of invoice date.
Contextualized chunk:
Contract Agreement
Payment Terms
Payment is due within 30 days of invoice date.
(The exact enrichment format depends on the chunker configuration.)
Why it matters: LLM understands context better.
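If you drop down to docling's HybridChunker directly, this enrichment is exposed through its contextualize method, which prepends heading metadata to the chunk text. A short sketch, assuming document is the DoclingDocument from an earlier conversion step:
from docling.chunking import HybridChunker

hybrid = HybridChunker()
for chunk in hybrid.chunk(dl_doc=document):
    enriched = hybrid.contextualize(chunk=chunk)  # heading path + chunk text
    print(enriched[:120])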
Real Tokenizer Integration¶
Accurate Token Counting¶
Docling Graph uses real tokenizers instead of character-based heuristics:
from docling_graph.core.extractors import DocumentChunker
# ✅ Good - Real tokenizer (accurate)
chunker = DocumentChunker(
provider="mistral",
max_tokens=4096
)
# Uses Mistral's actual tokenizer for precise counting
# ❌ Old approach - Character heuristic (inaccurate)
# estimated_tokens = len(text) / 4 # Rough approximation
Benefits:
| Feature | Character Heuristic | Real Tokenizer |
|---|---|---|
| Accuracy | ~70% | 95%+ |
| Context Overflows | Occasional | Rare |
| Chunk Efficiency | 60-70% | 80-90% |
| Provider-Specific | No | Yes |
How It Works¶
# Behind the scenes (illustrative):
from transformers import AutoTokenizer

text = "..."  # candidate chunk text
max_tokens = 4096

# 1. Load the provider-specific tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# 2. Count tokens accurately
tokens = tokenizer.encode(text)
token_count = len(tokens)

# 3. Apply the 20% safety margin
safe_limit = int(max_tokens * 0.8)

# 4. Chunk based on the actual token count
if token_count > safe_limit:
    ...  # split into multiple chunks
Safety Margins¶
Why Safety Margins?¶
Even with real tokenizers, we apply a 20% safety margin:
# Example: Model with 8192 token context
max_tokens = 8192
# Effective limit with 20% safety margin
safe_limit = int(max_tokens * 0.8) # 6553 tokens
# Why?
# - Schema takes tokens (~500-2000)
# - System prompts take tokens (~200-500)
# - Response buffer needed (~500-1000)
# - Edge cases and variations
Safety Margin Breakdown:
| Component | Token Usage | Example (8K context) |
|---|---|---|
| Document chunk | 80% | 6553 tokens |
| Schema | 10-15% | 819-1228 tokens |
| System prompt | 3-5% | 245-409 tokens |
| Response buffer | 5-10% | 409-819 tokens |
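Putting the table together, the budget for an 8K-context model is straightforward arithmetic (the shares are the illustrative figures from above):
# Token budget for an 8K-context model, using the shares above
context_limit = 8192

chunk_budget = int(context_limit * 0.80)  # 6553 tokens for document content
schema_budget = int(context_limit * 0.15)  # up to ~1228 tokens for the schema
prompt_budget = int(context_limit * 0.05)  # up to ~409 tokens for the system prompt
response_budget = int(context_limit * 0.10)  # up to ~819 tokens reserved for output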
Configuring Safety Margins¶
# Default: 20% safety margin (recommended)
chunker = DocumentChunker(
provider="mistral",
max_tokens=4096 # Effective: ~3276 tokens per chunk
)
# For aggressive batching (not recommended):
# Modify ChunkBatcher.batch_chunks merge_threshold
# But this increases risk of context overflows
Token Management¶
Token Counting with Statistics¶
# Get detailed token statistics
max_tokens = 4096  # the limit passed to the chunker
chunks, stats = chunker.chunk_document_with_stats(document)
print(f"Total chunks: {stats['total_chunks']}")
print(f"Average tokens: {stats['avg_tokens']:.0f}")
print(f"Max tokens: {stats['max_tokens_in_chunk']}")
print(f"Total tokens: {stats['total_tokens']}")
print(f"Safety margin: {(1 - stats['max_tokens_in_chunk'] / max_tokens) * 100:.1f}%")
Output (illustrative):
Total chunks: 12
Average tokens: 2980
Max tokens: 3841
Total tokens: 35760
Safety margin: 6.2%
Monitor Safety Margins
If max_tokens_in_chunk is above 95% of max_tokens, consider:
- Reducing the max_tokens parameter
- Increasing schema efficiency
- Splitting large tables
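A concrete guard along these lines: re-chunk with a tighter limit whenever the largest chunk crowds the cap. A sketch using only the APIs shown above:
max_tokens = 4096
chunker = DocumentChunker(provider="mistral", max_tokens=max_tokens)
chunks, stats = chunker.chunk_document_with_stats(document)

if stats["max_tokens_in_chunk"] > max_tokens * 0.95:
    # Rebuild with extra headroom and chunk again
    chunker = DocumentChunker(provider="mistral", max_tokens=int(max_tokens * 0.75))
    chunks = chunker.chunk_document(document)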
Schema-Aware Chunking¶
Dynamic Adjustment Based on Schema¶
Chunk size automatically adjusts based on schema complexity:
import json
from my_templates import ComplexTemplate
# Schema-aware chunking
chunker = DocumentChunker(
provider="mistral",
schema_json=json.dumps(ComplexTemplate.model_json_schema())
)
# Behind the scenes:
# 1. Build prompt skeleton with schema JSON and empty content
# 2. Count exact tokens for system + user prompt
# 3. max_tokens = context_limit - static_overhead - reserved_output - safety_margin
# 4. Chunk with the adjusted limit
chunks = chunker.chunk_document(document)
Schema Size Impact:
Chunk size is computed from exact prompt token counts, so larger schemas reduce available content tokens deterministically without heuristic ratios.
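In other words, the effective limit is a subtraction, not a ratio. A hedged reconstruction of the arithmetic (variable names and figures are illustrative, not the library's internals):
# Illustrative schema-aware budget computation
context_limit = 32000  # e.g. Mistral Large
schema_tokens = 1500  # exact token count of the schema JSON
prompt_tokens = 400  # exact token count of the prompt skeleton
reserved_output = 1000  # buffer for the model's response
safety_margin = int(context_limit * 0.20)

max_tokens = context_limit - schema_tokens - prompt_tokens - reserved_output - safety_margin
# 32000 - 1500 - 400 - 1000 - 6400 = 22700 tokens of document content per chunk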
Update Schema Configuration¶
# Update schema JSON after initialization
chunker = DocumentChunker(provider="mistral")
# Later, update for different template
from my_templates import LargeTemplate
import json
chunker.update_schema_config(
schema_json=json.dumps(LargeTemplate.model_json_schema())
)
# Chunker now uses adjusted limits
chunks = chunker.chunk_document(document)
Schema Optimization
To maximize chunk size:
- Keep schemas focused and minimal
- Use field descriptions sparingly
- Avoid deeply nested structures
- Consider splitting large schemas
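To see what a schema actually costs, you can tokenize its JSON before chunking. A sketch using a HuggingFace tokenizer (ComplexTemplate is a placeholder for your own Pydantic template):
import json
from transformers import AutoTokenizer
from my_templates import ComplexTemplate  # placeholder for your own template

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
schema_json = json.dumps(ComplexTemplate.model_json_schema())
schema_tokens = len(tokenizer.encode(schema_json))
print(f"Schema costs {schema_tokens} tokens of every request")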
Merge Peers Option¶
What is Merge Peers?¶
Merge peers combines sibling sections when they fit together:
# Enable merge peers (default)
chunker = DocumentChunker(
provider="mistral",
merge_peers=True # Combine related sections
)
Example¶
Without merge_peers: three short sibling subsections become three separate chunks, one per subsection.
With merge_peers: the same three subsections are combined into a single chunk, as long as they fit within the token limit.
Benefit: Fewer chunks, better context.
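To quantify the effect on your own documents, chunk the same document twice and compare counts (a sketch, assuming document from the earlier conversion step):
merged = DocumentChunker(provider="mistral", merge_peers=True).chunk_document(document)
unmerged = DocumentChunker(provider="mistral", merge_peers=False).chunk_document(document)
print(f"merge_peers=True: {len(merged)} chunks, merge_peers=False: {len(unmerged)} chunks")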
Integration with Pipeline¶
Automatic Chunking¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
use_chunking=True # Automatic chunking (default)
)
run_pipeline(config)
Disable Chunking¶
config = PipelineConfig(
source="small_document.pdf",
template="templates.BillingDocument",
use_chunking=False # Process full document
)
Complete Examples¶
📍 Basic Chunking¶
from docling_graph.core.extractors import DocumentChunker, DocumentProcessor
# Convert document
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
# Chunk with Mistral settings
chunker = DocumentChunker(provider="mistral")
chunks = chunker.chunk_document(document)
print(f"Created {len(chunks)} chunks")
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i}: {len(chunk)} characters")
📍 With Statistics¶
from docling_graph.core.extractors import DocumentChunker, DocumentProcessor
# Convert and chunk
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("large_document.pdf")
# Get detailed statistics
chunker = DocumentChunker(provider="openai", max_tokens=8000)
chunks, stats = chunker.chunk_document_with_stats(document)
print(f"Chunking Statistics:")
print(f" Total chunks: {stats['total_chunks']}")
print(f" Average tokens: {stats['avg_tokens']:.0f}")
print(f" Max tokens: {stats['max_tokens_in_chunk']}")
print(f" Total tokens: {stats['total_tokens']}")
# Check if any chunk exceeds limit
if stats['max_tokens_in_chunk'] > 8000:
    print("Warning: Some chunks exceed token limit!")
📍 Custom Configuration¶
import json

from docling_graph.core.extractors import DocumentChunker, DocumentProcessor
from my_templates import ContractTemplate

# Custom chunker for specific use case
chunker = DocumentChunker(
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
    max_tokens=6000,  # Conservative limit
    merge_peers=True,
    schema_json=json.dumps(ContractTemplate.model_json_schema()),
)
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("contract.pdf")
chunks = chunker.chunk_document(document)
print(f"Created {len(chunks)} optimized chunks")
📍 Fallback Text Chunking¶
from docling_graph.core.extractors import DocumentChunker
# For raw text (when DoclingDocument unavailable)
chunker = DocumentChunker(provider="mistral")
raw_text = """
Long text content that needs to be chunked...
"""
chunks = chunker.chunk_text_fallback(raw_text)
print(f"Created {len(chunks)} text chunks")
Provider-Specific Optimization¶
Mistral AI¶
- Context limit: 32K tokens
- Recommended chunk size: 4096 tokens (with 20% safety margin)
- Effective chunk size: ~3276 tokens
- Tokenizer: Mistral-7B-Instruct-v0.2 (real tokenizer)
OpenAI¶
- Context limit: 128K tokens
- Recommended chunk size: 8000 tokens (with 20% safety margin)
- Effective chunk size: ~6400 tokens
- Tokenizer: tiktoken for GPT-4 (real tokenizer)
Ollama (Local)¶
- Context limit: 8K tokens (typical)
- Recommended chunk size: 3500 tokens (with 20% safety margin)
- Effective chunk size: ~2800 tokens
- Tokenizer: model-specific (real tokenizer when available)
Ollama Tokenizer Fallback
If a model-specific tokenizer is unavailable, the chunker falls back to the character heuristic with an extra safety margin (75% of the limit instead of 80%).
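The fallback behaves roughly like the old character heuristic with a tighter cap, along these lines (a sketch of the idea, not the library's exact code):
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token
    return len(text) // 4

chunk_text = "Lorem ipsum " * 1000  # stand-in for a candidate chunk
max_tokens = 3500
fallback_limit = int(max_tokens * 0.75)  # 2625 tokens instead of 2800
if estimate_tokens(chunk_text) > fallback_limit:
    print("Chunk too large under the heuristic: split further")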
IBM watsonx¶
- Context limit: 8K tokens
- Recommended chunk size: 3500 tokens (with 20% safety margin)
- Effective chunk size: ~2800 tokens
- Tokenizer: Granite-specific (real tokenizer)
Google Gemini¶
- Context limit: 32K-128K tokens (model-dependent)
- Recommended chunk size: 6000 tokens (with 20% safety margin)
- Effective chunk size: ~4800 tokens
- Tokenizer: Gemini-specific (real tokenizer)
Performance Tuning¶
Chunk Size vs Accuracy¶
| Chunk Size | Accuracy | Speed | Memory |
|---|---|---|---|
| Small (2K) | Lower | Fast | Low |
| Medium (4K) | Good | Medium | Medium |
| Large (8K) | Best | Slow | High |
Recommendations¶
# ✅ Good - Balance accuracy and speed
chunker = DocumentChunker(
provider="mistral",
max_tokens=4096 # Sweet spot
)
Troubleshooting¶
🐛 Chunks Too Large¶
Solution:
# Reduce max_tokens
chunker = DocumentChunker(
provider="mistral",
max_tokens=3000 # Smaller chunks
)
🐛 Too Many Chunks¶
Solution:
# Increase max_tokens and enable merge_peers
chunker = DocumentChunker(
provider="openai",
max_tokens=8000, # Larger chunks
merge_peers=True # Combine sections
)
🐛 Tables Split Across Chunks¶
Solution:
# This shouldn't happen with HybridChunker
# If it does, increase max_tokens
chunker = DocumentChunker(
provider="mistral",
max_tokens=6000 # Larger to fit tables
)
🐛 Out of Memory¶
Solution:
# Use smaller chunks
chunker = DocumentChunker(
provider="mistral",
max_tokens=2000, # Smaller chunks
merge_peers=False # Don't combine
)
Best Practices¶
👍 Match Provider¶
# ✅ Good - Match chunker to LLM provider
if using_mistral:
    chunker = DocumentChunker(provider="mistral")
elif using_openai:
    chunker = DocumentChunker(provider="openai")
👍 Enable Merge Peers¶
# ✅ Good - Better context
chunker = DocumentChunker(
provider="mistral",
merge_peers=True # Recommended
)
👍 Monitor Statistics¶
# ✅ Good - Check chunk distribution
max_tokens = 4096  # the same limit passed to the chunker
chunks, stats = chunker.chunk_document_with_stats(document)
if stats['max_tokens_in_chunk'] > max_tokens * 0.95:
    print("Warning: Chunks near limit")
👍 Adjust for Schema Complexity¶
# ✅ Good - Account for schema JSON
import json
schema_json = json.dumps(template.model_json_schema())
chunker = DocumentChunker(
provider="mistral",
schema_json=schema_json # Dynamic adjustment
)
Advanced Features¶
Custom Tokenizer¶
from transformers import AutoTokenizer
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
# Load custom tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("custom/model")
custom_tokenizer = HuggingFaceTokenizer(
tokenizer=hf_tokenizer,
max_tokens=4096
)
# Use with HybridChunker
from docling.chunking import HybridChunker
chunker = HybridChunker(
tokenizer=custom_tokenizer,
merge_peers=True
)
Recommended Chunk Size Calculation¶
from docling_graph.core.extractors import DocumentChunker
# Calculate recommended size
recommended = DocumentChunker.calculate_recommended_max_tokens(
context_limit=32000, # Mistral Large
system_prompt_tokens=500,
response_buffer_tokens=500
)
print(f"Recommended max_tokens: {recommended}")
# Output: Recommended max_tokens: 24800
Performance Impact¶
Real Tokenizer vs Heuristic¶
Benchmark Results (100-page document):
| Method | Chunks Created | Context Overflows | Processing Time | API Calls |
|---|---|---|---|---|
| Character Heuristic | 45 | 3 (6.7%) | 180s | 48 (3 retries) |
| Real Tokenizer | 38 | 0 (0%) | 152s | 38 (no retries) |
Improvements:
- ✅ 15% fewer chunks (better packing)
- ✅ Zero context overflows (vs 6.7%)
- ✅ 15% faster processing (no retries)
- ✅ 21% fewer API calls (no retries)
Safety Margin Impact¶
| Safety Margin | Chunk Efficiency | Context Overflows | Recommended For |
|---|---|---|---|
| 10% | 90% | Occasional | Aggressive batching |
| 20% (default) | 80% | Rare | General use |
| 30% | 70% | Very rare | Complex schemas |
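The trade-off in the table is direct arithmetic: whatever margin you reserve is chunk capacity you give up. For a 4096-token budget:
max_tokens = 4096
for margin in (0.10, 0.20, 0.30):
    effective = int(max_tokens * (1 - margin))
    print(f"{margin:.0%} margin -> {effective} tokens per chunk")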
Next Steps¶
Now that you understand chunking:
- Model Capabilities → Learn about adaptive prompting
- Extraction Backends → Learn about LLM and VLM backends
- Batch Processing → Optimize chunk processing
- Model Merging → Consolidate chunk extractions
- Performance Tuning → Advanced optimization