PipelineConfig¶
Overview¶
PipelineConfig is a type-safe configuration class built with Pydantic that provides validation, defaults, and IDE autocomplete for pipeline configuration.
**Key Features:**

- Type validation
- Default values
- IDE autocomplete
- Validation errors
- Convenience methods
Basic Usage¶
```python
from docling_graph import run_pipeline, PipelineConfig

# Create configuration
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote"
)

# Run pipeline
run_pipeline(config)
```
Constructor Parameters¶
Required Parameters¶
| Parameter | Type | Description |
|---|---|---|
| `source` | `str \| Path` | Path to source document |
| `template` | `str \| Type[BaseModel]` | Pydantic template (dotted path or class) |
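Both required parameters accept two forms: `source` as a plain string or a `pathlib.Path`, and `template` as a dotted import path or a Pydantic model class (see "Template as Class" below). A minimal sketch of the string and `Path` variants:

```python
from pathlib import Path
from docling_graph import PipelineConfig

# String path and dotted template path
config_a = PipelineConfig(
    source="invoice.pdf",
    template="templates.BillingDocument"
)

# A pathlib.Path works equally well for source
config_b = PipelineConfig(
    source=Path("invoice.pdf"),
    template="templates.BillingDocument"
)
```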
Core Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `backend` | `Literal["llm", "vlm"]` | `"llm"` | Backend type |
| `inference` | `Literal["local", "remote"]` | `"local"` | Inference mode |
| `processing_mode` | `Literal["one-to-one", "many-to-one"]` | `"many-to-one"` | Processing strategy |
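The examples later on this page pair `processing_mode="one-to-one"` with single-page forms and `"many-to-one"` (the default) with multi-page documents. A condensed sketch of the two styles:

```python
from docling_graph import PipelineConfig

# Per-unit extraction, as in the Local VLM example below
form_config = PipelineConfig(
    source="form.jpg",
    template="templates.IDCard",
    processing_mode="one-to-one"
)

# Whole-document extraction (the default), as in the Remote LLM example below
report_config = PipelineConfig(
    source="research.pdf",
    template="templates.ScholarlyRheologyPaper",
    processing_mode="many-to-one"
)
```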
Docling Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `docling_config` | `Literal["ocr", "vision"]` | `"ocr"` | Docling pipeline type |
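The default `"ocr"` pipeline is used by every other example on this page; the Local VLM example below pairs `docling_config="vision"` with image input. A minimal sketch of that override:

```python
from docling_graph import PipelineConfig

# Vision pipeline for image input (see the Local VLM example below)
config = PipelineConfig(
    source="form.jpg",
    template="templates.IDCard",
    backend="vlm",
    docling_config="vision"
)
```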
Model Overrides¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_override` | `str \| None` | `None` | Override model name |
| `provider_override` | `str \| None` | `None` | Override provider name |
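These overrides let you swap the provider and model without building a full `ModelsConfig` (see Models Configuration below). A short sketch using the same names as the Remote LLM example:

```python
from docling_graph import PipelineConfig

# Pin the remote LLM to a specific provider and model
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest"
)
```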
Custom LLM Client¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `llm_client` | `LLMClientProtocol \| None` | `None` | Custom LLM client instance. When set, the pipeline uses this client for all LLM calls and does not initialize a provider/model from config. Use this to target a custom inference URL, on-prem endpoint, or any client implementing `get_json_response(prompt, schema_json) -> dict \| list`. |
**Usage:** Pass any object that implements `LLMClientProtocol` (e.g. a LiteLLM-backed client with a custom `base_url`). See LLM Clients — Custom LLM Clients for a full example.
```python
from docling_graph import PipelineConfig, run_pipeline

# Your custom client (must implement get_json_response(prompt, schema_json))
custom_client = MyLiteLLMEndpointClient(base_url="https://...", model="openai/...")

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    llm_client=custom_client,
)

run_pipeline(config)  # or config.run()
```
Extraction Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `use_chunking` | `bool` | `True` | Enable document chunking |
| `llm_consolidation` | `bool` | `False` | Enable LLM consolidation |
| `max_batch_size` | `int` | `1` | Maximum batch size |
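As a sketch, here is how the three settings might be combined for a long document; the filename and `max_batch_size` value are illustrative, not recommendations:

```python
from docling_graph import PipelineConfig

# Chunk a long document, then let the LLM consolidate partial results
config = PipelineConfig(
    source="long_report.pdf",
    template="templates.BillingDocument",
    use_chunking=True,        # split the document before extraction
    llm_consolidation=True,   # merge chunk-level results with the LLM
    max_batch_size=4          # illustrative; the default is 1
)
```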
Debug Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `debug` | `bool` | `False` | Enable debug mode to save all intermediate extraction artifacts |
Export Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dump_to_disk` | `bool \| None` | `None` | Control file exports. `None` = auto (CLI = `True`, API = `False`), `True` = always, `False` = never |
| `export_format` | `Literal["csv", "cypher"]` | `"csv"` | Export format |
| `export_docling` | `bool` | `True` | Export Docling outputs |
| `export_docling_json` | `bool` | `True` | Export Docling JSON |
| `export_markdown` | `bool` | `True` | Export markdown |
| `export_per_page_markdown` | `bool` | `False` | Export per-page markdown |
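For instance, a sketch that forces file exports, writes Cypher instead of the default CSV, and skips the markdown outputs, using only the parameters above:

```python
from docling_graph import PipelineConfig

# Export the graph as Cypher and keep the on-disk output lean
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    dump_to_disk=True,        # force file exports even from the API
    export_format="cypher",   # instead of the default "csv"
    export_markdown=False,
    export_per_page_markdown=False
)
```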
Graph Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `reverse_edges` | `bool` | `False` | Create bidirectional edges |
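A one-flag sketch for when downstream graph queries need to traverse relationships in both directions:

```python
from docling_graph import PipelineConfig

# Create bidirectional edges in the resulting graph
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    reverse_edges=True
)
```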
Output Settings¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_dir` | `str \| Path` | `"outputs"` | Output directory path |
Models Configuration¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `models` | `ModelsConfig` | Default models | Models configuration |
Methods¶
run()¶
Execute the pipeline with this configuration.
```python
from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

# Returns PipelineContext with results
context = run_pipeline(config)
graph = context.knowledge_graph
```
**Returns:** `PipelineContext` containing the knowledge graph, Pydantic model, and other results.

**Raises:** `PipelineError`, `ConfigurationError`, `ExtractionError`

> **Accessing pipeline return values:** Use `run_pipeline(config)` instead of `config.run()` when you need the return value.
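Since `run_pipeline` can raise the errors listed above, a defensive caller might wrap it. A minimal sketch, assuming the exception classes are importable from the top-level `docling_graph` package (check your installation's actual import path):

```python
from docling_graph import run_pipeline, PipelineConfig
# Assumed import location for the exception classes - adjust if your
# version exposes them elsewhere (e.g. a docling_graph.exceptions module)
from docling_graph import PipelineError, ConfigurationError, ExtractionError

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

try:
    context = run_pipeline(config)
except ConfigurationError as e:
    print(f"Bad configuration: {e}")
except ExtractionError as e:
    print(f"Extraction failed: {e}")
except PipelineError as e:
    print(f"Pipeline error: {e}")
```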
to_dict()¶
Convert configuration to dictionary format.
```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

config_dict = config.to_dict()
print(config_dict)
# {
#     "source": "document.pdf",
#     "template": "templates.BillingDocument",
#     "backend": "llm",
#     ...
# }
```
**Returns:** `Dict[str, Any]`
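One practical use is persisting a run's settings alongside its outputs. A sketch, assuming the dictionary becomes JSON-serializable once non-string values such as `Path` are stringified; the filename is illustrative:

```python
import json
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

# default=str stringifies Path values and anything else non-serializable
with open("run_config.json", "w") as f:
    json.dump(config.to_dict(), f, indent=2, default=str)
```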
Complete Examples¶
📍 Minimal Config (API Mode)¶
```python
from docling_graph import run_pipeline, PipelineConfig

# Only required parameters - no file exports by default
config = PipelineConfig(
    source="invoice.pdf",
    template="templates.BillingDocument"
)

# Returns data in memory
context = run_pipeline(config)
graph = context.knowledge_graph
invoice = context.pydantic_model
```
📍 Debug Mode Enabled¶
```python
from docling_graph import run_pipeline, PipelineConfig

# Enable debug mode for troubleshooting
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    debug=True,          # Save all intermediate artifacts
    dump_to_disk=True,   # Also save final outputs
    output_dir="outputs/debug_run"
)

context = run_pipeline(config)

# Debug artifacts available at:
# outputs/debug_run/document_pdf_20260206_094500/debug/
print(f"Debug artifacts saved to: {context.output_dir}/debug/")
```
📍 Minimal Config With File Exports¶
```python
from docling_graph import run_pipeline, PipelineConfig

# Enable file exports
config = PipelineConfig(
    source="invoice.pdf",
    template="templates.BillingDocument",
    dump_to_disk=True,
    output_dir="outputs/invoice"
)

# Returns data AND writes files
context = run_pipeline(config)
```
📍 Remote LLM¶
```python
import os
from docling_graph import run_pipeline, PipelineConfig

# Set API key
os.environ["MISTRAL_API_KEY"] = "your-key"

# Configure for remote inference
config = PipelineConfig(
    source="research.pdf",
    template="templates.ScholarlyRheologyPaper",
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=True
)

run_pipeline(config)
```
📍 Local VLM¶
```python
from docling_graph import run_pipeline, PipelineConfig

# VLM for form extraction
config = PipelineConfig(
    source="form.jpg",
    template="templates.IDCard",
    backend="vlm",
    inference="local",  # VLM only supports local
    processing_mode="one-to-one",
    docling_config="vision"
)

run_pipeline(config)
```
📍 Template as Class¶
```python
from pydantic import BaseModel, Field
from docling_graph import run_pipeline, PipelineConfig

# Define template inline
class Invoice(BaseModel):
    """Invoice template."""
    invoice_number: str = Field(description="Invoice number")
    total: float = Field(description="Total amount")

# Pass class directly
config = PipelineConfig(
    source="invoice.pdf",
    template=Invoice  # Class instead of string
)

run_pipeline(config)
```
📍 Custom Models Configuration¶
```python
from docling_graph import LLMConfig, ModelConfig, ModelsConfig, PipelineConfig, run_pipeline

# Custom models configuration
models = ModelsConfig(
    llm=LLMConfig(
        remote=ModelConfig(
            model="gpt-4o",
            provider="openai"
        )
    )
)

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    models=models
)

run_pipeline(config)
```
For full registry and override details, see `docs/usage/api/llm-model-config.md`.
Validation¶
Automatic Validation¶
PipelineConfig validates parameters at creation:
```python
from docling_graph import PipelineConfig

# This raises a ValidationError (a ValueError subclass)
try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="invalid"  # Invalid value
    )
except ValueError as e:
    print(f"Validation error: {e}")
```
VLM Constraints¶
VLM backend only supports local inference:
```python
from docling_graph import PipelineConfig

# This raises a ValidationError (a ValueError subclass)
try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="vlm",
        inference="remote"  # Not allowed for VLM
    )
except ValueError as e:
    print(f"VLM only supports local inference: {e}")
```
Type Safety Benefits¶
IDE Autocomplete¶
```python
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",                 # IDE suggests: "llm" | "vlm"
    inference="remote",            # IDE suggests: "local" | "remote"
    processing_mode="many-to-one"  # IDE suggests valid options
)
```
Type Checking¶
```python
from docling_graph import PipelineConfig

# mypy will catch this error
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    use_chunking="yes"  # Error: expected bool, got str
)
```
Advanced Usage¶
Programmatic Configuration¶
```python
from docling_graph import run_pipeline, PipelineConfig

def create_config(source: str, template: str, use_remote: bool = False) -> PipelineConfig:
    """Factory function for creating configurations."""
    return PipelineConfig(
        source=source,
        template=template,
        backend="llm",
        inference="remote" if use_remote else "local",
        provider_override="mistral" if use_remote else "ollama"
    )

# Use factory
config = create_config("document.pdf", "templates.BillingDocument", use_remote=True)
run_pipeline(config)
```
Configuration Templates¶
```python
from docling_graph import run_pipeline, PipelineConfig

# Base configuration
BASE_CONFIG = {
    "backend": "llm",
    "inference": "remote",
    "provider_override": "mistral",
    "use_chunking": True,
    "llm_consolidation": False
}

# Create specific configurations
def process_invoice(source: str):
    config = PipelineConfig(
        source=source,
        template="templates.BillingDocument",
        **BASE_CONFIG
    )
    run_pipeline(config)

def process_research(source: str):
    # Merge the override into the base dict; passing llm_consolidation=True
    # alongside **BASE_CONFIG would raise a "multiple values" TypeError
    config = PipelineConfig(
        source=source,
        template="templates.ScholarlyRheologyPaper",
        **{**BASE_CONFIG, "llm_consolidation": True}  # Override for research
    )
    run_pipeline(config)
```
Dynamic Configuration¶
```python
from pathlib import Path
from docling_graph import run_pipeline, PipelineConfig

def smart_config(source: str) -> PipelineConfig:
    """Create configuration based on document characteristics."""
    path = Path(source)
    file_size = path.stat().st_size

    # Choose settings based on file size
    if file_size < 1_000_000:  # < 1MB
        use_chunking = False
        processing = "one-to-one"
    else:
        use_chunking = True
        processing = "many-to-one"

    # Choose backend based on extension
    if path.suffix.lower() in ['.jpg', '.png']:
        backend = "vlm"
    else:
        backend = "llm"

    return PipelineConfig(
        source=source,
        template="templates.BillingDocument",
        backend=backend,
        processing_mode=processing,
        use_chunking=use_chunking
    )

# Use smart configuration
config = smart_config("document.pdf")
run_pipeline(config)
```
Configuration Patterns¶
Pattern 1: Environment-Based Configuration¶
```python
import os
from docling_graph import run_pipeline, PipelineConfig

def get_config(source: str, template: str) -> PipelineConfig:
    """Get configuration based on environment."""
    env = os.getenv("ENVIRONMENT", "development")

    if env == "production":
        return PipelineConfig(
            source=source,
            template=template,
            backend="llm",
            inference="remote",
            provider_override="mistral",
            model_override="mistral-large-latest",
            llm_consolidation=True
        )
    else:
        return PipelineConfig(
            source=source,
            template=template,
            backend="llm",
            inference="local",
            provider_override="ollama",
            llm_consolidation=False
        )

config = get_config("document.pdf", "templates.BillingDocument")
run_pipeline(config)
```
Pattern 2: Configuration Builder¶
```python
from docling_graph import run_pipeline, PipelineConfig

class ConfigBuilder:
    """Builder pattern for PipelineConfig."""

    def __init__(self, source: str, template: str):
        self.config_dict = {
            "source": source,
            "template": template
        }

    def with_remote_llm(self, provider: str, model: str) -> "ConfigBuilder":
        self.config_dict.update({
            "backend": "llm",
            "inference": "remote",
            "provider_override": provider,
            "model_override": model
        })
        return self

    def with_chunking(self, enabled: bool = True) -> "ConfigBuilder":
        self.config_dict["use_chunking"] = enabled
        return self

    def with_consolidation(self, enabled: bool = True) -> "ConfigBuilder":
        self.config_dict["llm_consolidation"] = enabled
        return self

    def build(self) -> PipelineConfig:
        return PipelineConfig(**self.config_dict)

# Use builder
config = (
    ConfigBuilder("document.pdf", "templates.BillingDocument")
    .with_remote_llm("mistral", "mistral-large-latest")
    .with_chunking(True)
    .with_consolidation(True)
    .build()
)
run_pipeline(config)
```
Best Practices¶
👍 Use Type-Safe Configuration¶
```python
# ✅ Good - Type-safe with validation
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm"  # Validated at creation
)
```

```python
# ❌ Avoid - Dictionary without validation
config = {
    "source": "document.pdf",
    "template": "templates.BillingDocument",
    "backend": "invalid"  # No validation
}
```
👍 Use Defaults When Possible¶
```python
# ✅ Good - Rely on sensible defaults
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
    # Uses default backend, inference, etc.
)
```

```python
# ❌ Avoid - Specifying every parameter
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    use_chunking=True,
    # ... all defaults
)
```
Troubleshooting¶
🐛 Validation Error¶
**Error:** `ValidationError` raised at configuration creation because a parameter has an invalid value (e.g. `backend="invalid"`).

**Solution:**

```python
# Use valid values
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm"  # Valid: "llm" or "vlm"
)
```
🐛 VLM Remote Inference¶
**Error:** `ValidationError` raised because the VLM backend only supports local inference.

**Solution:**

```python
# VLM only supports local
config = PipelineConfig(
    source="form.jpg",
    template="templates.IDCard",
    backend="vlm",
    inference="local"  # Must be local for VLM
)
```
Next Steps¶
- Programmatic Examples → More code examples
- Batch Processing → Batch patterns
- API Reference → Complete API docs