Pipeline Configuration¶
Overview¶
Pipeline configuration controls how Docling Graph processes documents and extracts knowledge graphs. The PipelineConfig class provides a type-safe, programmatic way to configure all aspects of the extraction pipeline.
In this section:
- Understanding PipelineConfig
- Backend selection (LLM vs VLM)
- Model configuration
- Processing modes
- Export settings
- Advanced configuration
What is Pipeline Configuration?¶
Pipeline configuration defines:
- What to extract - Source document and template
- How to extract - Backend, model, and processing mode
- How to process - Chunking, consolidation, and validation
- What to export - Output formats and locations
Configuration Methods¶
You can configure the pipeline in three ways:
1. Python API (Recommended)¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    output_dir="outputs"
)
run_pipeline(config)
2. CLI with Flags¶
uv run docling-graph convert document.pdf \
--template "templates.BillingDocument" \
--backend llm \
--inference remote \
--output-dir outputs
3. YAML Configuration File¶
# config.yaml
defaults:
  backend: llm
  inference: remote
  processing_mode: many-to-one
  export_format: csv

models:
  llm:
    remote:
      model: "mistral-small-latest"
      provider: "mistral"
Quick Start¶
Minimal Configuration¶
from docling_graph import run_pipeline, PipelineConfig
# Minimal config - uses all defaults
config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate"
)
run_pipeline(config)
Defaults:
- Backend: llm
- Inference: local
- Processing mode: many-to-one
- Export format: csv
- Output directory: outputs
Common Configurations¶
Remote API Extraction¶
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    model_override="gpt-4-turbo",
    provider_override="openai"
)
Local GPU Extraction¶
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b",
    provider_override="vllm"
)
VLM (Vision) Extraction¶
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local",  # VLM only supports local
    docling_config="vision"
)
Configuration Architecture¶
Configuration Flow¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% Subgraph Styling (Transparent with dashed border for visibility)
classDef subgraph_style fill:none,stroke:#969696,stroke-width:2px,stroke-dasharray: 5,color:#969696
%% 2. Define Nodes & Subgraphs
A@{ shape: procs, label: "PipelineConfig" }
subgraph Backends ["Backend Configuration"]
B@{ shape: lin-proc, label: "Backend Selection" }
F@{ shape: tag-proc, label: "LLM Backend" }
G@{ shape: tag-proc, label: "VLM Backend" }
end
subgraph Models ["Inference Settings"]
C@{ shape: lin-proc, label: "Model Selection" }
H@{ shape: tag-proc, label: "Local Inference" }
I@{ shape: tag-proc, label: "Remote Inference" }
end
subgraph Strategy ["Processing Mode"]
D@{ shape: lin-proc, label: "Processing Mode" }
J@{ shape: tag-proc, label: "One-to-One" }
K@{ shape: tag-proc, label: "Many-to-One" }
end
subgraph Exports ["Output Settings"]
E@{ shape: lin-proc, label: "Export Settings" }
L@{ shape: tag-proc, label: "CSV Export" }
M@{ shape: tag-proc, label: "Cypher Export" }
end
%% 3. Define Connections
A --> B & C & D & E
B --> F & G
C --> H & I
D --> J & K
E --> L & M
%% 4. Apply Classes
class A config
class B,C,D,E process
class F,G,H,I,J,K operator
class L,M output
class Backends,Models,Strategy,Exports subgraph_style
Configuration Hierarchy¶
PipelineConfig
├── Source & Template (required)
│ ├── source: Path to document
│ └── template: Pydantic template
│
├── Backend Configuration
│ ├── backend: llm | vlm
│ ├── inference: local | remote
│ └── models: Model configurations
│
├── Processing Configuration
│ ├── processing_mode: one-to-one | many-to-one
│ ├── docling_config: ocr | vision
│ ├── use_chunking: bool
│ └── llm_consolidation: bool
│
├── Export Configuration
│ ├── export_format: csv | cypher
│ ├── export_docling: bool
│ └── output_dir: Path
│
└── Advanced Settings
├── max_batch_size: int
├── reverse_edges: bool
└── chunker_config: dict
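The advanced settings at the bottom of the tree are ordinary constructor fields. An illustrative sketch (the values here are arbitrary examples, not recommendations):

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    max_batch_size=4,    # example value; the default is 1
    reverse_edges=True   # default is False
)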
Key Configuration Decisions¶
1. Backend: LLM vs VLM¶
Choose LLM when:
- Processing text-heavy documents
- Need remote API support
- Want flexible model selection
- Cost is a concern (remote APIs)

Choose VLM when:
- Processing image-heavy documents
- Need vision understanding
- Have local GPU available
- Want highest accuracy for complex layouts
See: Backend Selection
2. Inference: Local vs Remote¶
Choose Local when:
- Have GPU available
- Processing sensitive data
- Need offline capability
- Want to avoid API costs

Choose Remote when:
- No GPU available
- Need quick setup
- Want latest models
- Processing non-sensitive data
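One practical pattern is to pick the mode from the environment, falling back to local inference when no API key is set. An illustrative sketch (the key name depends on your provider):

import os

from docling_graph import PipelineConfig

# Fall back to local inference when no API key is available
# (MISTRAL_API_KEY is just one of the supported provider keys)
inference = "remote" if os.environ.get("MISTRAL_API_KEY") else "local"

config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate",
    inference=inference
)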
See: Model Configuration
3. Processing Mode: One-to-One vs Many-to-One¶
Choose One-to-One when:
- Documents have distinct pages
- Need page-level granularity
- Pages are independent

Choose Many-to-One when:
- Document is a single entity
- Need document-level view
- Want consolidated output
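Switching between the two is a one-field change:

from docling_graph import PipelineConfig

# One graph per page
page_config = PipelineConfig(
    source="report.pdf",
    template="templates.MyTemplate",
    processing_mode="one-to-one"
)

# One consolidated graph for the whole document (the default)
doc_config = PipelineConfig(
    source="report.pdf",
    template="templates.MyTemplate",
    processing_mode="many-to-one"
)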
See: Processing Modes
Configuration Validation¶
PipelineConfig validates your configuration:
from docling_graph import run_pipeline, PipelineConfig
# Invalid combinations raise a ValidationError (a ValueError subclass)
try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="vlm",
        inference="remote"  # ❌ VLM doesn't support remote
    )
except ValueError as e:
    print(f"Configuration error: {e}")
    # Output: VLM backend currently only supports local inference
Common Validation Errors¶
| Error | Cause | Solution |
|---|---|---|
| VLM remote inference | VLM + remote | Use inference="local" or backend="llm" |
| Missing source | No source specified | Provide source="path/to/doc" |
| Missing template | No template specified | Provide template="module.Class" |
| Invalid backend | Wrong backend value | Use "llm" or "vlm" |
| Invalid inference | Wrong inference value | Use "local" or "remote" |
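During development it can help to centralize this handling. A hypothetical helper (build_config is not part of the library):

from typing import Optional

from docling_graph import PipelineConfig

def build_config(**kwargs) -> Optional[PipelineConfig]:
    """Hypothetical helper: return None instead of raising on bad settings."""
    try:
        return PipelineConfig(**kwargs)
    except ValueError as e:
        print(f"Configuration error: {e}")
        return None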
Default Values¶
PipelineConfig provides sensible defaults:
# All defaults
PipelineConfig(
    source="",    # Required at runtime
    template="",  # Required at runtime
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    docling_config="ocr",
    use_chunking=True,
    llm_consolidation=False,
    export_format="csv",
    export_docling=True,
    export_docling_json=True,
    export_markdown=True,
    export_per_page_markdown=False,
    reverse_edges=False,
    output_dir="outputs",
    max_batch_size=1
)
See: Configuration Basics for details on each setting.
Environment Variables¶
Some settings can be configured via environment variables:
# API Keys
export OPENAI_API_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export WATSONX_API_KEY="your-key"
# Model Configuration
export VLLM_BASE_URL="http://localhost:8000/v1"
export OLLAMA_BASE_URL="http://localhost:11434"
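A quick pre-flight check in Python before a remote run (a sketch; substitute the key your provider expects):

import os

# Fail fast if the provider's API key is missing
# (here MISTRAL_API_KEY; swap in the key for your provider)
if not os.environ.get("MISTRAL_API_KEY"):
    raise RuntimeError("Set MISTRAL_API_KEY before running remote inference")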
Configuration Best Practices¶
1. Start Simple¶
# ✅ Good - Start with defaults
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

# ❌ Bad - Over-configure initially
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=False,
    # ... many more settings
)
2. Override Only What's Needed¶
# ✅ Good - Override specific settings
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",           # Only change this
    model_override="gpt-4-turbo"  # And this
)
3. Use Type Hints¶
from docling_graph import run_pipeline, PipelineConfig
# ✅ Good - Type hints help catch errors
config: PipelineConfig = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)
4. Validate Early¶
# ✅ Good - Validate config before running
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local"
)

# Check config is valid
print(f"Backend: {config.backend}")
print(f"Inference: {config.inference}")

# Then run
run_pipeline(config)
Next Steps¶
Ready to configure your pipeline?
- Configuration Basics - Learn PipelineConfig fundamentals
- Backend Selection - Choose the right backend
- Configuration Examples - See complete scenarios