
Pipeline Configuration

Overview

Pipeline configuration controls how Docling Graph processes documents and extracts knowledge graphs. The PipelineConfig class provides a type-safe, programmatic way to configure all aspects of the extraction pipeline.

In this section:

  - Understanding PipelineConfig
  - Backend selection (LLM vs VLM)
  - Model configuration
  - Processing modes
  - Export settings
  - Advanced configuration


What is Pipeline Configuration?

Pipeline configuration defines:

  1. What to extract - Source document and template
  2. How to extract - Backend, model, and processing mode
  3. How to process - Chunking, consolidation, and validation
  4. What to export - Output formats and locations
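The sketch below maps these four aspects onto PipelineConfig fields. It is a minimal, illustrative example; the file name, template path, and field values are placeholders.

from docling_graph import PipelineConfig

config = PipelineConfig(
    # 1. What to extract
    source="document.pdf",
    template="templates.BillingDocument",
    # 2. How to extract
    backend="llm",
    inference="local",
    # 3. How to process
    processing_mode="many-to-one",
    use_chunking=True,
    # 4. What to export
    export_format="csv",
    output_dir="outputs",
)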

Configuration Methods

You can configure the pipeline in three ways:

1. Python API

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    output_dir="outputs"
)

run_pipeline(config)

2. CLI with Flags

uv run docling-graph convert document.pdf \
    --template "templates.BillingDocument" \
    --backend llm \
    --inference remote \
    --output-dir outputs

3. YAML Configuration File

# config.yaml
defaults:
  backend: llm
  inference: remote
  processing_mode: many-to-one
  export_format: csv

models:
  llm:
    remote:
      model: "mistral-small-latest"
      provider: "mistral"

Quick Start

Minimal Configuration

from docling_graph import run_pipeline, PipelineConfig

# Minimal config - uses all defaults
config = PipelineConfig(
    source="document.pdf",
    template="templates.MyTemplate"
)

run_pipeline(config)

Defaults:

  - Backend: llm
  - Inference: local
  - Processing mode: many-to-one
  - Export format: csv
  - Output directory: outputs

Common Configurations

Remote API Extraction

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    model_override="gpt-4-turbo",
    provider_override="openai"
)

Local GPU Extraction

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b",
    provider_override="vllm"
)

VLM (Vision) Extraction

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local",  # VLM only supports local
    docling_config="vision"
)

Configuration Architecture

Configuration Flow

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% Subgraph Styling (Transparent with dashed border for visibility)
    classDef subgraph_style fill:none,stroke:#969696,stroke-width:2px,stroke-dasharray: 5,color:#969696

    %% 2. Define Nodes & Subgraphs
    A@{ shape: procs, label: "PipelineConfig" }

    subgraph Backends ["Backend Configuration"]
        B@{ shape: lin-proc, label: "Backend Selection" }
        F@{ shape: tag-proc, label: "LLM Backend" }
        G@{ shape: tag-proc, label: "VLM Backend" }
    end

    subgraph Models ["Inference Settings"]
        C@{ shape: lin-proc, label: "Model Selection" }
        H@{ shape: tag-proc, label: "Local Inference" }
        I@{ shape: tag-proc, label: "Remote Inference" }
    end

    subgraph Strategy ["Processing Mode"]
        D@{ shape: lin-proc, label: "Processing Mode" }
        J@{ shape: tag-proc, label: "One-to-One" }
        K@{ shape: tag-proc, label: "Many-to-One" }
    end

    subgraph Exports ["Output Settings"]
        E@{ shape: lin-proc, label: "Export Settings" }
        L@{ shape: tag-proc, label: "CSV Export" }
        M@{ shape: tag-proc, label: "Cypher Export" }
    end

    %% 3. Define Connections
    A --> B & C & D & E

    B --> F & G
    C --> H & I
    D --> J & K
    E --> L & M

    %% 4. Apply Classes
    class A config
    class B,C,D,E process
    class F,G,H,I,J,K operator
    class L,M output
    class Backends,Models,Strategy,Exports subgraph_style

Configuration Hierarchy

PipelineConfig
├── Source & Template (required)
│   ├── source: Path to document
│   └── template: Pydantic template
├── Backend Configuration
│   ├── backend: llm | vlm
│   ├── inference: local | remote
│   └── models: Model configurations
├── Processing Configuration
│   ├── processing_mode: one-to-one | many-to-one
│   ├── docling_config: ocr | vision
│   ├── use_chunking: bool
│   └── llm_consolidation: bool
├── Export Configuration
│   ├── export_format: csv | cypher
│   ├── export_docling: bool
│   └── output_dir: Path
└── Advanced Settings
    ├── max_batch_size: int
    ├── reverse_edges: bool
    └── chunker_config: dict
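As a hedged sketch of the export and advanced branches of this hierarchy, the config below requests Cypher output with reversed edges; the output path, batch size, and edge direction are illustrative values, not recommendations.

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    # Export configuration
    export_format="cypher",
    export_docling=False,
    output_dir="outputs/cypher_run",
    # Advanced settings
    max_batch_size=4,
    reverse_edges=True,
)

run_pipeline(config)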

Key Configuration Decisions

1. Backend: LLM vs VLM

Choose LLM when:

  - Processing text-heavy documents
  - Need remote API support
  - Want flexible model selection
  - Cost is a concern (remote APIs)

Choose VLM when:

  - Processing image-heavy documents
  - Need vision understanding
  - Have a local GPU available
  - Want highest accuracy for complex layouts

See: Backend Selection

2. Inference: Local vs Remote

Choose Local when:

  - Have a GPU available
  - Processing sensitive data
  - Need offline capability
  - Want to avoid API costs

Choose Remote when:

  - No GPU available
  - Need quick setup
  - Want the latest models
  - Processing non-sensitive data

See: Model Configuration

3. Processing Mode: One-to-One vs Many-to-One

Choose One-to-One when:

  - Documents have distinct pages
  - Need page-level granularity
  - Pages are independent

Choose Many-to-One when:

  - Document is a single entity
  - Need a document-level view
  - Want consolidated output

See: Processing Modes
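In code, the two modes differ only in the processing_mode field. A minimal sketch (file and template names are placeholders):

from docling_graph import PipelineConfig

# One graph per page - page-level granularity
per_page = PipelineConfig(
    source="report.pdf",
    template="templates.MyTemplate",
    processing_mode="one-to-one",
)

# One consolidated graph for the whole document (the default)
consolidated = PipelineConfig(
    source="report.pdf",
    template="templates.MyTemplate",
    processing_mode="many-to-one",
)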


Configuration Validation

PipelineConfig validates your configuration:

from docling_graph import run_pipeline, PipelineConfig

# This will raise ValidationError
try:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        backend="vlm",
        inference="remote"  # ❌ VLM doesn't support remote
    )
except ValueError as e:
    print(f"Configuration error: {e}")
    # Output: VLM backend currently only supports local inference

Common Validation Errors

| Error                | Cause                  | Solution                                |
|----------------------|------------------------|-----------------------------------------|
| VLM remote inference | VLM + remote           | Use inference="local" or backend="llm"  |
| Missing source       | No source specified    | Provide source="path/to/doc"            |
| Missing template     | No template specified  | Provide template="module.Class"         |
| Invalid backend      | Wrong backend value    | Use "llm" or "vlm"                      |
| Invalid inference    | Wrong inference value  | Use "local" or "remote"                 |

Default Values

PipelineConfig provides sensible defaults:

# All defaults
PipelineConfig(
    source="",  # Required at runtime
    template="",  # Required at runtime
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    docling_config="ocr",
    use_chunking=True,
    llm_consolidation=False,
    export_format="csv",
    export_docling=True,
    export_docling_json=True,
    export_markdown=True,
    export_per_page_markdown=False,
    reverse_edges=False,
    output_dir="outputs",
    max_batch_size=1
)

See: Configuration Basics for details on each setting.


Environment Variables

Some settings can be configured via environment variables:

# API Keys
export OPENAI_API_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export WATSONX_API_KEY="your-key"

# Model Configuration
export VLLM_BASE_URL="http://localhost:8000/v1"
export OLLAMA_BASE_URL="http://localhost:11434"
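If you prefer to keep everything in one script, the same variables can be set from Python with os.environ before the pipeline runs. This is an optional sketch, not something the library requires; the model and provider values mirror the YAML example above.

import os

from docling_graph import run_pipeline, PipelineConfig

# Set the API key for this process only (placeholder value)
os.environ["MISTRAL_API_KEY"] = "your-key"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",
    model_override="mistral-small-latest",
    provider_override="mistral",
)

run_pipeline(config)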

See: Installation: API Keys


Configuration Best Practices

1. Start Simple

# ✅ Good - Start with defaults
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

# ❌ Bad - Over-configure initially
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=False,
    # ... many more settings
)

2. Override Only What's Needed

# ✅ Good - Override specific settings
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",  # Only change this
    model_override="gpt-4-turbo"  # And this
)

3. Use Type Hints

from docling_graph import run_pipeline, PipelineConfig

# ✅ Good - Type hints help catch errors
config: PipelineConfig = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument"
)

4. Validate Early

# ✅ Good - Validate config before running
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local"
)

# Check config is valid
print(f"Backend: {config.backend}")
print(f"Inference: {config.inference}")

# Then run
run_pipeline(config)

Next Steps

Ready to configure your pipeline?

  1. Configuration Basics - Learn PipelineConfig fundamentals
  2. Backend Selection - Choose the right backend
  3. Configuration Examples - See complete scenarios