Model Configuration

Overview

Model configuration determines which AI model processes your documents. Docling Graph supports multiple providers for both local and remote inference, giving you flexibility in choosing the right model for your needs.

In this guide:

  • Local vs remote inference
  • Supported providers and models
  • Model capability tiers
  • Model selection strategies
  • Provider-specific configuration
  • Performance and cost considerations

New: Automatic Model Capability Detection

Docling Graph now automatically detects model capabilities based on parameter count and adapts prompts and consolidation strategies accordingly. See Model Capabilities for details.


Local vs Remote Inference

Quick Comparison

Aspect     Local Inference                  Remote Inference
Location   Your GPU/CPU                     Cloud API
Setup      Complex (GPU drivers, models)    Simple (API key)
Cost       Hardware + electricity           Pay per token
Speed      Fast (with GPU)                  Variable (network-dependent)
Privacy    Complete                         Data sent to provider
Offline    Yes                              No
Models     Limited by hardware              Latest models available

Local Inference

Overview

Local inference runs models on your own hardware (GPU or CPU). Best for privacy, offline use, and high-volume processing.

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="local",  # Local inference
    model_override="ibm-granite/granite-4.0-1b",
    provider_override="vllm"
)
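
With the config built, extraction is a single call (usage sketch, matching the run_pipeline import above):

run_pipeline(config)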

Supported Local Providers

1. vLLM

Best for: Fast local LLM inference with GPU

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b",
    provider_override="vllm"
)

Setup:

# Install vLLM
uv add vllm

# Start vLLM server
uv run python -m vllm.entrypoints.openai.api_server \
    --model ibm-granite/granite-4.0-1b \
    --port 8000

Supported Models:

  • ibm-granite/granite-4.0-1b (default, fast)
  • ibm-granite/granite-4.0-3b (balanced)
  • meta-llama/Llama-3.1-8B (high quality)
  • Any HuggingFace model compatible with vLLM

2. Ollama

Best for: Easy local setup, multiple models

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    model_override="llama-3.1-8b",
    provider_override="ollama"
)

Setup:

# Install Ollama (see ollama.ai)
# Pull model
ollama pull llama3.1:8b

# Ollama runs automatically on localhost:11434
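
To confirm the server is reachable before running the pipeline, you can query Ollama's HTTP API (a small sketch; the /api/tags endpoint lists the models you have pulled):

import json
import urllib.request

# Ollama's tag endpoint returns the locally pulled models
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    tags = json.load(resp)
print([model["name"] for model in tags.get("models", [])])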

Supported Models:

  • llama3.1:8b (recommended)
  • mistral:7b
  • mixtral:8x7b
  • Any model in the Ollama library

3. Docling VLM (For VLM Backend)

Best for: Vision-based extraction

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="vlm",
    inference="local",
    model_override="numind/NuExtract-2.0-8B",
    provider_override="docling"
)

Supported Models:

  • numind/NuExtract-2.0-8B (default, recommended)
  • numind/NuExtract-2.0-2B (faster, less accurate)

Local Inference Requirements

Hardware Requirements

Minimum (CPU only):

  • 16GB RAM
  • 50GB disk space
  • Slow processing

Recommended (GPU):

  • NVIDIA GPU with 8GB+ VRAM
  • 32GB RAM
  • 100GB disk space
  • CUDA 12.1+

Optimal (GPU):

  • NVIDIA GPU with 24GB+ VRAM (RTX 4090, A100)
  • 64GB RAM
  • 200GB SSD
  • CUDA 12.1+

Software Requirements

# CUDA drivers (for GPU)
nvidia-smi  # Verify CUDA installation

# Python packages
uv add vllm  # For vLLM
# or
# Install Ollama from ollama.ai
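
If vLLM is installed, PyTorch is pulled in with it, so a quick check from Python confirms the GPU is visible (a sketch, assuming a CUDA build of torch):

import torch

# True only when a CUDA-capable GPU and matching drivers are present
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Total VRAM in GiB: a quick way to check whether your model will fit
    print("VRAM (GiB):", torch.cuda.get_device_properties(0).total_memory / 2**30)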

See: Installation: GPU Setup


Remote Inference

Overview

Remote inference uses cloud API providers. Best for quick setup, latest models, and no hardware requirements.

Configuration

from docling_graph import run_pipeline, PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    backend="llm",
    inference="remote",  # Remote inference
    model_override="gpt-4-turbo",
    provider_override="openai"
)

Supported Remote Providers

1. OpenAI

Best for: Highest quality, latest models

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="gpt-4-turbo",
    provider_override="openai"
)

Setup:

# Set API key
export OPENAI_API_KEY="your-api-key"

Supported Models:

  • gpt-4-turbo (recommended, best quality)
  • gpt-4 (high quality)
  • gpt-3.5-turbo (fast, economical)

Pricing (approximate):

  • GPT-4 Turbo: $0.01/1K input tokens, $0.03/1K output tokens
  • GPT-3.5 Turbo: $0.0005/1K input tokens, $0.0015/1K output tokens
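
For budgeting, a rough per-document estimate is easy to compute; this sketch assumes ~1,000 input tokens per page and a 500-token structured output (adjust for your templates and any chunking overlap):

# Back-of-the-envelope GPT-4 Turbo cost at the rates above
pages_per_doc = 10
input_tokens = pages_per_doc * 1_000   # assumption: ~1K tokens per page
output_tokens = 500                    # assumption: size of the extracted JSON

cost_per_doc = (input_tokens / 1_000) * 0.01 + (output_tokens / 1_000) * 0.03
print(f"~${cost_per_doc:.3f} per document")             # ~$0.115
print(f"~${cost_per_doc * 1_000:,.0f} per 1,000 docs")  # ~$115, before retries/overlap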

2. Mistral AI

Best for: European provider, good balance

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="mistral-small-latest",
    provider_override="mistral"
)

Setup:

# Set API key
export MISTRAL_API_KEY="your-api-key"

Supported Models:

  • mistral-small-latest (default, economical)
  • mistral-medium-latest (balanced)
  • mistral-large-latest (highest quality)

Pricing (approximate):

  • Small: $0.001/1K tokens
  • Medium: $0.0027/1K tokens
  • Large: $0.008/1K tokens

3. Google Gemini

Best for: Multimodal capabilities, competitive pricing

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="gemini-2.5-flash",
    provider_override="gemini"
)

Setup:

# Set API key
export GOOGLE_API_KEY="your-api-key"

Supported Models:

  • gemini-2.5-flash (default, fast)
  • gemini-2.0-pro (high quality)

Pricing (approximate):

  • Flash: $0.00025/1K input tokens, $0.00075/1K output tokens
  • Pro: $0.00125/1K input tokens, $0.005/1K output tokens

4. IBM WatsonX

Best for: Enterprise deployments, IBM ecosystem

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="ibm/granite-13b-chat-v2",
    provider_override="watsonx"
)

Setup:

# Set API key and project ID
export WATSONX_API_KEY="your-api-key"
export WATSONX_PROJECT_ID="your-project-id"

See: API Keys Setup for WatsonX configuration details.


Model Capability Tiers

Docling Graph automatically categorizes models into capability tiers based on parameter count:

Tier Overview

Tier       Model Size   Prompt Style            Consolidation      Best For
SIMPLE     1B-7B        Minimal instructions    Basic merge        Simple forms, high volume
STANDARD   7B-13B       Balanced instructions   Standard merge     General documents
ADVANCED   13B+         Detailed instructions   Chain of Density   Complex documents, critical data

Automatic Detection

from docling_graph import run_pipeline, PipelineConfig

# Small model - Automatically uses SIMPLE tier
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    provider_override="vllm",
    model_override="ibm-granite/granite-4.0-1b"  # 1B params → SIMPLE
)

# Medium model - Automatically uses STANDARD tier
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    provider_override="ollama",
    model_override="llama3.1:8b"  # 8B params → STANDARD
)

# Large model - Automatically uses ADVANCED tier
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo"  # 175B+ params → ADVANCED
)
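
The thresholds behind this detection match the tier table above; a simplified sketch of the mapping (the real logic is internal to Docling Graph, which also infers parameter counts from model names, and the exact boundary handling is an assumption here):

def detect_tier(param_count_billions: float) -> str:
    """Map a model's parameter count to a capability tier (thresholds from the tier table)."""
    if param_count_billions < 7:
        return "SIMPLE"
    if param_count_billions < 13:
        return "STANDARD"
    return "ADVANCED"

print(detect_tier(1))    # SIMPLE   (granite-4.0-1b)
print(detect_tier(8))    # STANDARD (llama3.1:8b)
print(detect_tier(175))  # ADVANCED (gpt-4-turbo class)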

Performance Impact

Token Usage per Extraction:

Tier       Prompt Tokens   Consolidation     Total Overhead
SIMPLE     ~200-300        Single-turn       Low
STANDARD   ~400-500        Single-turn       Medium
ADVANCED   ~600-800        Multi-turn (3x)   High

Quality vs Speed:

Tier       Extraction Quality   Speed       Cost
SIMPLE     85-90%               ⚡ Fast      $ Low
STANDARD   90-95%               ⚡ Fast      $$ Medium
ADVANCED   95-98%               🐢 Slower    $$$ High

See Model Capabilities for complete details.


Model Selection Strategies

By Document Complexity

def get_model_config(document_complexity: str):
    """Choose model based on document complexity."""
    if document_complexity == "simple":
        # Simple documents: SIMPLE tier (fast, economical)
        return {
            "inference": "local",
            "model_override": "ibm-granite/granite-4.0-1b",  # SIMPLE tier
            "provider_override": "vllm"
        }
    elif document_complexity == "medium":
        # Medium complexity: STANDARD tier (balanced)
        return {
            "inference": "local",
            "model_override": "llama3.1:8b",  # STANDARD tier
            "provider_override": "ollama"
        }
    else:
        # Complex documents: ADVANCED tier (highest quality)
        return {
            "inference": "remote",
            "model_override": "gpt-4-turbo",  # ADVANCED tier
            "provider_override": "openai"
        }
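
These helpers return keyword arguments, so they unpack straight into PipelineConfig (usage sketch):

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    **get_model_config("simple")
)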

Capability Tier Matching

Match model capability tier to document complexity:

  • Simple forms/invoices → SIMPLE tier (1B-7B models)
  • General documents → STANDARD tier (7B-13B models)
  • Complex contracts/research → ADVANCED tier (13B+ models)

By Volume

def get_model_config(document_count: int):
    """Choose model based on processing volume."""
    if document_count < 100:
        # Low volume: use best quality
        return {
            "inference": "remote",
            "model_override": "gpt-4-turbo",
            "provider_override": "openai"
        }
    elif document_count < 1000:
        # Medium volume: balanced
        return {
            "inference": "remote",
            "model_override": "mistral-small-latest",
            "provider_override": "mistral"
        }
    else:
        # High volume: use local to avoid costs
        return {
            "inference": "local",
            "model_override": "ibm-granite/granite-4.0-1b",
            "provider_override": "vllm"
        }

By Budget

def get_model_config(budget: str):
    """Choose model based on budget."""
    if budget == "minimal":
        # Minimal cost: local inference
        return {
            "inference": "local",
            "model_override": "ibm-granite/granite-4.0-1b",
            "provider_override": "vllm"
        }
    elif budget == "moderate":
        # Moderate cost: economical API
        return {
            "inference": "remote",
            "model_override": "mistral-small-latest",
            "provider_override": "mistral"
        }
    else:
        # No budget constraint: best quality
        return {
            "inference": "remote",
            "model_override": "gpt-4-turbo",
            "provider_override": "openai"
        }

By Capability Tier

def get_model_by_tier(tier: str):
    """Choose model based on desired capability tier."""
    if tier == "SIMPLE":
        # SIMPLE tier: 1B-7B models
        return {
            "inference": "local",
            "model_override": "ibm-granite/granite-4.0-1b",
            "provider_override": "vllm"
        }
    elif tier == "STANDARD":
        # STANDARD tier: 7B-13B models
        return {
            "inference": "local",
            "model_override": "llama3.1:8b",
            "provider_override": "ollama"
        }
    else:  # ADVANCED
        # ADVANCED tier: 13B+ models
        return {
            "inference": "remote",
            "model_override": "gpt-4-turbo",
            "provider_override": "openai"
        }

By Quality Requirements

def get_model_by_quality(quality_requirement: str):
    """Choose model based on quality requirements."""
    if quality_requirement == "acceptable":
        # 85-90% accuracy: SIMPLE tier
        return {
            "inference": "local",
            "model_override": "ibm-granite/granite-4.0-1b",
            "provider_override": "vllm"
        }
    elif quality_requirement == "high":
        # 90-95% accuracy: STANDARD tier
        return {
            "inference": "local",
            "model_override": "llama3.1:8b",
            "provider_override": "ollama"
        }
    else:  # critical
        # 95-98% accuracy: ADVANCED tier with Chain of Density
        return {
            "inference": "remote",
            "model_override": "gpt-4-turbo",
            "provider_override": "openai",
            "llm_consolidation": True  # Enables Chain of Density
        }

Provider-Specific Configuration

vLLM Configuration

# Custom vLLM base URL
import os
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    provider_override="vllm"
)

Ollama Configuration

# Custom Ollama base URL
import os
os.environ["OLLAMA_BASE_URL"] = "http://localhost:11434"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    provider_override="ollama"
)

API Key Configuration

# Set via environment variables (recommended)
export OPENAI_API_KEY="your-key"
export MISTRAL_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export WATSONX_API_KEY="your-key"
export WATSONX_PROJECT_ID="your-project-id"

Or via .env file:

# .env file
OPENAI_API_KEY=your-key
MISTRAL_API_KEY=your-key
GOOGLE_API_KEY=your-key
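
If Docling Graph does not pick up the .env file automatically in your setup, python-dotenv can load it explicitly before the config is built (a sketch; assumes python-dotenv is installed):

from dotenv import load_dotenv

# Reads .env from the current working directory into os.environ
load_dotenv()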

See: Installation: API Keys


Performance Comparison

Speed Comparison

Document: 10-page invoice PDF

Local (vLLM, GPU):        ~30 seconds
Local (Ollama, GPU):      ~45 seconds
Remote (GPT-3.5):         ~40 seconds
Remote (GPT-4):           ~60 seconds
Remote (Mistral Small):   ~35 seconds

Quality Comparison

Extraction Accuracy (Complex Documents):

GPT-4 Turbo:              97%
GPT-3.5 Turbo:            92%
Mistral Large:            95%
Mistral Small:            90%
Granite 4.0-1B (local):   88%
Llama 3.1-8B (local):     93%

Cost Comparison

Processing 1000 documents (10 pages each):

Local (vLLM):             $0 (GPU amortized)
Local (Ollama):           $0 (GPU amortized)
Remote (GPT-4):           $150-300
Remote (GPT-3.5):         $10-20
Remote (Mistral Small):   $5-15
Remote (Gemini Flash):    $3-10

Troubleshooting

Local Inference Issues

🐛 CUDA Out of Memory

# Solution: Use smaller model
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b",  # Smaller model
    provider_override="vllm"
)

🐛 vLLM Server Not Running

# Check if server is running
curl http://localhost:8000/v1/models

# Start server if needed
uv run python -m vllm.entrypoints.openai.api_server \
    --model ibm-granite/granite-4.0-1b \
    --port 8000

Remote Inference Issues

🐛 API Key Not Found

# Verify API key is set
echo $OPENAI_API_KEY

# Set if missing
export OPENAI_API_KEY="your-key"
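
You can also fail fast from Python before building the config (a small sketch; swap in the variable for your provider):

import os

if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; export it before running the pipeline")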

🐛 Rate Limit Exceeded

# Solution: Add retry logic or switch provider
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="mistral-small-latest",  # Different provider
    provider_override="mistral"
)
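
If rate limits recur, a simple exponential backoff around the pipeline call also helps (a sketch; the exact exception raised on rate limiting is provider-dependent, so this catches broadly):

import time

max_attempts = 5
for attempt in range(max_attempts):
    try:
        run_pipeline(config)
        break
    except Exception:  # assumption: provider rate-limit errors surface here
        if attempt == max_attempts - 1:
            raise
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s between retries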

Best Practices

👍 Start with Remote for Testing

# ✅ Good - Quick setup for testing
config = PipelineConfig(
    source="test.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override="gpt-3.5-turbo"
)

👍 Use Local for Production Volume

# ✅ Good - Cost-effective for high volume
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b"
)

👍 Match Model to Document Complexity

# ✅ Good - Use appropriate model
document_is_complex = True  # e.g., set from your own document triage
if document_is_complex:
    model = "gpt-4-turbo"
else:
    model = "gpt-3.5-turbo"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="remote",
    model_override=model
)

👍 Monitor Costs

# ✅ Good - Track API usage
import logging

logging.basicConfig(level=logging.INFO)

document_count = 250                    # documents queued for this batch
estimated_cost = document_count * 0.15  # rough per-document API cost estimate

logging.info(f"Processing {document_count} documents")
logging.info(f"Estimated cost: ${estimated_cost:.2f}")

run_pipeline(config)

Model Recommendations by Use Case

High-Volume Processing

# Use SIMPLE tier for speed and cost efficiency
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    inference="local",
    model_override="ibm-granite/granite-4.0-1b",  # SIMPLE tier
    provider_override="vllm",
    use_chunking=True,
    llm_consolidation=False  # Fast programmatic merge
)

Benefits:

  • 🔵 Good Accuracy
  • ⚡ Fast Processing

Critical Documents

# Use ADVANCED tier for maximum accuracy
config = PipelineConfig(
    source="contract.pdf",
    template="templates.Contract",
    inference="remote",
    model_override="gpt-4-turbo",  # ADVANCED tier
    provider_override="openai",
    use_chunking=True,
    llm_consolidation=True  # Chain of Density consolidation
)

Benefits:

  • 🟢 High Accuracy
  • 🌀 Multi-turn consolidation

Balanced Approach

# Use STANDARD tier for general documents
config = PipelineConfig(
    source="document.pdf",
    template="templates.Report",
    inference="local",
    model_override="llama3.1:8b",  # STANDARD tier
    provider_override="ollama",
    use_chunking=True,
    llm_consolidation=False  # Standard merge
)

Benefits:

  • 🔵 Good Accuracy
  • ⚖️ Good Balance of Speed and Quality

Next Steps

Now that you understand model configuration:

  1. Model Capabilities → Learn about capability tiers
  2. Processing Modes → Choose a processing strategy
  3. Configuration Examples → See complete scenarios
  4. Extraction Process → Understand the extraction flow
  5. Performance Tuning → Optimize performance