Complete Configuration Examples¶
Overview¶
This guide provides complete, real-world configuration examples for common use cases. Each example includes full configuration, expected outputs, and integration patterns.
In this guide:
- Production-ready configurations
- Common use case patterns
- Integration examples
- Troubleshooting scenarios
- Best practices
Quick Navigation¶
| Use Case | Backend | Inference | Processing |
|---|---|---|---|
| Local Development | LLM | Local | Many-to-one |
| Production API | LLM | Remote | Many-to-one |
| High Accuracy | VLM | Local | One-to-one |
| Batch Processing | LLM | Remote | Many-to-one |
| Rheology Research | LLM | Remote | Many-to-one |
| Invoices | LLM | Local | Many-to-one |
| Forms | VLM | Local | One-to-one |
| Multi-language | LLM | Remote | Many-to-one |
Example 1: Local Development¶
Use Case¶
Fast iteration during template development using local models.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
# Source and template
source="test_document.pdf",
template="templates.BillingDocument",
# Local LLM for fast iteration
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b",
# Fast processing
processing_mode="many-to-one",
docling_config="ocr",
use_chunking=True,
# CSV for easy inspection
export_format="csv",
export_markdown=True,
# Development output
output_dir="dev_outputs"
)
run_pipeline(config)
Prerequisites¶
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull model
ollama pull llama3.1:8b
# Verify
ollama list
Expected Output¶
dev_outputs/
├── nodes.csv # Easy to inspect in Excel
├── edges.csv
├── document.md # Check markdown conversion
├── graph_stats.json # Quick metrics
└── visualization.html # Visual inspection
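During template development it is often faster to spot-check the CSVs programmatically than to open them in Excel each time. A minimal sketch using pandas, assuming the nodes.csv schema includes a node_type column (the same column used in the accounting integration in Example 6):

import pandas as pd

# Load the extracted graph tables
nodes = pd.read_csv("dev_outputs/nodes.csv")
edges = pd.read_csv("dev_outputs/edges.csv")

# Quick sanity checks while iterating on a template
print(f"{len(nodes)} nodes, {len(edges)} edges")
print(nodes["node_type"].value_counts())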
When to Use¶
✅ Use for:
- Template development
- Quick testing
- Debugging extraction
- Offline development
Example 2: Production API¶
Use Case¶
Production deployment using remote API for reliability and scale.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
import os
config = PipelineConfig(
# Source and template
source="production_document.pdf",
template="templates.BillingDocument",
# Remote API for reliability
backend="llm",
inference="remote",
provider_override="mistral",
model_override="mistral-large-latest",
# Robust processing
processing_mode="many-to-one",
docling_config="ocr",
use_chunking=True,
llm_consolidation=True, # Extra accuracy
# Cypher for Neo4j
export_format="cypher",
export_docling=False, # Minimal exports
export_markdown=False,
# Production output
output_dir=f"production/{os.getenv('DOCUMENT_ID')}"
)
# Set API key (prefer loading from a .env file; never hardcode in production)
os.environ["MISTRAL_API_KEY"] = "your_api_key"
run_pipeline(config)
Environment Setup¶
# .env file
MISTRAL_API_KEY=your_api_key_here
DOCUMENT_ID=doc_12345
# Load the .env file at the start of your script:
# from dotenv import load_dotenv; load_dotenv()
Expected Output¶
production/doc_12345/
├── graph.cypher # Ready for Neo4j import
├── graph_data.json # Backup
├── graph_stats.json # Metrics
└── visualization.html # QA check
Integration¶
# Import to Neo4j
import subprocess
cypher_file = f"production/{os.getenv('DOCUMENT_ID')}/graph.cypher"
subprocess.run([
    "cypher-shell",
    "-u", "neo4j",
    "-p", os.getenv("NEO4J_PASSWORD"),
    "-f", cypher_file
], check=True)  # Raise if the import fails
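If cypher-shell is not installed on the host, the same file can be imported with the official neo4j Python driver. A sketch, assuming the export contains one statement per semicolon and no literal semicolons inside string values (adjust the splitting otherwise):

import os
from neo4j import GraphDatabase

uri = "bolt://localhost:7687"  # Assumed local Neo4j instance
driver = GraphDatabase.driver(uri, auth=("neo4j", os.getenv("NEO4J_PASSWORD")))

with open(cypher_file) as f:
    # Naive split on ';' - sufficient for simple exports
    statements = [s.strip() for s in f.read().split(";") if s.strip()]

with driver.session() as session:
    for stmt in statements:
        session.run(stmt)
driver.close()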
When to Use¶
✅ Use for:
- Production deployments
- High reliability needs
- Scalable processing
- API-based workflows
Example 3: High Accuracy Extraction¶
Use Case¶
Maximum accuracy for complex documents using VLM.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
# Source and template
source="complex_document.pdf",
template="templates.ScholarlyRheologyPaper",
# VLM for visual understanding
backend="vlm",
inference="local",
model_override="numind/NuExtract-2.0-8B",
# Page-by-page for accuracy
processing_mode="one-to-one",
docling_config="vision", # Vision pipeline
# No chunking for VLM
use_chunking=False,
# CSV for analysis
export_format="csv",
export_per_page_markdown=True, # Debug per page
output_dir="high_accuracy_outputs"
)
run_pipeline(config)
Prerequisites¶
Running numind/NuExtract-2.0-8B locally typically requires a CUDA-capable GPU with enough VRAM for an 8B-parameter model; the weights are downloaded from Hugging Face on first run.
Expected Output¶
high_accuracy_outputs/
├── nodes.csv
├── edges.csv
├── pages/
│ ├── page_001.md # Per-page markdown
│ ├── page_002.md
│ └── ...
└── visualization.html
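The per-page markdown files make it easy to locate pages where conversion failed. A small sketch that flags suspiciously short pages (the 100-character threshold is an arbitrary assumption; tune it for your documents):

from pathlib import Path

for page in sorted(Path("high_accuracy_outputs/pages").glob("page_*.md")):
    text = page.read_text().strip()
    if len(text) < 100:
        print(f"Check {page.name}: only {len(text)} characters converted")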
When to Use¶
✅ Use for:
- Complex layouts
- Visual elements
- High accuracy needs
- Rheology research papers
Example 4: Batch Processing¶
Use Case¶
Process multiple documents efficiently.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
from pathlib import Path
import logging
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def process_batch(input_dir: str, template: str):
"""Process all PDFs in a directory."""
input_path = Path(input_dir)
results = []
for pdf_file in input_path.glob("*.pdf"):
logger.info(f"Processing {pdf_file.name}")
try:
config = PipelineConfig(
source=str(pdf_file),
template=template,
# Remote for reliability
backend="llm",
inference="remote",
provider_override="mistral",
# Efficient processing
processing_mode="many-to-one",
use_chunking=True,
# Organized outputs
output_dir=f"batch_outputs/{pdf_file.stem}",
export_format="csv"
)
run_pipeline(config)
results.append({"file": pdf_file.name, "status": "success"})
except Exception as e:
logger.error(f"Failed to process {pdf_file.name}: {e}")
results.append({"file": pdf_file.name, "status": "failed", "error": str(e)})
return results
# Process batch
results = process_batch("documents/invoices", "templates.BillingDocument")
# Summary
success_count = sum(1 for r in results if r["status"] == "success")
print(f"Processed {success_count}/{len(results)} documents successfully")
Expected Output¶
batch_outputs/
├── invoice_001/
│ ├── nodes.csv
│ └── edges.csv
├── invoice_002/
│ ├── nodes.csv
│ └── edges.csv
└── ...
When to Use¶
✅ Use for:
- Multiple documents
- Automated workflows
- Scheduled processing
- Data pipelines
Example 5: Rheology Research Papers¶
Use Case¶
Extract structured data from academic papers.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
# Rheology research
source="research_paper.pdf",
template="templates.ScholarlyRheologyPaper",
# Remote LLM for understanding
backend="llm",
inference="remote",
provider_override="openai",
model_override="gpt-4-turbo",
# Full document context
processing_mode="many-to-one",
docling_config="ocr",
use_chunking=True,
llm_consolidation=True, # Merge findings
# CSV for analysis
export_format="csv",
export_markdown=True,
output_dir="research_outputs"
)
run_pipeline(config)
Template Example¶
from pydantic import BaseModel, Field
from typing import List
class Author(BaseModel):
"""Rheology research author."""
name: str
affiliation: str | None = None
email: str | None = None
class Citation(BaseModel):
"""Paper citation."""
title: str
authors: str
year: int | None = None
class ScholarlyRheologyPaper(BaseModel):
    """Scholarly rheology paper extraction template."""
title: str = Field(description="Paper title")
authors: List[Author] = Field(description="Paper authors")
abstract: str = Field(description="Paper abstract")
keywords: List[str] = Field(description="Paper keywords")
methodology: str = Field(description="Research methodology")
findings: List[str] = Field(description="Key findings")
citations: List[Citation] = Field(description="Referenced papers")
When to Use¶
✅ Use for:
- Academic papers
- Literature reviews
- Citation extraction
- Research analysis
Example 6: Billing Document Processing¶
Use Case¶
Extract invoice data for accounting systems.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
    # Billing document
source="invoice.pdf",
template="templates.BillingDocument",
# Local for cost efficiency
backend="llm",
inference="local",
provider_override="ollama",
model_override="llama3.1:8b",
# Fast processing
processing_mode="many-to-one",
docling_config="ocr",
use_chunking=True,
# CSV for accounting software
export_format="csv",
output_dir="invoices/processed"
)
run_pipeline(config)
Template Example¶
from pydantic import BaseModel, Field
from typing import List
from datetime import date
class LineItem(BaseModel):
"""Invoice line item."""
description: str
quantity: float
unit_price: float
total: float
class Address(BaseModel):
"""Address information."""
street: str
city: str
postal_code: str
country: str
class Organization(BaseModel):
"""Organization details."""
name: str
address: Address
tax_id: str | None = None
class BillingDocument(BaseModel):
"""BillingDocument extraction template."""
document_no: str = Field(description="Invoice number")
invoice_date: date = Field(description="Invoice date")
    due_date: date | None = Field(default=None, description="Payment due date")
issued_by: Organization = Field(description="Issuing organization")
sent_to: Organization = Field(description="Recipient organization")
line_items: List[LineItem] = Field(description="Invoice line items")
subtotal: float = Field(description="Subtotal amount")
tax: float = Field(description="Tax amount")
total: float = Field(description="Total amount")
Integration with Accounting¶
import pandas as pd
# Load extracted data
nodes = pd.read_csv("invoices/processed/nodes.csv")
# Filter invoices
invoices = nodes[nodes['node_type'] == 'BillingDocument']
# Export to accounting system
invoices.to_csv("accounting_import.csv", index=False)
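Line items can be exported the same way, assuming nested template classes appear under their own node type (here LineItem, mirroring the class name just as BillingDocument does above):

# Extract line items for the ERP import (assumes a 'LineItem' node type)
line_items = nodes[nodes['node_type'] == 'LineItem']
line_items.to_csv("accounting_line_items.csv", index=False)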
When to Use¶
✅ Use for:
- Invoice processing
- Accounting automation
- Financial data extraction
- ERP integration
Example 7: Form Extraction¶
Use Case¶
Extract data from structured forms using VLM.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
# Form document
source="application_form.pdf",
template="templates.ApplicationForm",
# VLM for form structure
backend="vlm",
inference="local",
# Page-by-page for forms
processing_mode="one-to-one",
docling_config="vision",
use_chunking=False,
# CSV for database import
export_format="csv",
output_dir="forms/processed"
)
run_pipeline(config)
Template Example¶
from pydantic import BaseModel, Field, EmailStr  # EmailStr requires the email-validator package
from datetime import date
class Applicant(BaseModel):
"""Applicant information."""
first_name: str
last_name: str
email: EmailStr
phone: str
date_of_birth: date
class Address(BaseModel):
"""Address information."""
street: str
city: str
state: str
zip_code: str
class ApplicationForm(BaseModel):
"""Application form template."""
application_id: str = Field(description="Application ID")
submission_date: date = Field(description="Submission date")
applicant: Applicant = Field(description="Applicant details")
address: Address = Field(description="Applicant address")
employment_status: str = Field(description="Employment status")
    annual_income: float | None = Field(default=None, description="Annual income")
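Because the template is a plain Pydantic model, extracted records can be re-validated before database import. A sketch using the Applicant model above with placeholder values (model_validate is Pydantic v2; use parse_obj on v1):

from pydantic import ValidationError

raw = {
    "first_name": "Jane",
    "last_name": "Doe",
    "email": "jane.doe@example.com",
    "phone": "555-0100",
    "date_of_birth": "1990-04-02",
}
try:
    applicant = Applicant.model_validate(raw)
    print(f"Valid applicant: {applicant.email}")
except ValidationError as e:
    print(f"Rejected form record: {e}")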
When to Use¶
✅ Use for:
- Application forms
- Registration forms
- Survey responses
- Structured documents
Example 8: Multi-language Documents¶
Use Case¶
Process documents in multiple languages.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
# Multi-language document
source="multilingual_document.pdf",
template="templates.Contract",
# Remote LLM with multi-language support
backend="llm",
inference="remote",
provider_override="openai",
model_override="gpt-4-turbo", # Good multi-language support
# Full document context
processing_mode="many-to-one",
docling_config="ocr",
use_chunking=True,
# CSV export
export_format="csv",
output_dir="multilingual_outputs"
)
run_pipeline(config)
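Template Example¶
This guide does not define templates.Contract elsewhere; a minimal sketch of what such a template might look like (every field name below is an illustrative assumption, not a shipped template):

from pydantic import BaseModel, Field
from typing import List
from datetime import date

class Party(BaseModel):
    """Contracting party."""
    name: str
    country: str | None = None

class Contract(BaseModel):
    """Contract extraction template (illustrative sketch)."""
    title: str = Field(description="Contract title")
    parties: List[Party] = Field(description="Contracting parties")
    effective_date: date | None = Field(default=None, description="Effective date")
    governing_law: str | None = Field(default=None, description="Governing law clause")
    language: str | None = Field(default=None, description="Primary document language")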
When to Use¶
✅ Use for:
- International documents
- Multi-language contracts
- Global operations
- Translation workflows
Advanced Patterns¶
Pattern 1: Error Handling¶
from docling_graph import run_pipeline, PipelineConfig
from docling_graph.exceptions import PipelineError, ExtractionError
import logging
logger = logging.getLogger(__name__)
def safe_process(source: str, template: str) -> bool:
"""Process document with error handling."""
try:
config = PipelineConfig(
source=source,
template=template,
backend="llm",
inference="remote"
)
run_pipeline(config)
logger.info(f"Successfully processed {source}")
return True
except ExtractionError as e:
logger.error(f"Extraction failed for {source}: {e}")
return False
except PipelineError as e:
logger.error(f"Pipeline error for {source}: {e}")
return False
except Exception as e:
logger.error(f"Unexpected error for {source}: {e}")
return False
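Usage over a batch of documents, collecting a simple pass/fail summary:

documents = ["invoice_001.pdf", "invoice_002.pdf"]
outcomes = [safe_process(doc, "templates.BillingDocument") for doc in documents]
print(f"{sum(outcomes)}/{len(outcomes)} documents processed successfully")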
Pattern 2: Configuration Validation¶
from docling_graph import run_pipeline, PipelineConfig
from pydantic import ValidationError
def validate_config(config_dict: dict) -> bool:
"""Validate configuration before running."""
try:
config = PipelineConfig(**config_dict)
print("✅ Configuration valid")
return True
except ValidationError as e:
print(f"❌ Configuration invalid:")
for error in e.errors():
print(f" - {error['loc']}: {error['msg']}")
return False
# Test configuration
config_dict = {
"source": "document.pdf",
"template": "templates.BillingDocument",
"backend": "llm",
"inference": "remote"
}
if validate_config(config_dict):
config = PipelineConfig(**config_dict)
run_pipeline(config)
Pattern 3: Dynamic Configuration¶
from docling_graph import run_pipeline, PipelineConfig
import os
def get_config_for_document(doc_path: str) -> PipelineConfig:
"""Generate configuration based on document type."""
# Determine document type
if "invoice" in doc_path.lower():
template = "templates.BillingDocument"
processing_mode = "many-to-one"
elif "form" in doc_path.lower():
template = "templates.Form"
processing_mode = "one-to-one"
else:
template = "templates.Generic"
processing_mode = "many-to-one"
# Choose backend based on environment
if os.getenv("USE_GPU") == "true":
backend = "vlm"
inference = "local"
else:
backend = "llm"
inference = "remote"
return PipelineConfig(
source=doc_path,
template=template,
backend=backend,
inference=inference,
processing_mode=processing_mode
)
# Use dynamic configuration
config = get_config_for_document("documents/invoice_001.pdf")
run_pipeline(config)
Best Practices Summary¶
👍 Choose the Right Backend¶
# ✅ Good - Match backend to use case
if document_has_complex_layout:
backend = "vlm"
elif need_fast_iteration:
backend = "llm"
inference = "local"
else:
backend = "llm"
inference = "remote"
👍 Organize Outputs¶
# ✅ Good - Structured output directories
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_dir = f"outputs/{document_type}/{timestamp}"
👍 Handle Errors Gracefully¶
# ✅ Good - Comprehensive error handling
from docling_graph.exceptions import ExtractionError, PipelineError

try:
    run_pipeline(config)
except ExtractionError:
# Handle extraction failures
pass
except PipelineError:
# Handle pipeline failures
pass
👍 Validate Before Running¶
# ✅ Good - Validate configuration
import sys
from pydantic import ValidationError

try:
    config = PipelineConfig(**config_dict)
except ValidationError as e:
    print(f"Invalid configuration: {e}")
    sys.exit(1)
Troubleshooting¶
🐛 Configuration Validation Fails¶
Solution:
from pydantic import ValidationError
try:
config = PipelineConfig(**config_dict)
except ValidationError as e:
print("Configuration errors:")
for error in e.errors():
print(f" {error['loc']}: {error['msg']}")
🐛 Extraction Produces No Results¶
Solution:
# Check extraction output
import json
with open("outputs/graph_stats.json") as f:
stats = json.load(f)
if stats["node_count"] == 0:
print("No nodes extracted - check template and document")
🐛 Out of Memory¶
Solution:
# Use chunking and smaller batch sizes
config = PipelineConfig(
source="large_document.pdf",
template="templates.BillingDocument",
use_chunking=True, # Enable chunking
max_batch_size=1, # Smaller batches
backend="llm",
inference="remote" # Use remote to save memory
)
Next Steps¶
Now that you understand complete configurations:
- Extraction Process - Learn how extraction works
- Graph Management - Work with extracted graphs
- CLI Guide - Use the command-line interface