Document Conversion¶
Overview¶
Document Conversion is the first stage of the extraction pipeline, transforming raw PDFs and images into structured DoclingDocument format. This stage uses the Docling library to perform OCR, layout analysis, and content extraction.
In this guide: - OCR vs Vision pipelines - Layout analysis - Table extraction - Multi-language support - Performance optimization
Docling Pipelines¶
Quick Comparison¶
| Pipeline | Best For | Speed | Accuracy | GPU Required |
|---|---|---|---|---|
| OCR | Standard documents | Fast | High | No |
| Vision | Complex layouts | Slower | Very High | Recommended |
OCR Pipeline (Default)¶
What is OCR Pipeline?¶
The OCR (Optical Character Recognition) pipeline is the default and most accurate for standard documents. It uses traditional OCR engines combined with layout analysis.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
docling_config="ocr" # Default
)
Features¶
✅ Strengths: - Fast processing - High accuracy for text - Excellent table extraction - Multi-language support - No GPU required
❌ Limitations: - May struggle with complex layouts - Less effective for handwriting - Requires clear text
When to Use OCR¶
Use OCR pipeline for: - Standard business documents - Invoices and forms - Reports and contracts - Documents with clear text - Batch processing (faster)
Vision Pipeline¶
What is Vision Pipeline?¶
The Vision pipeline uses Vision-Language Models (VLMs) to understand document layout and content visually, similar to how humans read documents.
Configuration¶
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
docling_config="vision" # Vision pipeline
)
Features¶
✅ Strengths: - Excellent for complex layouts - Handles handwriting better - Understands visual context - Better for images - Robust to noise
❌ Limitations: - Slower processing - Requires more memory - GPU recommended - Higher resource usage
When to Use Vision¶
Use Vision pipeline for: - Complex layouts (magazines, brochures) - Handwritten documents - Low-quality scans - Documents with images - Visual-heavy content
Document Processor¶
Basic Usage¶
from docling_graph.core.extractors import DocumentProcessor
# Initialize with OCR pipeline
processor = DocumentProcessor(docling_config="ocr")
# Convert document
document = processor.convert_to_docling_doc("document.pdf")
print(f"Converted {document.num_pages()} pages")
With Vision Pipeline¶
# Initialize with Vision pipeline
processor = DocumentProcessor(docling_config="vision")
# Convert document
document = processor.convert_to_docling_doc("complex_document.pdf")
DoclingDocument Structure¶
What is DoclingDocument?¶
A DoclingDocument is a structured representation of your document containing: - Page information - Layout elements - Text content - Tables - Images - Metadata
Accessing Document Data¶
# Get number of pages
num_pages = document.num_pages()
# Get page keys
page_keys = sorted(document.pages.keys())
# Access specific page
page = document.pages[page_keys[0]]
# Get document metadata
metadata = document.metadata
Markdown Extraction¶
Full Document Markdown¶
# Extract complete document as markdown
full_markdown = processor.extract_full_markdown(document)
print(f"Document length: {len(full_markdown)} characters")
Output example:
# Invoice
**Invoice Number:** INV-001
**Date:** 2024-01-15
## Items
| Description | Quantity | Price |
|-------------|----------|-------|
| Product A | 2 | $50 |
| Product B | 1 | $100 |
**Total:** $200
Per-Page Markdown¶
# Extract markdown for each page
page_markdowns = processor.extract_page_markdowns(document)
for i, page_md in enumerate(page_markdowns, 1):
print(f"Page {i}: {len(page_md)} characters")
Layout Analysis¶
What is Layout Analysis?¶
Layout analysis identifies document structure: - Headers and footers - Sections and paragraphs - Tables and lists - Images and figures - Captions and footnotes
Accessing Layout Information¶
# Get document structure
for page_no, page in document.pages.items():
print(f"Page {page_no}:")
# Access layout elements
for element in page.elements:
print(f" - {element.type}: {element.text[:50]}")
Table Extraction¶
Automatic Table Detection¶
The OCR pipeline automatically detects and extracts tables:
# Tables are preserved in markdown
markdown = processor.extract_full_markdown(document)
# Tables appear as markdown tables
# | Column 1 | Column 2 |
# |----------|----------|
# | Value 1 | Value 2 |
Table Structure¶
Tables are extracted with: - Column headers - Row data - Cell alignment - Merged cells (when possible)
Multi-Language Support¶
Supported Languages¶
The OCR pipeline supports multiple languages:
from docling_graph.core.extractors import DocumentProcessor
# Default: English and French
processor = DocumentProcessor(docling_config="ocr")
# Document will be processed with both languages
document = processor.convert_to_docling_doc("multilingual.pdf")
Language Configuration¶
Currently configured for: - English (en) - French (fr)
Language configuration
Language configuration is set in the DocumentProcessor initialization and can be extended by modifying the source code.
Performance Optimization¶
OCR Pipeline Optimization¶
from docling.datamodel.accelerator_options import AcceleratorDevice
# The OCR pipeline is pre-configured with:
# - 4 threads for parallel processing
# - Auto device selection (GPU if available)
# - Optimized table structure matching
Vision Pipeline Optimization¶
# Vision pipeline automatically uses:
# - GPU acceleration (if available)
# - Optimized batch processing
# - Memory-efficient processing
Complete Examples¶
📍 Basic OCR Conversion¶
from docling_graph.core.extractors import DocumentProcessor
# Initialize processor
processor = DocumentProcessor(docling_config="ocr")
# Convert document
document = processor.convert_to_docling_doc("invoice.pdf")
# Extract markdown
markdown = processor.extract_full_markdown(document)
print(f"Converted {document.num_pages()} pages")
print(f"Markdown length: {len(markdown)} characters")
📍 Vision Pipeline for Complex Layout¶
from docling_graph.core.extractors import DocumentProcessor
# Initialize with vision pipeline
processor = DocumentProcessor(docling_config="vision")
# Convert complex document
document = processor.convert_to_docling_doc("magazine.pdf")
# Extract per-page markdown
pages = processor.extract_page_markdowns(document)
for i, page in enumerate(pages, 1):
print(f"Page {i}: {len(page)} characters")
📍 Batch Processing¶
from docling_graph.core.extractors import DocumentProcessor
from pathlib import Path
# Initialize processor once
processor = DocumentProcessor(docling_config="ocr")
# Process multiple documents
for pdf_file in Path("documents").glob("*.pdf"):
print(f"Processing {pdf_file.name}")
document = processor.convert_to_docling_doc(str(pdf_file))
markdown = processor.extract_full_markdown(document)
# Save markdown
output_file = pdf_file.with_suffix(".md")
output_file.write_text(markdown)
# Cleanup resources
processor.cleanup()
Integration with Pipeline¶
Automatic Conversion¶
When using PipelineConfig, conversion happens automatically:
from docling_graph import run_pipeline, PipelineConfig
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
docling_config="ocr" # Conversion happens automatically
)
run_pipeline(config)
Manual Conversion¶
For more control, use DocumentProcessor directly:
from docling_graph.core.extractors import DocumentProcessor
# Manual conversion
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("document.pdf")
# Now use document for extraction
# ... extraction code ...
# Cleanup
processor.cleanup()
Export Options¶
Export Docling JSON¶
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_docling_json=True # Export DoclingDocument as JSON
)
Output: outputs/document.json
Export Markdown¶
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_markdown=True # Export as markdown
)
Output: outputs/document.md
Export Per-Page Markdown¶
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_per_page_markdown=True # Export each page
)
Output: outputs/pages/page_001.md, page_002.md, etc.
Advanced Features¶
Custom Pipeline Options¶
For advanced use cases, you can customize the pipeline:
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.datamodel.accelerator_options import AcceleratorOptions, AcceleratorDevice
# Create custom options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.ocr_options.lang = ["en", "de", "fr"] # Multiple languages
# Set accelerator options
pipeline_options.accelerator_options = AcceleratorOptions(
num_threads=8, # More threads
device=AcceleratorDevice.CUDA # Force GPU
)
# Note: This requires modifying DocumentProcessor source code
Troubleshooting¶
🐛 Conversion Fails¶
Solution:
try:
document = processor.convert_to_docling_doc("document.pdf")
except Exception as e:
print(f"Conversion failed: {e}")
# Check if file exists
# Check if file is valid PDF
# Try with different pipeline
🐛 Poor OCR Quality¶
Solution:
# Try Vision pipeline instead
processor = DocumentProcessor(docling_config="vision")
document = processor.convert_to_docling_doc("document.pdf")
🐛 Slow Conversion¶
Solution:
# Use OCR pipeline (faster)
processor = DocumentProcessor(docling_config="ocr")
# Or process in batches
# ... batch processing code ...
🐛 Out of Memory¶
Solution:
# Process pages individually
processor = DocumentProcessor(docling_config="ocr")
document = processor.convert_to_docling_doc("large_doc.pdf")
# Extract per-page to reduce memory
pages = processor.extract_page_markdowns(document)
# Process each page separately
for page in pages:
# ... process page ...
pass
# Cleanup
processor.cleanup()
Best Practices¶
👍 Choose the Right Pipeline¶
# ✅ Good - Match pipeline to document type
if document_is_standard:
docling_config = "ocr" # Faster
else:
docling_config = "vision" # More accurate
👍 Cleanup Resources¶
# ✅ Good - Always cleanup
processor = DocumentProcessor(docling_config="ocr")
try:
document = processor.convert_to_docling_doc("document.pdf")
# ... process document ...
finally:
processor.cleanup()
👍 Reuse Processor for Batch Processing¶
# ✅ Good - Reuse processor
processor = DocumentProcessor(docling_config="ocr")
for pdf_file in pdf_files:
document = processor.convert_to_docling_doc(pdf_file)
# ... process ...
processor.cleanup()
👍 Export for Debugging¶
# ✅ Good - Export markdown for inspection
config = PipelineConfig(
source="document.pdf",
template="templates.BillingDocument",
export_markdown=True, # Check conversion quality
export_per_page_markdown=True # Debug per page
)
Next Steps¶
Now that you understand document conversion:
- Chunking Strategies → - Learn intelligent document splitting
- Extraction Backends → - Choose LLM or VLM backend
- Model Merging → - Consolidate extractions