Input Formats¶
Docling Graph supports multiple input formats, allowing you to process various types of documents and data sources through the same pipeline.
Input Normalization Process¶
The pipeline automatically detects and validates input types, routing them through the appropriate processing stages:
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
Start@{ shape: terminal, label: "Input Source" }
Detect@{ shape: procs, label: "Input Type Detection" }
%% Validators
ValPDF@{ shape: lin-proc, label: "Validate PDF" }
ValImg@{ shape: lin-proc, label: "Validate Image" }
ValText@{ shape: lin-proc, label: "Validate Text" }
ValMD@{ shape: lin-proc, label: "Validate MD" }
ValDoc@{ shape: lin-proc, label: "Validate Docling" }
%% URL Specifics
ValURL@{ shape: lin-proc, label: "Validate & Download URL" }
CheckDL{"Type?"}
%% Handlers
HandVisual@{ shape: tag-proc, label: "Visual Handler" }
HandText@{ shape: tag-proc, label: "Text Handler" }
HandDoc@{ shape: tag-proc, label: "Object Handler" }
%% Outcomes
SetFlags@{ shape: procs, label: "Set Processing Flags" }
Output@{ shape: doc, label: "Normalized Context" }
%% 3. Define Connections
Start --> Detect
%% Input Detection Routing
Detect -- PDF --> ValPDF
Detect -- Image --> ValImg
Detect -- Text --> ValText
Detect -- MD --> ValMD
Detect -- Docling --> ValDoc
Detect -- URL --> ValURL
%% URL Routing (Feeds back into validators)
ValURL --> CheckDL
CheckDL -- PDF --> ValPDF
CheckDL -- Image --> ValImg
CheckDL -- Text --> ValText
CheckDL -- MD --> ValMD
%% Validation to Handlers (The "Happy Path")
ValPDF & ValImg --> HandVisual
ValText & ValMD --> HandText
ValDoc --> HandDoc
%% Converge Handlers to Output
HandVisual & HandText & HandDoc --> SetFlags --> Output
%% 4. Apply Classes
class Start input
class Detect,SetFlags process
class ValPDF,ValImg,ValText,ValMD,ValURL,ValDoc process
class HandVisual,HandText,HandDoc operator
class CheckDL decision
class Output output
Key Features:
- Automatic Type Detection: Identifies input format from file extension, URL, or content
- Validation: Ensures input meets requirements (non-empty, correct format, etc.)
- Smart Routing: Skips unnecessary stages based on input type
  - Text/Markdown inputs skip OCR
  - DoclingDocument inputs skip extraction and go directly to graph conversion
  - URLs are downloaded and processed based on their content type
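The routing rules above can be pictured as a lookup from input kind to the stages that run. The sketch below is illustrative only: `ProcessingFlags`, `FLAGS_BY_INPUT`, and `flags_for` are hypothetical names, not part of the Docling Graph API, and the real pipeline may track more flags than these two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingFlags:
    """Hypothetical flags describing which pipeline stages run."""
    run_ocr: bool         # OCR / visual conversion stage
    run_extraction: bool  # LLM/VLM extraction stage


# Hypothetical mapping that mirrors the Smart Routing rules listed above.
FLAGS_BY_INPUT = {
    "pdf": ProcessingFlags(run_ocr=True, run_extraction=True),
    "image": ProcessingFlags(run_ocr=True, run_extraction=True),
    "text": ProcessingFlags(run_ocr=False, run_extraction=True),
    "markdown": ProcessingFlags(run_ocr=False, run_extraction=True),
    "docling_document": ProcessingFlags(run_ocr=False, run_extraction=False),
}


def flags_for(input_kind: str) -> ProcessingFlags:
    """Return the stage flags for an input kind, or raise for unknown kinds."""
    try:
        return FLAGS_BY_INPUT[input_kind]
    except KeyError:
        raise ValueError(f"Unsupported input kind: {input_kind!r}") from None
```

Keeping the rules in one table makes it easy to see at a glance which stages each input type skips.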
Supported Input Formats¶
1. PDF Documents¶
Description: Standard PDF files with text, images, and complex layouts.
File Extensions: .pdf
Processing: Full pipeline with OCR/VLM, segmentation, and extraction.
CLI Example:
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="document.pdf",
template="templates.billing_document.BillingDocument",
backend="llm",
inference="local",
processing_mode="many-to-one",
docling_config="ocr",
output_dir="outputs",
export_format="csv"
)
run_pipeline(config)
2. Image Files¶
Description: Image files containing document content (scanned documents, photos of documents, etc.).
File Extensions: .png, .jpg, .jpeg
Processing: Full pipeline with OCR/VLM, segmentation, and extraction.
CLI Example:
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="scanned_invoice.jpg",
template="templates.billing_document.BillingDocument",
backend="vlm", # VLM works well with images
inference="local",
processing_mode="one-to-one",
docling_config="vision",
output_dir="outputs",
export_format="json"
)
run_pipeline(config)
3. Plain Text Files¶
Description: Simple text files containing unstructured content.
File Extensions: .txt
Processing: Skips OCR and visual processing. Goes directly to LLM extraction.
Requirements:
- Must use LLM backend (VLM requires visual content)
- File must not be empty or contain only whitespace
CLI Example:
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="meeting_notes.txt",
template="templates.report.Report",
backend="llm", # Required for text inputs
inference="remote",
processing_mode="many-to-one",
docling_config="ocr", # Ignored for text inputs
output_dir="outputs",
export_format="csv"
)
run_pipeline(config)
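The non-empty requirement for text files amounts to a check along the following lines. This is a minimal sketch using only the standard library; `validate_text_file` is a hypothetical helper, not the validator Docling Graph ships, and it assumes UTF-8 encoded input.

```python
from pathlib import Path


def validate_text_file(path: str) -> str:
    """Return the file's text, or raise if it is missing or blank."""
    source = Path(path)
    if not source.is_file():
        raise FileNotFoundError(f"Input file not found: {source}")
    text = source.read_text(encoding="utf-8")  # assumption: UTF-8 input
    # A file containing only spaces, tabs, or newlines counts as empty.
    if not text.strip():
        raise ValueError(f"Input file is empty or whitespace-only: {source}")
    return text
```

Running a check like this before building the configuration surfaces empty inputs early, before any model is loaded.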
4. Markdown Files¶
Description: Markdown-formatted text files with structure and formatting.
File Extensions: .md
Processing: Skips OCR and visual processing. Markdown structure is preserved during extraction.
Requirements:
- Must use LLM backend
- File must not be empty
CLI Example:
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="documentation.md",
template="templates.documentation.Documentation",
backend="llm",
inference="local",
processing_mode="many-to-one",
output_dir="outputs",
export_format="json"
)
run_pipeline(config)
5. URLs¶
Description: Download and process documents from URLs.
Format: http:// or https:// URLs
Processing:
1. Downloads content to temporary location
2. Detects content type (PDF, image, text, markdown)
3. Routes to appropriate processing pipeline
Supported URL Content Types:
- PDF documents
- Image files (PNG, JPG, JPEG)
- Plain text files
- Markdown files
Requirements:
- Valid HTTP/HTTPS URL
- Accessible without authentication (for now)
- File size under limit (default: 100MB)
CLI Example:
# PDF from URL
docling-graph convert https://example.com/invoice.pdf -t templates.billing_document.BillingDocument
# Image from URL
docling-graph convert https://example.com/scan.jpg -t templates.form.Form
# Text from URL
docling-graph convert https://example.com/notes.txt -t templates.report.Report --backend llm
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="https://example.com/document.pdf",
template="templates.billing_document.BillingDocument",
backend="llm",
inference="remote",
processing_mode="many-to-one",
output_dir="outputs",
export_format="csv"
)
run_pipeline(config)
URL Configuration:
from docling_graph.core.input.handlers import URLInputHandler
# Custom timeout and size limit
handler = URLInputHandler(
timeout=30, # seconds
max_size_mb=50 # megabytes
)
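The URL requirements and content-type routing described in this section can be approximated with the standard library. The sketch below is not how `URLInputHandler` is implemented; `check_url`, `within_size_limit`, and `classify_download` are hypothetical helpers, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
import posixpath
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup tables for the supported URL content types.
KIND_BY_MIME = {
    "application/pdf": "pdf",
    "image/png": "image",
    "image/jpeg": "image",
    "text/plain": "text",
    "text/markdown": "markdown",
}
KIND_BY_EXT = {
    ".pdf": "pdf", ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".txt": "text", ".md": "markdown",
}


def check_url(url: str) -> None:
    """Reject URLs that are not plain http(s) with a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must use http or https scheme")
    if not parsed.netloc:
        raise ValueError("URL has no host")


def within_size_limit(num_bytes: int, max_size_mb: int = 100) -> bool:
    """True if the payload fits the limit (assumes 1 MB = 1024 * 1024 bytes)."""
    return num_bytes <= max_size_mb * 1024 * 1024


def classify_download(url: str, content_type: Optional[str] = None) -> str:
    """Pick a content kind from the Content-Type header, else the URL extension."""
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in KIND_BY_MIME:
            return KIND_BY_MIME[mime]
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in KIND_BY_EXT:
        return KIND_BY_EXT[ext]
    raise ValueError(f"Unsupported URL content type: {url}")
```

Checking the header first and falling back to the extension handles download links whose path has no file suffix.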
6. Plain Text Strings (Python API Only)¶
Description: Raw text strings passed directly to the pipeline.
Format: Python string
Processing: Skips OCR and visual processing. Direct LLM extraction.
Requirements:
- Only available via Python API (not CLI)
- Must use LLM backend
- String must not be empty or whitespace-only
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
# Direct text input
text_content = """
Invoice #12345
Date: 2024-01-15
Amount: $1,234.56
Customer: Acme Corp
"""
config = PipelineConfig(
source=text_content, # Pass string directly
template="templates.billing_document.BillingDocument",
backend="llm",
inference="local",
processing_mode="many-to-one",
output_dir="outputs",
export_format="json"
)
run_pipeline(config, mode="api") # mode="api" required
CLI Input Restriction
CLI does not accept plain text strings to avoid ambiguity with file paths.
7. DoclingDocument JSON (Advanced)¶
Description: Pre-processed DoclingDocument JSON files.
File Extensions: .json (with DoclingDocument schema)
Processing: Skips document conversion. Uses pre-existing structure.
Use Cases:
- Reprocessing previously converted documents
- Custom document preprocessing pipelines
- Integration with external Docling workflows
Requirements:
- Valid DoclingDocument JSON schema
- Must include schema_name: "DoclingDocument"
- Must include version field
CLI Example:
Python API Example:
from docling_graph import PipelineConfig, run_pipeline
config = PipelineConfig(
source="preprocessed.json",
template="templates.custom.Custom",
backend="llm",
inference="local",
processing_mode="many-to-one",
output_dir="outputs",
export_format="csv"
)
run_pipeline(config)
DoclingDocument JSON Structure:
{
"schema_name": "DoclingDocument",
"version": "1.0.0",
"name": "document_name",
"pages": {
"0": {
"page_no": 0,
"size": {"width": 612, "height": 792}
}
},
"body": {
"self_ref": "#/body",
"children": []
},
"furniture": {}
}
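Before passing a `.json` file to the pipeline, the two required fields can be checked with a few lines of standard-library code. This is a rough pre-check under the requirements stated above, not full schema validation; `looks_like_docling_document` is a hypothetical helper name.

```python
import json


def looks_like_docling_document(raw: str) -> bool:
    """Return True if the JSON text carries the required DoclingDocument markers."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and data.get("schema_name") == "DoclingDocument"
        and "version" in data
    )
```

A file that passes this check may still fail full validation, so treat it as a quick filter for obviously wrong inputs.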
Input Format Detection¶
The pipeline automatically detects input format based on:
- File Extension: .pdf, .png, .jpg, .txt, .md, .json
- URL Scheme: http:// or https://
- Content Analysis: For JSON files, checks for DoclingDocument schema
- Fallback: Plain text for unrecognized formats (API mode only)
Detection Examples:
from docling_graph.core.input.types import InputTypeDetector
# Detect from file path
input_type = InputTypeDetector.detect("document.pdf", mode="cli")
# Returns: InputType.PDF
# Detect from URL
input_type = InputTypeDetector.detect("https://example.com/file.txt", mode="cli")
# Returns: InputType.URL
# Detect from text (API mode only)
input_type = InputTypeDetector.detect("Plain text content", mode="api")
# Returns: InputType.TEXT
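To make the detection order concrete, here is a simplified stand-in for the logic, written with the standard library only. It is not the `InputTypeDetector` implementation: `detect_kind` and the string labels it returns are hypothetical, and the content check for JSON files is omitted.

```python
import os
from urllib.parse import urlparse

# Hypothetical extension table; ".json" would still need a schema check.
EXTENSION_KINDS = {
    ".pdf": "pdf", ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".txt": "text", ".md": "markdown", ".json": "docling_document",
}


def detect_kind(source: str, mode: str = "cli") -> str:
    """Classify a source string by URL scheme, then extension, then fallback."""
    # 1. URL scheme takes priority over the file extension in the URL path.
    if urlparse(source).scheme.lower() in ("http", "https"):
        return "url"
    # 2. Known file extensions.
    ext = os.path.splitext(source)[1].lower()
    if ext in EXTENSION_KINDS:
        return EXTENSION_KINDS[ext]
    # 3. Fallback: raw text is accepted in API mode only.
    if mode == "api":
        return "text"
    raise ValueError("Plain text input is only supported via Python API")
```

Because the URL check runs first, `https://example.com/file.txt` is classified as a URL rather than a text file, matching the detection example above.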
Processing Pipeline by Input Type¶
Visual Documents (PDF, Images)¶
Text Documents (.txt, .md, plain text)¶
URLs¶
DoclingDocument JSON¶
Backend Compatibility¶
| Input Format | LLM Backend | VLM Backend |
|---|---|---|
| PDF | ✅ Yes | ✅ Yes |
| Images | ✅ Yes | ✅ Yes |
| Text Files | ✅ Yes | ❌ No |
| Markdown | ✅ Yes | ❌ No |
| URLs (PDF/Image) | ✅ Yes | ✅ Yes |
| URLs (Text/MD) | ✅ Yes | ❌ No |
| Plain Text | ✅ Yes | ❌ No |
| DoclingDocument | ✅ Yes | ✅ Yes |
Backend Requirements
VLM (Vision Language Model) backend requires visual content. Use LLM backend for text-only inputs.
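The compatibility table can also be encoded as data and checked before a run. The following is a sketch under the table above; `SUPPORTED_BACKENDS` and `check_backend` are hypothetical names, and the error wording is illustrative rather than the message the pipeline emits.

```python
# Hypothetical encoding of the backend compatibility table.
SUPPORTED_BACKENDS = {
    "pdf": {"llm", "vlm"},
    "image": {"llm", "vlm"},
    "text": {"llm"},
    "markdown": {"llm"},
    "docling_document": {"llm", "vlm"},
}


def check_backend(input_kind: str, backend: str) -> None:
    """Raise ValueError if the backend cannot process the given input kind."""
    if input_kind not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unknown input kind: {input_kind!r}")
    if backend not in SUPPORTED_BACKENDS[input_kind]:
        allowed = ", ".join(sorted(SUPPORTED_BACKENDS[input_kind]))
        raise ValueError(
            f"{backend!r} backend does not support {input_kind} inputs; "
            f"use one of: {allowed}"
        )
```

For example, `check_backend("text", "vlm")` raises, while `check_backend("pdf", "vlm")` returns without error.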
Error Handling¶
Empty Files¶
Unsupported Backend¶
$ docling-graph convert notes.txt -t templates.report.Report --backend vlm
Error: VLM backend does not support text-only inputs. Use LLM backend instead.
Invalid URL¶
$ docling-graph convert ftp://example.com/file.pdf -t templates.billing_document.BillingDocument
Error: URL must use http or https scheme
File Not Found¶
Best Practices¶
👍 Choose the Right Backend¶
- PDFs and Images: Use VLM for complex layouts, LLM for text-heavy documents
- Text Files: Always use LLM backend
- Mixed Workflows: Use LLM backend for maximum compatibility
👍 Validate Input Files¶
from pathlib import Path
source_path = Path("document.txt")
if not source_path.exists():
    raise FileNotFoundError(f"Input file not found: {source_path}")
if source_path.stat().st_size == 0:
    raise ValueError("Input file is empty")
👍 Handle URLs Safely¶
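from docling_graph.core.input.validators import URLValidator
# Assumed import location for ValidationError; adjust if your version differs
from docling_graph.exceptions import ValidationError
url = "https://example.com/document.pdf"  # example URL
validator = URLValidator()
try:
    validator.validate(url)
except ValidationError as e:
    print(f"Invalid URL: {e.message}")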
👍 Use Appropriate Processing Modes¶
- one-to-one: Best for multi-page PDFs where each page is independent
- many-to-one: Best for text files and single-entity documents
Troubleshooting¶
🐛 Plain text input is only supported via Python API¶
Cause: A plain text string was passed via the CLI
Solution: Use the Python API, or save the text to a .txt file first
from pathlib import Path
# `config` and `text_content` are defined as in the Plain Text Strings example
# Option 1: Use Python API
run_pipeline(config, mode="api")
# Option 2: Save to file
Path("temp.txt").write_text(text_content)
config.source = "temp.txt"
run_pipeline(config, mode="cli")
🐛 VLM backend does not support text-only inputs¶
Cause: Using VLM backend with text files
Solution: Switch to LLM backend
🐛 URL download timeout¶
Cause: Slow network or large file
Solution: Increase timeout or download manually
from docling_graph.core.input.handlers import URLInputHandler
url = "https://example.com/large-document.pdf"  # example URL
handler = URLInputHandler(timeout=60) # 60 seconds
temp_path = handler.load(url)
Next Steps¶
- Backend Selection - Choose the right backend for your input
- Processing Modes - Understand one-to-one vs many-to-one
- Configuration Examples - See complete configuration examples