Input Formats¶

Docling Graph supports multiple input formats, allowing you to process various types of documents and data sources through the same pipeline.

Input Normalization Process¶

The pipeline automatically detects and validates input types, routing them through the appropriate processing stages:

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    Start@{ shape: terminal, label: "Input Source" }
    Detect@{ shape: procs, label: "Input Type Detection" }

    %% Validators
    ValPDF@{ shape: lin-proc, label: "Validate PDF" }
    ValImg@{ shape: lin-proc, label: "Validate Image" }
    ValText@{ shape: lin-proc, label: "Validate Text" }
    ValMD@{ shape: lin-proc, label: "Validate MD" }
    ValDoc@{ shape: lin-proc, label: "Validate Docling" }

    %% URL Specifics
    ValURL@{ shape: lin-proc, label: "Validate & Download URL" }
    CheckDL{"Type?"}

    %% Handlers
    HandVisual@{ shape: tag-proc, label: "Visual Handler" }
    HandText@{ shape: tag-proc, label: "Text Handler" }
    HandDoc@{ shape: tag-proc, label: "Object Handler" }

    %% Outcomes
    SetFlags@{ shape: procs, label: "Set Processing Flags" }
    Output@{ shape: doc, label: "Normalized Context" }

    %% 3. Define Connections
    Start --> Detect

    %% Input Detection Routing
    Detect -- PDF --> ValPDF
    Detect -- Image --> ValImg
    Detect -- Text --> ValText
    Detect -- MD --> ValMD
    Detect -- Docling --> ValDoc
    Detect -- URL --> ValURL

    %% URL Routing (Feeds back into validators)
    ValURL --> CheckDL
    CheckDL -- PDF --> ValPDF
    CheckDL -- Image --> ValImg
    CheckDL -- Text --> ValText
    CheckDL -- MD --> ValMD

    %% Validation to Handlers (The "Happy Path")
    ValPDF & ValImg --> HandVisual
    ValText & ValMD --> HandText
    ValDoc --> HandDoc

    %% Converge Handlers to Output
    HandVisual & HandText & HandDoc --> SetFlags --> Output

    %% 4. Apply Classes
    class Start input
    class Detect,SetFlags process
    class ValPDF,ValImg,ValText,ValMD,ValURL,ValDoc process
    class HandVisual,HandText,HandDoc operator
    class CheckDL decision
    class Output output

Key Features:

- Automatic Type Detection: Identifies input format from file extension, URL, or content
- Validation: Ensures input meets requirements (non-empty, correct format, etc.)
- Smart Routing: Skips unnecessary stages based on input type
    - Text/Markdown inputs skip OCR
    - DoclingDocument inputs skip extraction and go directly to graph conversion
    - URLs are downloaded and processed based on their content type
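The routing rules above can be pictured as a small lookup from input type to processing flags. The sketch below is illustrative only: `InputKind`, `ProcessingFlags`, and `flags_for` are hypothetical names, not part of the Docling Graph API, and URLs are left out because they resolve to one of the other types after download.

```python
from dataclasses import dataclass
from enum import Enum


class InputKind(Enum):
    """Hypothetical input categories mirroring the diagram above."""
    PDF = "pdf"
    IMAGE = "image"
    TEXT = "text"
    MARKDOWN = "markdown"
    DOCLING_DOCUMENT = "docling_document"


@dataclass(frozen=True)
class ProcessingFlags:
    """Hypothetical flags telling later stages what they may skip."""
    skip_ocr: bool
    skip_extraction: bool


def flags_for(kind: InputKind) -> ProcessingFlags:
    """Map an input kind to the stages it can skip, per the list above."""
    if kind is InputKind.DOCLING_DOCUMENT:
        # Already converted: goes directly to graph conversion.
        return ProcessingFlags(skip_ocr=True, skip_extraction=True)
    if kind in (InputKind.TEXT, InputKind.MARKDOWN):
        # Text-only inputs have nothing to OCR.
        return ProcessingFlags(skip_ocr=True, skip_extraction=False)
    # PDFs and images run the full pipeline.
    return ProcessingFlags(skip_ocr=False, skip_extraction=False)
```

For example, `flags_for(InputKind.MARKDOWN)` yields flags with `skip_ocr=True` and `skip_extraction=False`.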

Supported Input Formats¶

1. PDF Documents¶

Description: Standard PDF files with text, images, and complex layouts.

File Extensions: .pdf

Processing: Full pipeline with OCR/VLM, segmentation, and extraction.

CLI Example:

docling-graph convert document.pdf -t templates.billing_document.BillingDocument

Python API Example:

from docling_graph import PipelineConfig, run_pipeline

config = PipelineConfig(
    source="document.pdf",
    template="templates.billing_document.BillingDocument",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    docling_config="ocr",
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

2. Image Files¶

Description: Image files containing document content (scanned documents, photos of documents, etc.).

File Extensions: .png, .jpg, .jpeg

Processing: Full pipeline with OCR/VLM, segmentation, and extraction.

CLI Example:

docling-graph convert scanned_invoice.png -t templates.billing_document.BillingDocument

Python API Example:

config = PipelineConfig(
    source="scanned_invoice.jpg",
    template="templates.billing_document.BillingDocument",
    backend="vlm",  # VLM works well with images
    inference="local",
    processing_mode="one-to-one",
    docling_config="vision",
    output_dir="outputs",
    export_format="json"
)

run_pipeline(config)

3. Plain Text Files¶

Description: Simple text files containing unstructured content.

File Extensions: .txt

Processing: Skips OCR and visual processing. Goes directly to LLM extraction.

Requirements:

- Must use LLM backend (VLM requires visual content)
- File must not be empty or contain only whitespace

CLI Example:

docling-graph convert notes.txt -t templates.report.Report --backend llm

Python API Example:

config = PipelineConfig(
    source="meeting_notes.txt",
    template="templates.report.Report",
    backend="llm",  # Required for text inputs
    inference="remote",
    processing_mode="many-to-one",
    docling_config="ocr",  # Ignored for text inputs
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

4. Markdown Files¶

Description: Markdown-formatted text files with structure and formatting.

File Extensions: .md

Processing: Skips OCR and visual processing. Markdown structure is preserved during extraction.

Requirements:

- Must use LLM backend
- File must not be empty

CLI Example:

docling-graph convert README.md -t templates.documentation.Documentation --backend llm

Python API Example:

config = PipelineConfig(
    source="documentation.md",
    template="templates.documentation.Documentation",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="json"
)

run_pipeline(config)

5. URLs¶

Description: Download and process documents from URLs.

Format: http:// or https:// URLs

Processing:

1. Downloads content to temporary location
2. Detects content type (PDF, image, text, markdown)
3. Routes to appropriate processing pipeline

Supported URL Content Types:

- PDF documents
- Image files (PNG, JPG, JPEG)
- Plain text files
- Markdown files

Requirements:

- Valid HTTP/HTTPS URL
- Accessible without authentication (for now)
- File size under limit (default: 100MB)
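If you want to check these requirements before handing a URL to the pipeline, a pre-flight check might look like the sketch below. It uses only the standard library. `check_url` and `check_size` are hypothetical helpers, and treating 1 MB as 1024 × 1024 bytes is an assumption about how the limit is measured.

```python
from urllib.parse import urlparse

MAX_SIZE_MB = 100  # default limit quoted in the requirements above


def check_url(url: str) -> None:
    """Raise ValueError unless the URL is an http(s) URL with a host."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("URL must use http or https scheme")
    if not parsed.netloc:
        raise ValueError("URL has no host")


def check_size(num_bytes: int, max_size_mb: int = MAX_SIZE_MB) -> None:
    """Raise ValueError if a known content length exceeds the limit."""
    # Assumption: 1 MB is counted as 1024 * 1024 bytes.
    if num_bytes > max_size_mb * 1024 * 1024:
        raise ValueError(f"File exceeds the {max_size_mb}MB limit")
```

`check_size` expects a byte count you already know, for example from a `Content-Length` header.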

CLI Example:

# PDF from URL
docling-graph convert https://example.com/invoice.pdf -t templates.billing_document.BillingDocument

# Image from URL
docling-graph convert https://example.com/scan.jpg -t templates.form.Form

# Text from URL
docling-graph convert https://example.com/notes.txt -t templates.report.Report --backend llm

Python API Example:

config = PipelineConfig(
    source="https://example.com/document.pdf",
    template="templates.billing_document.BillingDocument",
    backend="llm",
    inference="remote",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

URL Configuration:

from docling_graph.core.input.handlers import URLInputHandler

# Custom timeout and size limit
handler = URLInputHandler(
    timeout=30,      # seconds
    max_size_mb=50   # megabytes
)

6. Plain Text Strings (Python API Only)¶

Description: Raw text strings passed directly to the pipeline.

Format: Python string

Processing: Skips OCR and visual processing. Direct LLM extraction.

Requirements:

- Only available via Python API (not CLI)
- Must use LLM backend
- String must not be empty or whitespace-only

Python API Example:

# Direct text input
text_content = """
Invoice #12345
Date: 2024-01-15
Amount: $1,234.56
Customer: Acme Corp
"""

config = PipelineConfig(
    source=text_content,  # Pass string directly
    template="templates.billing_document.BillingDocument",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="json"
)

run_pipeline(config, mode="api")  # mode="api" required

CLI Input Restriction

CLI does not accept plain text strings to avoid ambiguity with file paths.

7. DoclingDocument JSON (Advanced)¶

Description: Pre-processed DoclingDocument JSON files.

File Extensions: .json (with DoclingDocument schema)

Processing: Skips document conversion. Uses pre-existing structure.

Use Cases:

- Reprocessing previously converted documents
- Custom document preprocessing pipelines
- Integration with external Docling workflows

Requirements:

- Valid DoclingDocument JSON schema
- Must include schema_name: "DoclingDocument"
- Must include version field

CLI Example:

docling-graph convert processed_document.json -t templates.custom.Custom

Python API Example:

config = PipelineConfig(
    source="preprocessed.json",
    template="templates.custom.Custom",
    backend="llm",
    inference="local",
    processing_mode="many-to-one",
    output_dir="outputs",
    export_format="csv"
)

run_pipeline(config)

DoclingDocument JSON Structure:

{
  "schema_name": "DoclingDocument",
  "version": "1.0.0",
  "name": "document_name",
  "pages": {
    "0": {
      "page_no": 0,
      "size": {"width": 612, "height": 792}
    }
  },
  "body": {
    "self_ref": "#/body",
    "children": []
  },
  "furniture": {}
}
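Before feeding a JSON file to the pipeline, you can verify the two required fields yourself. The helper below is a minimal sketch, not the library's validator: `docling_json_problems` is a hypothetical name, and it checks only `schema_name` and `version`, not the full DoclingDocument schema.

```python
import json

REQUIRED_SCHEMA_NAME = "DoclingDocument"


def docling_json_problems(raw: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]

    problems = []
    if data.get("schema_name") != REQUIRED_SCHEMA_NAME:
        problems.append('schema_name must be "DoclingDocument"')
    if "version" not in data:
        problems.append("missing version field")
    return problems
```

A typical call is `docling_json_problems(Path("preprocessed.json").read_text(encoding="utf-8"))`, proceeding only when the result is empty.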

Input Format Detection¶

The pipeline automatically detects input format based on:

File Extension: .pdf, .png, .jpg, .jpeg, .txt, .md, .json
URL Scheme: http:// or https://
Content Analysis: For JSON files, checks for DoclingDocument schema
Fallback: Plain text for unrecognized formats (API mode only)

Detection Examples:

from docling_graph.core.input.types import InputTypeDetector

# Detect from file path
input_type = InputTypeDetector.detect("document.pdf", mode="cli")
# Returns: InputType.PDF

# Detect from URL
input_type = InputTypeDetector.detect("https://example.com/file.txt", mode="cli")
# Returns: InputType.URL

# Detect from text (API mode only)
input_type = InputTypeDetector.detect("Plain text content", mode="api")
# Returns: InputType.TEXT
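To make the detection order concrete, here is a standalone sketch of the same decision sequence. It is not the `InputTypeDetector` implementation: `detect_kind`, the string labels, and the helper are hypothetical. The URL check runs first because, as the example above shows, a URL ending in `.txt` is still reported as a URL.

```python
import json
from pathlib import Path

# Extensions listed in this guide, mapped to illustrative labels.
EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".txt": "text",
    ".md": "markdown",
}


def _is_docling_json(path: Path) -> bool:
    """Content analysis: does the file declare the DoclingDocument schema?"""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False
    return isinstance(data, dict) and data.get("schema_name") == "DoclingDocument"


def detect_kind(source: str, mode: str = "cli") -> str:
    # 1. URL scheme takes precedence over any extension in the URL path.
    if source.startswith(("http://", "https://")):
        return "url"
    # 2. File extension (multi-line strings are never treated as paths).
    suffix = "" if "\n" in source else Path(source).suffix.lower()
    if suffix in EXTENSION_KINDS:
        return EXTENSION_KINDS[suffix]
    # 3. Content analysis for JSON files.
    if suffix == ".json" and _is_docling_json(Path(source)):
        return "docling_document"
    # 4. Fallback: plain text, but only in API mode.
    if mode == "api":
        return "text"
    raise ValueError("Plain text input is only supported via Python API")
```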

Processing Pipeline by Input Type¶

Visual Documents (PDF, Images)¶

Input → Document Conversion (OCR/VLM) → Segmentation → 
Extraction → Graph Construction → Export

Text Documents (.txt, .md, plain text)¶

Input → Text Normalization → Extraction (LLM only) → 
Graph Construction → Export

URLs¶

URL → Download → Type Detection → Route to appropriate pipeline
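The type detection step for a downloaded URL could, for instance, combine the response's `Content-Type` header with the extension in the URL path. The sketch below assumes that approach; the guide does not say which signals the library actually uses, and `kind_from_download` and both lookup tables are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlparse

# Assumed mapping from media types to the content types supported for URLs.
CONTENT_TYPE_KINDS = {
    "application/pdf": "pdf",
    "image/png": "image",
    "image/jpeg": "image",
    "text/plain": "text",
    "text/markdown": "markdown",
}

EXTENSION_KINDS = {
    ".pdf": "pdf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".txt": "text",
    ".md": "markdown",
}


def kind_from_download(url: str, content_type: Optional[str] = None) -> str:
    """Pick a pipeline for downloaded content, preferring the header."""
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        media_type = content_type.split(";", 1)[0].strip().lower()
        kind = CONTENT_TYPE_KINDS.get(media_type)
        if kind is not None:
            return kind
    # Fall back to the extension in the URL path (query string ignored).
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    kind = EXTENSION_KINDS.get(suffix)
    if kind is None:
        raise ValueError(f"Unsupported URL content: {url}")
    return kind
```

Once the type is known, the content follows the matching pipeline from this section, including its backend restrictions.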

DoclingDocument JSON¶

Input → Validation → Graph Construction → Export
(Skips conversion and extraction)

Backend Compatibility¶

Input Format	LLM Backend	VLM Backend
PDF	✅ Yes	✅ Yes
Images	✅ Yes	✅ Yes
Text Files	✅ Yes	❌ No
Markdown	✅ Yes	❌ No
URLs (PDF/Image)	✅ Yes	✅ Yes
URLs (Text/MD)	✅ Yes	❌ No
Plain Text	✅ Yes	❌ No
DoclingDocument	✅ Yes	✅ Yes

Backend Requirements

VLM (Vision Language Model) backend requires visual content. Use LLM backend for text-only inputs.
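The compatibility table reduces to one rule: text-only inputs cannot use the VLM backend. A pre-flight check along these lines can catch the mismatch before the pipeline runs. `check_backend` and the string labels are hypothetical, and a URL should be classified by the type of content it resolves to.

```python
# Input kinds that carry no visual content (rows marked "No" for VLM above).
TEXT_ONLY_KINDS = {"text", "markdown"}


def check_backend(kind: str, backend: str) -> None:
    """Raise ValueError if the backend cannot process this kind of input."""
    if backend not in ("llm", "vlm"):
        raise ValueError(f"Unknown backend: {backend!r}")
    if backend == "vlm" and kind in TEXT_ONLY_KINDS:
        raise ValueError(
            "VLM backend does not support text-only inputs. "
            "Use LLM backend instead."
        )
```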

Error Handling¶

Empty Files¶

$ docling-graph convert empty.txt -t templates.Report
Error: Text input is empty

Unsupported Backend¶

$ docling-graph convert notes.txt -t templates.Report --backend vlm
Error: VLM backend does not support text-only inputs. Use LLM backend instead.

Invalid URL¶

$ docling-graph convert ftp://example.com/file.pdf -t templates.BillingDocument
Error: URL must use http or https scheme

File Not Found¶

$ docling-graph convert missing.pdf -t templates.BillingDocument
Error: File not found: missing.pdf

Best Practices¶

👍 Choose the Right Backend¶

PDFs and Images: Use VLM for complex layouts, LLM for text-heavy documents
Text Files: Always use LLM backend
Mixed Workflows: Use LLM backend for maximum compatibility

👍 Validate Input Files¶

from pathlib import Path

source_path = Path("document.txt")
if not source_path.exists():
    raise FileNotFoundError(f"Input file not found: {source_path}")

if source_path.stat().st_size == 0:
    raise ValueError("Input file is empty")
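The size check above does not catch text files that contain only whitespace, which the requirements for text inputs also rule out. A small extension, assuming UTF-8 content (`ensure_text_has_content` is a hypothetical helper):

```python
from pathlib import Path


def ensure_text_has_content(path: Path, encoding: str = "utf-8") -> str:
    """Return the file's text, or raise if it is empty or only whitespace."""
    text = path.read_text(encoding=encoding)
    if not text.strip():
        raise ValueError("Text input is empty")
    return text
```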

👍 Handle URLs Safely¶

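from docling_graph.core.input.validators import URLValidator
# Assumed import path for ValidationError; adjust to match your installed version.
from docling_graph.exceptions import ValidationError

url = "https://example.com/document.pdf"

validator = URLValidator()
try:
    validator.validate(url)
except ValidationError as e:
    print(f"Invalid URL: {e}")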

👍 Use Appropriate Processing Modes¶

one-to-one: Best for multi-page PDFs where each page is independent
many-to-one: Best for text files and single-entity documents

Troubleshooting¶

🐛 Plain text input is only supported via Python API¶

Cause: Trying to pass plain text string via CLI

Solution: Use Python API or save text to a .txt file first

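from pathlib import Path

# Option 1: Use Python API (config.source holds the text itself)
run_pipeline(config, mode="api")

# Option 2: Save the text to a file and pass the path instead
Path("temp.txt").write_text(text_content, encoding="utf-8")
config.source = "temp.txt"
run_pipeline(config, mode="cli")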

🐛 VLM backend does not support text-only inputs¶

Cause: Using VLM backend with text files

Solution: Switch to LLM backend

docling-graph convert notes.txt -t templates.Report --backend llm

🐛 URL download timeout¶

Cause: Slow network or large file

Solution: Increase timeout or download manually

from docling_graph.core.input.handlers import URLInputHandler

url = "https://example.com/document.pdf"

handler = URLInputHandler(timeout=60)  # 60 seconds
temp_path = handler.load(url)
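For the "download manually" route, one option is to fetch the file yourself with the standard library and pass the local path as `source`. The sketch below does not use Docling Graph at all. `copy_with_limit` and `download` are hypothetical helpers, and the 100 MB default mirrors the limit quoted earlier, counted here as 1024 × 1024 bytes per MB.

```python
from pathlib import Path
from urllib.request import urlopen

CHUNK_SIZE = 64 * 1024  # read in 64 KiB pieces to keep memory use flat


def copy_with_limit(stream, dest: Path, max_bytes: int) -> int:
    """Copy a binary stream to dest, aborting once max_bytes is exceeded."""
    written = 0
    try:
        with dest.open("wb") as out:
            while True:
                chunk = stream.read(CHUNK_SIZE)
                if not chunk:
                    break
                written += len(chunk)
                if written > max_bytes:
                    raise ValueError(f"Download exceeds {max_bytes} bytes")
                out.write(chunk)
    except ValueError:
        # Do not leave a truncated file behind.
        dest.unlink(missing_ok=True)
        raise
    return written


def download(url: str, dest: Path, timeout: float = 60.0,
             max_bytes: int = 100 * 1024 * 1024) -> Path:
    """Fetch an http(s) URL to dest with a timeout and a size cap."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("URL must use http or https scheme")
    with urlopen(url, timeout=timeout) as response:
        copy_with_limit(response, dest, max_bytes)
    return dest
```

After `download(url, Path("document.pdf"))` succeeds, set `source="document.pdf"` in `PipelineConfig` and run the pipeline as for any local file.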

Next Steps¶

Backend Selection - Choose the right backend for your input
Processing Modes - Understand one-to-one vs many-to-one
Configuration Examples - See complete configuration examples