Processing Modes¶

Overview¶

Processing modes determine how Docling Graph handles multi-page documents. The choice between one-to-one and many-to-one modes significantly affects extraction results and graph structure.

In this guide:

- One-to-one vs many-to-one comparison
- When to use each mode
- Graph structure differences
- Performance implications
- Mode-specific configuration

Processing Mode Comparison¶

Quick Comparison¶

Aspect	One-to-One	Many-to-One
Processing	Each page separately	Whole document together
Output	N models (one per page)	1 merged model
Best For	Independent pages	Single document entity
Graph Nodes	More nodes	Fewer nodes
Context	Page-level	Document-level
Speed	Slower (N extractions)	Faster (1 extraction)
Accuracy	Page-specific	Document-wide

One-to-One Mode¶

What is One-to-One?¶

One-to-one mode processes each page independently, creating separate extraction results for each page. Best for documents where pages are independent entities.

Configuration¶

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"  # Process each page separately
)

How It Works¶

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "PDF Document" }

    B@{ shape: doc, label: "Page 1" }
    C@{ shape: doc, label: "Page 2" }
    D@{ shape: doc, label: "Page 3" }

    E@{ shape: tag-proc, label: "Extract 1" }
    F@{ shape: tag-proc, label: "Extract 2" }
    G@{ shape: tag-proc, label: "Extract 3" }

    H@{ shape: procs, label: "Model 1" }
    I@{ shape: procs, label: "Model 2" }
    J@{ shape: procs, label: "Model 3" }

    %% 3. Define Connections
    A --> B & C & D

    B --> E
    C --> F
    D --> G

    E --> H
    F --> I
    G --> J

    %% 4. Apply Classes
    class A input
    class B,C,D data
    class E,F,G operator
    class H,I,J output

When to Use One-to-One¶

✅ Use one-to-one when:

- Each page is an independent document (e.g., a batch of invoices)
- Pages have different structures
- You need page-level granularity
- Pages represent separate entities
- You want to track which page the data came from

❌ Don't use one-to-one when:

- The document is a single entity spanning multiple pages
- Pages continue the same content
- You need document-wide context
- You want a single consolidated result

Example Use Cases¶

Use Case 1: Batch Invoice Processing¶

# Multiple invoices in one PDF
config = PipelineConfig(
    source="invoices_batch.pdf",  # 10 invoices, 1 page each
    template="templates.BillingDocument",
    processing_mode="one-to-one"  # Each page is a separate invoice
)

Result: 10 BillingDocument models, one per page

Use Case 2: Form Collection¶

# Multiple forms in one PDF
config = PipelineConfig(
    source="forms_collection.pdf",  # 20 forms
    template="templates.ApplicationForm",
    processing_mode="one-to-one"  # Each page is a separate form
)

Result: 20 ApplicationForm models

Use Case 3: ID Card Batch¶

# Multiple ID cards scanned together
config = PipelineConfig(
    source="id_cards_batch.pdf",  # 50 ID cards
    template="templates.IDCard",
    processing_mode="one-to-one"  # Each page is a separate ID
)

Result: 50 IDCard models

Graph Structure¶

One-to-one creates multiple root nodes:

Invoice-Page1 (node)
  ├─ ISSUED_BY → Organization-A
  └─ SENT_TO → Client-A

Invoice-Page2 (node)
  ├─ ISSUED_BY → Organization-B
  └─ SENT_TO → Client-B

Invoice-Page3 (node)
  ├─ ISSUED_BY → Organization-C
  └─ SENT_TO → Client-C

Performance Characteristics¶

Document: 10-page PDF

One-to-One Processing:
- Extractions: 10 (one per page)
- Memory: Moderate (sequential processing)
- Output: 10 separate models

Many-to-One Mode¶

What is Many-to-One?¶

Many-to-one mode processes the entire document as a single entity, merging all pages into one extraction result. Best for documents that represent a single entity.

Configuration¶

from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Process whole document (default)
)

How It Works¶

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A@{ shape: terminal, label: "PDF Document" }

    B@{ shape: doc, label: "All Pages" }
    C@{ shape: tag-proc, label: "Chunking" }
    D@{ shape: procs, label: "Extract Chunks" }
    E@{ shape: lin-proc, label: "Merge Results" }

    F@{ shape: procs, label: "Single Model" }

    %% 3. Define Connections
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

    %% 4. Apply Classes
    class A input
    class B data
    class C operator
    class D,E process
    class F output

When to Use Many-to-One¶

✅ Use many-to-one when:

- The document is a single entity (e.g., one invoice spanning multiple pages)
- Pages continue the same content
- You need document-wide context
- You want a single consolidated result
- The document has cross-page relationships

❌ Don't use many-to-one when:

- Each page is independent
- Pages have different structures
- You need page-level tracking
- Pages represent separate entities

Example Use Cases¶

Use Case 1: Multi-Page Invoice¶

# Single invoice spanning 3 pages
config = PipelineConfig(
    source="invoice_multipage.pdf",  # 1 invoice, 3 pages
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Merge all pages
)

Result: 1 BillingDocument model with data from all pages

Use Case 2: Rheology Research¶

# Rheology research with 15 pages
config = PipelineConfig(
    source="research_paper.pdf",  # 1 paper, 15 pages
    template="templates.ScholarlyRheologyPaper",
    processing_mode="many-to-one"  # Single paper entity
)

Result: 1 ScholarlyRheologyPaper model

Use Case 3: Contract Document¶

# Contract with 20 pages
config = PipelineConfig(
    source="contract.pdf",  # 1 contract, 20 pages
    template="templates.Contract",
    processing_mode="many-to-one"  # Single contract
)

Result: 1 Contract model

Graph Structure¶

Many-to-one creates a single root node:

BillingDocument-001 (node)
  ├─ ISSUED_BY → Organization-A
  ├─ SENT_TO → Client-A
  ├─ CONTAINS_LINE → LineItem-1
  ├─ CONTAINS_LINE → LineItem-2
  └─ CONTAINS_LINE → LineItem-3
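
The difference between the two graph shapes can be checked by counting root nodes. The sketch below is only an illustration: it stores each graph as plain (source, relation, target) triples, which is an assumption made here and not Docling Graph's internal representation, and `root_nodes` is a hypothetical helper.

```python
Edge = tuple[str, str, str]  # (source, relation, target)


def root_nodes(edges: list[Edge]) -> list[str]:
    """Return nodes that have outgoing edges but no incoming ones."""
    sources = {source for source, _, _ in edges}
    targets = {target for _, _, target in edges}
    return sorted(sources - targets)


# Edges copied from the one-to-one graph structure example
one_to_one_edges = [
    ("Invoice-Page1", "ISSUED_BY", "Organization-A"),
    ("Invoice-Page1", "SENT_TO", "Client-A"),
    ("Invoice-Page2", "ISSUED_BY", "Organization-B"),
    ("Invoice-Page2", "SENT_TO", "Client-B"),
    ("Invoice-Page3", "ISSUED_BY", "Organization-C"),
    ("Invoice-Page3", "SENT_TO", "Client-C"),
]

# Edges copied from the many-to-one graph structure example
many_to_one_edges = [
    ("BillingDocument-001", "ISSUED_BY", "Organization-A"),
    ("BillingDocument-001", "SENT_TO", "Client-A"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-1"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-2"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-3"),
]

print(root_nodes(one_to_one_edges))   # one root per page
print(root_nodes(many_to_one_edges))  # a single root for the whole document
```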

Performance Characteristics¶

Document: 10-page PDF

Many-to-One Processing:
- Extractions: 1 (whole document)
- Time: ~30 seconds (single extraction)
- Memory: Higher (all pages in context)
- Output: 1 merged model

Detailed Comparison¶

Processing Flow¶

One-to-One Flow¶

1. Convert PDF to pages
2. For each page:
   a. Convert to markdown
   b. Extract with LLM
   c. Create model instance
3. Return list of models
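
The steps above can be sketched in plain Python. This is not Docling Graph's implementation: `process_one_to_one` and `toy_extract` are hypothetical names, the pages are assumed to be already converted to markdown strings, and a simple "key: value" parser stands in for the LLM extraction step.

```python
from typing import Callable


def process_one_to_one(
    pages: list[str],
    extract: Callable[[str], dict],
) -> list[dict]:
    """Run `extract` on each page in isolation and return one result per page."""
    results = []
    for page_no, markdown in enumerate(pages, start=1):
        model = extract(markdown)  # step 2b: no other page is visible here
        model["page"] = page_no    # record which page the data came from
        results.append(model)      # step 2c: one model per page
    return results


def toy_extract(markdown: str) -> dict:
    """Hypothetical extractor: parses 'key: value' lines instead of calling an LLM."""
    fields = {}
    for line in markdown.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    return fields


pages = ["Name: Ada\nID: 17", "Name: Grace\nID: 42"]
models = process_one_to_one(pages, toy_extract)
print(len(models))  # one model per input page
```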

Many-to-One Flow¶

1. Convert PDF to markdown (all pages)
2. Chunk markdown if needed
3. Extract from chunks
4. Merge chunk results
5. Return single model
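
A minimal sketch of the chunk, extract, and merge steps follows. Everything in it is an assumption for illustration: `chunk`, `toy_extract`, and `merge` are hypothetical helpers, the chunker splits on blank lines, and the merge is a fixed rule (lists are concatenated, the first scalar value wins) standing in for whatever consolidation the pipeline actually performs.

```python
def chunk(markdown: str, max_chars: int) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def toy_extract(text: str) -> dict:
    """Hypothetical extractor: 'Item:' lines become list entries, other keys scalars."""
    result = {"items": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "item":
            result["items"].append(value)
        else:
            result[key] = value
    return result


def merge(partials: list[dict]) -> dict:
    """Merge chunk results: lists are concatenated, the first scalar value wins."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            if isinstance(value, list):
                merged.setdefault(key, []).extend(value)
            else:
                merged.setdefault(key, value)
    return merged


document = "Invoice: 001\n\nItem: Widget\nItem: Gadget\n\nItem: Gizmo\nTotal: $1000"
chunks = chunk(document, max_chars=30)
model = merge([toy_extract(c) for c in chunks])
print(model)
```

The merge rule is the part most likely to differ in practice: conflicting scalar values across chunks need a policy, and "first value wins" is only the simplest one.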

Context Handling¶

One-to-One Context¶

# Each page has isolated context
Page 1: "Invoice #001, Total: $100"
Page 2: "Invoice #002, Total: $200"
Page 3: "Invoice #003, Total: $300"

# Result: 3 separate invoices

Many-to-One Context¶

# All pages share context
Page 1: "Invoice #001"
Page 2: "Line items continued..."
Page 3: "Total: $1000"

# Result: 1 invoice with all information
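
The effect of context can be reproduced with a toy extractor. In the sketch below, `find_fields` is a hypothetical regex-based stand-in for the LLM, applied to the three-page invoice from the many-to-one example: once per page, then once on the joined text.

```python
import re


def find_fields(text: str) -> dict:
    """Toy extractor: look for an invoice number and a total in the given text."""
    number = re.search(r"Invoice #(\d+)", text)
    total = re.search(r"Total: \$(\d+)", text)
    return {
        "number": number.group(1) if number else None,
        "total": int(total.group(1)) if total else None,
    }


pages = ["Invoice #001", "Line items continued...", "Total: $1000"]

# One-to-one: each page is seen alone, so no single result is complete.
per_page = [find_fields(page) for page in pages]

# Many-to-one: the pages are read together, so both fields are found.
whole_doc = find_fields("\n".join(pages))

print(per_page)
print(whole_doc)
```

Running a multi-page invoice through one-to-one mode yields fragments like `per_page`, which is why that mode suits only pages that are complete on their own.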

Memory Usage¶

Document: 100-page PDF

One-to-One:
- Peak memory: ~2GB (one page at a time)
- Sustained: Low (sequential)

Many-to-One:
- Peak memory: ~8GB (all pages loaded)
- Sustained: High (full document)

Choosing the Right Mode¶

Decision Tree¶

%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
    %% 1. Define Classes
    classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
    classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
    classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
    classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
    classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
    classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
    classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238

    %% 2. Define Nodes
    A("Start")

    B{"Are pages<br/>independent?"}
    D{"Single entity<br/>across pages?"}
    F{"Need page-level<br/>tracking?"}

    C@{ shape: tag-proc, label: "One-to-One" }
    E@{ shape: tag-proc, label: "Many-to-One" }

    %% 3. Define Connections
    A --> B

    B -- Yes --> C
    B -- No --> D

    D -- Yes --> E
    D -- No --> F

    F -- Yes --> C
    F -- No --> E

    %% 4. Apply Classes
    class A input
    class B,D,F decision
    class C,E output

By Document Type¶

Document Type	Recommended Mode	Reason
Single Invoice (multi-page)	Many-to-One	Single entity
Batch Invoices (1 per page)	One-to-One	Independent pages
Rheology Research	Many-to-One	Single document
Form Collection	One-to-One	Independent forms
Contract	Many-to-One	Single contract
ID Card Batch	One-to-One	Independent IDs
Report	Many-to-One	Single report
Receipt Stack	One-to-One	Independent receipts
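
If document types are known ahead of time, the table can be encoded as a lookup. This is a sketch: `RECOMMENDED_MODE` and `recommended_mode` are hypothetical names, and falling back to many-to-one for unknown types is an assumption based on it being the default mode.

```python
# Recommendations copied from the table above, keyed by lower-cased document type
RECOMMENDED_MODE = {
    "single invoice (multi-page)": "many-to-one",
    "batch invoices (1 per page)": "one-to-one",
    "rheology research": "many-to-one",
    "form collection": "one-to-one",
    "contract": "many-to-one",
    "id card batch": "one-to-one",
    "report": "many-to-one",
    "receipt stack": "one-to-one",
}


def recommended_mode(document_type: str, default: str = "many-to-one") -> str:
    """Look up the recommended mode, ignoring case and surrounding whitespace."""
    return RECOMMENDED_MODE.get(document_type.strip().lower(), default)


print(recommended_mode("Receipt Stack"))
```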

Mode-Specific Configuration¶

One-to-One Configuration¶

config = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",

    # One-to-one specific settings
    export_per_page_markdown=True,  # Export markdown per page
    use_chunking=False  # No chunking needed (pages are small)
)

Many-to-One Configuration¶

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one",

    # Many-to-one specific settings
    use_chunking=True,  # Enable chunking for large docs
    llm_consolidation=True,  # Merge results with LLM
    max_batch_size=5  # Process chunks in batches
)
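
To keep mode and settings in sync, the two configurations above can be generated from one place. The helper below is a hypothetical convenience, not part of Docling Graph; it only returns a dictionary using the option names shown in this section.

```python
def mode_settings(mode: str) -> dict:
    """Return the keyword arguments this guide pairs with each processing mode."""
    if mode == "one-to-one":
        return {
            "processing_mode": mode,
            "export_per_page_markdown": True,  # Export markdown per page
            "use_chunking": False,             # Pages are small
        }
    if mode == "many-to-one":
        return {
            "processing_mode": mode,
            "use_chunking": True,       # Enable chunking for large docs
            "llm_consolidation": True,  # Merge results with LLM
            "max_batch_size": 5,        # Process chunks in batches
        }
    raise ValueError(f"Unknown processing mode: {mode!r}")


settings = mode_settings("many-to-one")
print(settings)
```

Assuming `PipelineConfig` accepts these keywords as in the examples above, the result can be unpacked into it: `PipelineConfig(source="document.pdf", template="templates.BillingDocument", **settings)`.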

Switching Between Modes¶

From Many-to-One to One-to-One¶

# Original: many-to-one
config_many = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"
)

# Switch to one-to-one
config_one = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",  # Change mode
    export_per_page_markdown=True  # Add page-specific export
)

From One-to-One to Many-to-One¶

# Original: one-to-one
config_one = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)

# Switch to many-to-one
config_many = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one",  # Change mode
    use_chunking=True,  # Enable chunking
    llm_consolidation=True  # Enable consolidation
)

Common Patterns¶

📍 Batch Processing with One-to-One¶

# Process batch of documents
config = PipelineConfig(
    source="invoices_batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",
    export_per_page_markdown=True
)

# Result: One invoice per page

📍 Single Document with Many-to-One¶

# Process single multi-page document
config = PipelineConfig(
    source="contract.pdf",
    template="templates.Contract",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=True
)

# Result: One contract with all pages

📍 Conditional Mode Selection¶

def get_processing_mode(page_count: int, is_batch: bool) -> str:
    """Choose mode based on document characteristics."""
    if is_batch:
        return "one-to-one"  # Each page is a separate entity
    # A single document is processed as one entity regardless of size;
    # large docs (page_count > 10) rely on chunking, small ones do not need it
    return "many-to-one"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode=get_processing_mode(page_count=15, is_batch=False)
)

Best Practices¶

👍 Match Mode to Document Structure¶

# ✅ Good - Mode matches document structure
document_is_batch = True  # Example value: True when each page is a separate document
mode = "one-to-one" if document_is_batch else "many-to-one"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode=mode
)

👍 Enable Appropriate Settings¶

# ✅ Good - Settings match mode
mode = "one-to-one"  # Example value; choose based on document structure
if mode == "one-to-one":
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        processing_mode="one-to-one",
        export_per_page_markdown=True  # Page-specific
    )
else:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        processing_mode="many-to-one",
        use_chunking=True,  # Document-wide
        llm_consolidation=True
    )

👍 Consider Performance¶

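# ✅ Good - Consider document size
page_count = 120  # Example value; use the real page count of your PDF
if page_count > 50:
    # Large batch: one-to-one runs one extraction per page and might be slow
    print(f"Warning: processing {page_count} pages individually")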

config = PipelineConfig(
    source="large_batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)

Next Steps¶

Now that you understand processing modes:

- Docling Settings: Configure document conversion
- Export Configuration: Set output formats
- Configuration Examples: See complete scenarios