Processing Modes¶
Overview¶
Processing modes determine how Docling Graph handles multi-page documents. The choice between one-to-one and many-to-one modes significantly affects extraction results and graph structure.
In this guide:
- One-to-one vs many-to-one comparison
- When to use each mode
- Graph structure differences
- Performance implications
- Mode-specific configuration
Processing Mode Comparison¶
Quick Comparison¶
| Aspect | One-to-One | Many-to-One |
|---|---|---|
| Processing | Each page separately | Whole document together |
| Output | N models (one per page) | 1 merged model |
| Best For | Independent pages | Single document entity |
| Graph Nodes | More nodes | Fewer nodes |
| Context | Page-level | Document-level |
| Speed | Slower (N extractions) | Faster (1 extraction) |
| Accuracy | Page-specific | Document-wide |
One-to-One Mode¶
What is One-to-One?¶
One-to-one mode processes each page independently and produces a separate extraction result for each page. It works best for documents whose pages are independent entities.
Configuration¶
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"  # Process each page separately
)
How It Works¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "PDF Document" }
B@{ shape: doc, label: "Page 1" }
C@{ shape: doc, label: "Page 2" }
D@{ shape: doc, label: "Page 3" }
E@{ shape: tag-proc, label: "Extract 1" }
F@{ shape: tag-proc, label: "Extract 2" }
G@{ shape: tag-proc, label: "Extract 3" }
H@{ shape: procs, label: "Model 1" }
I@{ shape: procs, label: "Model 2" }
J@{ shape: procs, label: "Model 3" }
%% 3. Define Connections
A --> B & C & D
B --> E
C --> F
D --> G
E --> H
F --> I
G --> J
%% 4. Apply Classes
class A input
class B,C,D data
class E,F,G operator
class H,I,J output
When to Use One-to-One¶
✅ Use one-to-one when:
- Each page is an independent document (e.g., a batch of invoices)
- Pages have different structures
- You need page-level granularity
- Pages represent separate entities
- You want to track which page each piece of data came from

❌ Don't use one-to-one when:
- The document is a single entity spanning multiple pages
- Pages are a continuation of the same content
- You need document-wide context
- You want a single consolidated result
Example Use Cases¶
Use Case 1: Batch Invoice Processing¶
# Multiple invoices in one PDF
config = PipelineConfig(
    source="invoices_batch.pdf",  # 10 invoices, 1 page each
    template="templates.BillingDocument",
    processing_mode="one-to-one"  # Each page is a separate invoice
)
Result: 10 BillingDocument models, one per page
Use Case 2: Form Collection¶
# Multiple forms in one PDF
config = PipelineConfig(
    source="forms_collection.pdf",  # 20 forms
    template="templates.ApplicationForm",
    processing_mode="one-to-one"  # Each page is a separate form
)
Result: 20 ApplicationForm models
Use Case 3: ID Card Batch¶
# Multiple ID cards scanned together
config = PipelineConfig(
    source="id_cards_batch.pdf",  # 50 ID cards
    template="templates.IDCard",
    processing_mode="one-to-one"  # Each page is a separate ID
)
Result: 50 IDCard models
Graph Structure¶
One-to-one creates multiple root nodes:
Invoice-Page1 (node)
├─ ISSUED_BY → Organization-A
└─ SENT_TO → Client-A
Invoice-Page2 (node)
├─ ISSUED_BY → Organization-B
└─ SENT_TO → Client-B
Invoice-Page3 (node)
├─ ISSUED_BY → Organization-C
└─ SENT_TO → Client-C
Performance Characteristics¶
Document: 10-page PDF
One-to-One Processing:
- Extractions: 10 (one per page)
- Memory: Moderate (sequential processing)
- Output: 10 separate models
Many-to-One Mode¶
What is Many-to-One?¶
Many-to-one mode processes the entire document as a single entity and merges all pages into one extraction result. It works best for documents that represent a single entity.
Configuration¶
from docling_graph import PipelineConfig

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Process whole document (default)
)
How It Works¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart LR
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A@{ shape: terminal, label: "PDF Document" }
B@{ shape: doc, label: "All Pages" }
C@{ shape: tag-proc, label: "Chunking" }
D@{ shape: procs, label: "Extract Chunks" }
E@{ shape: lin-proc, label: "Merge Results" }
F@{ shape: procs, label: "Single Model" }
%% 3. Define Connections
A --> B
B --> C
C --> D
D --> E
E --> F
%% 4. Apply Classes
class A input
class B data
class C operator
class D,E process
class F output
When to Use Many-to-One¶
✅ Use many-to-one when:
- The document is a single entity (e.g., one invoice spanning multiple pages)
- Pages are a continuation of the same content
- You need document-wide context
- You want a single consolidated result
- The document has cross-page relationships

❌ Don't use many-to-one when:
- Each page is independent
- Pages have different structures
- You need page-level tracking
- Pages represent separate entities
Example Use Cases¶
Use Case 1: Multi-Page Invoice¶
# Single invoice spanning 3 pages
config = PipelineConfig(
    source="invoice_multipage.pdf",  # 1 invoice, 3 pages
    template="templates.BillingDocument",
    processing_mode="many-to-one"  # Merge all pages
)
Result: 1 BillingDocument model with data from all pages
Use Case 2: Rheology Research¶
# Rheology research paper with 15 pages
config = PipelineConfig(
    source="research_paper.pdf",  # 1 paper, 15 pages
    template="templates.ScholarlyRheologyPaper",
    processing_mode="many-to-one"  # Single paper entity
)
Result: 1 ScholarlyRheologyPaper model
Use Case 3: Contract Document¶
# Contract with 20 pages
config = PipelineConfig(
    source="contract.pdf",  # 1 contract, 20 pages
    template="templates.Contract",
    processing_mode="many-to-one"  # Single contract
)
Result: 1 Contract model
Graph Structure¶
Many-to-one creates a single root node:
BillingDocument-001 (node)
├─ ISSUED_BY → Organization-A
├─ SENT_TO → Client-A
├─ CONTAINS_LINE → LineItem-1
├─ CONTAINS_LINE → LineItem-2
└─ CONTAINS_LINE → LineItem-3
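To make the difference in root nodes concrete, the two graphs shown in this guide can be written as plain edge lists and inspected. The sketch below is only an illustration: `root_nodes` is a hypothetical helper, and the tuple-based edge format is an assumption made for the example, not Docling Graph's internal representation.

```python
# Hypothetical helper: a "root" is a node with outgoing edges and no incoming ones.
def root_nodes(edges: list[tuple[str, str, str]]) -> list[str]:
    sources = {src for src, _, _ in edges}
    targets = {dst for _, _, dst in edges}
    return sorted(sources - targets)


# Edge list mirroring the one-to-one graph (one invoice node per page).
one_to_one_edges = [
    ("Invoice-Page1", "ISSUED_BY", "Organization-A"),
    ("Invoice-Page1", "SENT_TO", "Client-A"),
    ("Invoice-Page2", "ISSUED_BY", "Organization-B"),
    ("Invoice-Page2", "SENT_TO", "Client-B"),
    ("Invoice-Page3", "ISSUED_BY", "Organization-C"),
    ("Invoice-Page3", "SENT_TO", "Client-C"),
]

# Edge list mirroring the many-to-one graph (one merged document node).
many_to_one_edges = [
    ("BillingDocument-001", "ISSUED_BY", "Organization-A"),
    ("BillingDocument-001", "SENT_TO", "Client-A"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-1"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-2"),
    ("BillingDocument-001", "CONTAINS_LINE", "LineItem-3"),
]

print(root_nodes(one_to_one_edges))   # three roots, one per page
print(root_nodes(many_to_one_edges))  # a single root
```

Counting roots this way is a quick sanity check after a run: a many-to-one result with several roots, or a one-to-one result with fewer roots than pages, suggests the mode does not match the document.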
Performance Characteristics¶
Document: 10-page PDF
Many-to-One Processing:
- Extractions: 1 document-level extraction (split into chunks if needed)
- Time: ~30 seconds (single extraction)
- Memory: Higher (all pages in context)
- Output: 1 merged model
Detailed Comparison¶
Processing Flow¶
One-to-One Flow¶
1. Convert PDF to pages
2. For each page:
   a. Convert to markdown
   b. Extract with LLM
   c. Create model instance
3. Return list of models
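The one-to-one steps above can be sketched in plain Python. This is a minimal illustration, not Docling Graph's implementation: `PageModel`, `page_to_markdown`, `extract_fields`, and `process_one_to_one` are hypothetical names, and the LLM call is replaced by a trivial "Key: value" parser so the example runs without dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class PageModel:
    """Hypothetical stand-in for a template instance built from one page."""
    page_number: int
    fields: dict = field(default_factory=dict)


def page_to_markdown(page_text: str) -> str:
    # Stand-in for the real page-to-markdown conversion.
    return page_text.strip()


def extract_fields(markdown: str) -> dict:
    # Stand-in for the LLM call: parse comma-separated "Key: value" pairs.
    result = {}
    for part in markdown.split(","):
        key, _, value = part.partition(":")
        if value:
            result[key.strip()] = value.strip()
    return result


def process_one_to_one(pages: list[str]) -> list[PageModel]:
    """Run one isolated extraction per page; return one model per page."""
    models = []
    for number, page_text in enumerate(pages, start=1):
        markdown = page_to_markdown(page_text)    # step 2a
        fields = extract_fields(markdown)         # step 2b
        models.append(PageModel(number, fields))  # step 2c
    return models                                 # step 3


pages = ["Invoice: 001, Total: $100", "Invoice: 002, Total: $200"]
models = process_one_to_one(pages)
print(len(models), models[0].fields)
```

Because each iteration sees only its own page, nothing extracted from page 1 can influence page 2, which is the isolation this mode is chosen for.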
Many-to-One Flow¶
1. Convert PDF to markdown (all pages)
2. Chunk markdown if needed
3. Extract from chunks
4. Merge chunk results
5. Return single model
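A rough sketch of the many-to-one steps follows. All function names are hypothetical, the character-based chunking is a simplification, and the "first value wins" merge is just one possible strategy chosen for the example; the library may merge differently (for instance with LLM consolidation, as configured later in this guide).

```python
def chunk_markdown(markdown: str, max_chars: int) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own chunk.
    """
    chunks, current = [], ""
    for paragraph in markdown.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def extract_chunk(chunk: str) -> dict:
    # Stand-in for the LLM call: read "Key: value" lines.
    found = {}
    for line in chunk.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            found[key.strip()] = value.strip()
    return found


def merge_results(partials: list[dict]) -> dict:
    """Merge chunk results; the first value seen for a key wins (an assumption)."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged.setdefault(key, value)
    return merged


def process_many_to_one(markdown: str, max_chars: int) -> dict:
    chunks = chunk_markdown(markdown, max_chars)   # step 2
    partials = [extract_chunk(c) for c in chunks]  # step 3
    return merge_results(partials)                 # steps 4 and 5


document = "Invoice: 001\n\nIssuer: Organization-A\n\nTotal: $1000"
model = process_many_to_one(document, max_chars=30)
print(model)
```

With `max_chars=30` the sample document splits into three chunks, yet the caller still receives a single merged record.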
Context Handling¶
One-to-One Context¶
# Each page has isolated context
Page 1: "Invoice #001, Total: $100"
Page 2: "Invoice #002, Total: $200"
Page 3: "Invoice #003, Total: $300"
# Result: 3 separate invoices
Many-to-One Context¶
# All pages share context
Page 1: "Invoice #001"
Page 2: "Line items continued..."
Page 3: "Total: $1000"
# Result: 1 invoice with all information
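The two context examples can be reproduced with a toy parser. The sketch below assumes a hypothetical `parse_page` function that uses regular expressions in place of an LLM; it only demonstrates why isolated pages yield fragments when a single entity spans several pages.

```python
import re


def parse_page(text: str) -> dict:
    """Toy extractor: pull an invoice number and a total from text, if present."""
    fields = {}
    number = re.search(r"Invoice #(\d+)", text)
    total = re.search(r"Total: \$(\d+)", text)
    if number:
        fields["invoice_number"] = number.group(1)
    if total:
        fields["total"] = int(total.group(1))
    return fields


# Batch of independent invoices: isolated context per page works well.
batch_pages = [
    "Invoice #001, Total: $100",
    "Invoice #002, Total: $200",
    "Invoice #003, Total: $300",
]
one_to_one = [parse_page(page) for page in batch_pages]

# One invoice spread over three pages.
multipage = ["Invoice #001", "Line items continued...", "Total: $1000"]

# Parsed page by page, the invoice falls apart into incomplete fragments.
fragments = [parse_page(page) for page in multipage]

# Parsed with shared context, fields from different pages land in one record.
many_to_one = parse_page("\n".join(multipage))

print(one_to_one)
print(fragments)
print(many_to_one)
```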
Memory Usage¶
Document: 100-page PDF
One-to-One:
- Peak memory: ~2GB (one page at a time)
- Sustained: Low (sequential)
Many-to-One:
- Peak memory: ~8GB (all pages loaded)
- Sustained: High (full document)
Choosing the Right Mode¶
Decision Tree¶
%%{init: {'theme': 'redux-dark', 'look': 'default', 'layout': 'elk'}}%%
flowchart TD
%% 1. Define Classes
classDef input fill:#E3F2FD,stroke:#90CAF9,color:#0D47A1
classDef config fill:#FFF8E1,stroke:#FFECB3,color:#5D4037
classDef output fill:#E8F5E9,stroke:#A5D6A7,color:#1B5E20
classDef decision fill:#FFE0B2,stroke:#FFB74D,color:#E65100
classDef data fill:#EDE7F6,stroke:#B39DDB,color:#4527A0
classDef operator fill:#F3E5F5,stroke:#CE93D8,color:#6A1B9A
classDef process fill:#ECEFF1,stroke:#B0BEC5,color:#263238
%% 2. Define Nodes
A("Start")
B{"Are pages<br/>independent?"}
D{"Single entity<br/>across pages?"}
F{"Need page-level<br/>tracking?"}
C@{ shape: tag-proc, label: "One-to-One" }
E@{ shape: tag-proc, label: "Many-to-One" }
%% 3. Define Connections
A --> B
B -- Yes --> C
B -- No --> D
D -- Yes --> E
D -- No --> F
F -- Yes --> C
F -- No --> E
%% 4. Apply Classes
class A input
class B,D,F decision
class C,E output
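The decision tree translates directly into a small function. `choose_mode` and its parameter names are hypothetical and exist only for this example; the three questions and their outcomes follow the diagram.

```python
def choose_mode(pages_independent: bool,
                single_entity: bool,
                need_page_tracking: bool) -> str:
    """Walk the decision tree and return the processing mode string."""
    if pages_independent:
        return "one-to-one"
    if single_entity:
        return "many-to-one"
    # Neither clearly independent nor clearly one entity:
    # page-level tracking is the tie-breaker.
    return "one-to-one" if need_page_tracking else "many-to-one"


# A batch of single-page invoices.
print(choose_mode(pages_independent=True, single_entity=False, need_page_tracking=False))
# A 20-page contract.
print(choose_mode(pages_independent=False, single_entity=True, need_page_tracking=False))
```

The returned string can be passed as `processing_mode` when constructing a `PipelineConfig`.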
By Document Type¶
| Document Type | Recommended Mode | Reason |
|---|---|---|
| Single Invoice (multi-page) | Many-to-One | Single entity |
| Batch Invoices (1 per page) | One-to-One | Independent pages |
| Rheology Research | Many-to-One | Single document |
| Form Collection | One-to-One | Independent forms |
| Contract | Many-to-One | Single contract |
| ID Card Batch | One-to-One | Independent IDs |
| Report | Many-to-One | Single report |
| Receipt Stack | One-to-One | Independent receipts |
Mode-Specific Configuration¶
One-to-One Configuration¶
config = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",
    # One-to-one specific settings
    export_per_page_markdown=True,  # Export markdown per page
    use_chunking=False  # No chunking needed (pages are small)
)
Many-to-One Configuration¶
config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one",
    # Many-to-one specific settings
    use_chunking=True,  # Enable chunking for large docs
    llm_consolidation=True,  # Merge results with LLM
    max_batch_size=5  # Process chunks in batches
)
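As a rough illustration of what processing chunks in batches means, the sketch below groups a list of chunks into batches of at most five. The `batched` helper is hypothetical, and how Docling Graph applies `max_batch_size` internally is not described in this guide, so treat this only as the general idea.

```python
def batched(items: list, batch_size: int) -> list[list]:
    """Split items into consecutive batches of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Twelve placeholder chunks, grouped with the batch size used in the config above.
chunks = [f"chunk-{n}" for n in range(1, 13)]
batches = batched(chunks, batch_size=5)

for index, batch in enumerate(batches, start=1):
    print(f"batch {index}: {len(batch)} chunks")
```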
Switching Between Modes¶
From Many-to-One to One-to-One¶
# Original: many-to-one
config_many = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one"
)

# Switch to one-to-one
config_one = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",  # Change mode
    export_per_page_markdown=True  # Add page-specific export
)
From One-to-One to Many-to-One¶
# Original: one-to-one
config_one = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)

# Switch to many-to-one
config_many = PipelineConfig(
    source="batch.pdf",
    template="templates.BillingDocument",
    processing_mode="many-to-one",  # Change mode
    use_chunking=True,  # Enable chunking
    llm_consolidation=True  # Enable consolidation
)
Common Patterns¶
📍 Batch Processing with One-to-One¶
# Process batch of documents
config = PipelineConfig(
    source="invoices_batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one",
    export_per_page_markdown=True
)
# Result: One invoice per page
📍 Single Document with Many-to-One¶
# Process single multi-page document
config = PipelineConfig(
    source="contract.pdf",
    template="templates.Contract",
    processing_mode="many-to-one",
    use_chunking=True,
    llm_consolidation=True
)
# Result: One contract with all pages
📍 Conditional Mode Selection¶
from docling_graph import PipelineConfig

def get_processing_mode(page_count: int, is_batch: bool) -> str:
    """Choose mode based on document characteristics."""
    if is_batch:
        return "one-to-one"
    elif page_count > 10:
        return "many-to-one"  # Large doc: also enable use_chunking
    else:
        return "many-to-one"  # Small doc, process as one

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode=get_processing_mode(page_count=15, is_batch=False)
)
Best Practices¶
👍 Match Mode to Document Structure¶
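from docling_graph import PipelineConfig

document_is_batch = True  # Example flag; set it from what you know about the input

# ✅ Good - Mode matches document structure
if document_is_batch:
    mode = "one-to-one"
else:
    mode = "many-to-one"

config = PipelineConfig(
    source="document.pdf",
    template="templates.BillingDocument",
    processing_mode=mode
)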
👍 Enable Appropriate Settings¶
from docling_graph import PipelineConfig

mode = "one-to-one"  # Example value; use the mode chosen for your document

# ✅ Good - Settings match mode
if mode == "one-to-one":
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        processing_mode="one-to-one",
        export_per_page_markdown=True  # Page-specific
    )
else:
    config = PipelineConfig(
        source="document.pdf",
        template="templates.BillingDocument",
        processing_mode="many-to-one",
        use_chunking=True,  # Document-wide
        llm_consolidation=True
    )
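One way to avoid repeating the whole constructor call in each branch is to keep the mode-specific settings in a helper and unpack them. `mode_settings` is a hypothetical function written for this example; the setting names and values are the ones used in this guide.

```python
def mode_settings(mode: str) -> dict:
    """Return the keyword arguments this guide pairs with each processing mode."""
    if mode == "one-to-one":
        return {
            "processing_mode": "one-to-one",
            "export_per_page_markdown": True,  # Page-specific
        }
    if mode == "many-to-one":
        return {
            "processing_mode": "many-to-one",
            "use_chunking": True,  # Document-wide
            "llm_consolidation": True,
        }
    raise ValueError(f"Unknown processing mode: {mode!r}")


settings = mode_settings("many-to-one")
print(settings)
```

The result can then be unpacked into the constructor, for example `PipelineConfig(source="document.pdf", template="templates.BillingDocument", **mode_settings(mode))`, and an unknown mode string fails early instead of silently falling into the `else` branch.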
👍 Consider Performance¶
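from docling_graph import PipelineConfig

page_count = 60  # Example value; determine this from your own input

# ✅ Good - Consider document size
if page_count > 50:
    # Large batch: one-to-one runs one extraction per page and may be slow
    print("Warning: processing more than 50 pages individually")

config = PipelineConfig(
    source="large_batch.pdf",
    template="templates.BillingDocument",
    processing_mode="one-to-one"
)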
Next Steps¶
Now that you understand processing modes:
- Docling Settings - Configure document conversion
- Export Configuration - Set output formats
- Configuration Examples - See complete scenarios