
Markdown Input Example

Overview

This example demonstrates how to process Markdown documents directly, extracting structured data from formatted text without requiring OCR or visual processing.

Time: 10 minutes


Use Case: Documentation Analysis

Extract structured information from project documentation, including sections, code examples, and metadata.

Document Source

File: README.md or DOCUMENTATION.md

Type: Markdown

Content: Project documentation with sections, code blocks, and structured information.


Template Definition

We'll create a template for documentation that captures sections, code examples, and metadata.

from pydantic import BaseModel, Field
from docling_graph.utils import edge

class CodeExample(BaseModel):
    """Code example component."""
    model_config = {'is_entity': False}

    language: str = Field(description="Programming language")
    code: str = Field(description="Code snippet")
    description: str = Field(description="What the code does")

class Section(BaseModel):
    """Documentation section entity."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['title']
    }

    title: str = Field(description="Section title")
    content: str = Field(description="Section content")
    subsections: list[str] = Field(
        default_factory=list,
        description="Subsection titles"
    )

class Documentation(BaseModel):
    """Complete documentation structure."""
    model_config = {'is_entity': True}

    title: str = Field(description="Document title")
    description: str = Field(description="Project description")
    version: str | None = Field(
        default=None,
        description="Documentation version"
    )
    sections: list[Section] = edge(
        "HAS_SECTION",
        description="Documentation sections"
    )
    code_examples: list[CodeExample] = Field(
        default_factory=list,
        description="Code examples"
    )
    requirements: list[str] = Field(
        default_factory=list,
        description="Project requirements"
    )

Save as: templates/documentation.py


Processing with CLI

Basic Markdown Processing

# Process README.md
uv run docling-graph convert README.md \
    --template "templates.documentation.Documentation" \
    --backend llm \
    --inference remote

Important: Markdown files require the LLM backend (the VLM backend doesn't support text-only inputs).

With Local LLM

# Use local Ollama
uv run docling-graph convert DOCUMENTATION.md \
    --template "templates.documentation.Documentation" \
    --backend llm \
    --inference local \
    --provider ollama \
    --model llama3.1:8b

With Chunking

# Process large markdown with chunking
uv run docling-graph convert LARGE_DOC.md \
    --template "templates.documentation.Documentation" \
    --backend llm \
    --inference remote \
    --use-chunking \
    --llm-consolidation

Processing with Python API

Basic Usage

from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation

# Configure pipeline for Markdown input
config = PipelineConfig(
    source="README.md",
    template=Documentation,
    backend="llm",  # Required for text inputs
    inference="remote",
    processing_mode="many-to-one"
)

# Run pipeline
run_pipeline(config)

Processing Multiple Markdown Files

from pathlib import Path
from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation

# Process all markdown files in a directory (sorted for deterministic order)
docs_dir = Path("docs")
markdown_files = sorted(docs_dir.rglob("*.md"))

for md_file in markdown_files:
    print(f"Processing: {md_file}")

    config = PipelineConfig(
        source=str(md_file),
        template=Documentation,
        backend="llm",
        inference="remote",
        processing_mode="many-to-one"
    )

    try:
        run_pipeline(config)
        print(f"✅ Completed: {md_file}")
    except Exception as e:
        print(f"❌ Failed: {md_file} - {e}")

With Custom Provider

from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation

# Use specific LLM provider
config = PipelineConfig(
    source="API_DOCS.md",
    template=Documentation,
    backend="llm",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo",
    use_chunking=True
)

run_pipeline(config)

Expected Output

Graph Structure

Documentation (root node)
├── HAS_SECTION → Section (Installation)
│   ├── title: "Installation"
│   ├── content: "..."
│   └── subsections: ["Requirements", "Setup"]
├── HAS_SECTION → Section (Usage)
│   ├── title: "Usage"
│   └── content: "..."
├── code_examples (list)
│   ├── CodeExample 1: Python
│   └── CodeExample 2: Bash
└── requirements: ["Python 3.10+", "uv"]

CSV Export

nodes.csv:

node_id,node_type,title,description,version
doc_1,Documentation,"Project Name","Description...","1.0.0"

node_id,node_type,title,content
section_installation,Section,"Installation","Installation instructions..."
section_usage,Section,"Usage","Usage guide..."

edges.csv:

source_id,target_id,edge_type
doc_1,section_installation,HAS_SECTION
doc_1,section_usage,HAS_SECTION
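
Assuming the edges.csv layout shown above, the export can be loaded back into an adjacency map with the standard library alone (a minimal sketch; the file path and column names follow the example export, not a guaranteed schema):

```python
import csv
import io

# edges.csv content as exported above (inlined here for a self-contained demo)
edges_csv = """source_id,target_id,edge_type
doc_1,section_installation,HAS_SECTION
doc_1,section_usage,HAS_SECTION
"""

# Build a source -> targets adjacency map from the edge list
adjacency: dict[str, list[str]] = {}
for row in csv.DictReader(io.StringIO(edges_csv)):
    adjacency.setdefault(row["source_id"], []).append(row["target_id"])

print(adjacency)  # {'doc_1': ['section_installation', 'section_usage']}
```

In practice you would pass a real file handle (`open("edges.csv")`) instead of the inlined `io.StringIO` string.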


Markdown Processing Features

What Gets Processed

The pipeline extracts:

- Headers → Section titles
- Paragraphs → Content
- Code blocks → Code examples
- Lists → Requirements, features
- Links → References
- Tables → Structured data
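
For intuition, the header-to-section mapping can be sketched with the standard library. This is a simplified illustration (top-level `#` headers only), not docling-graph's actual parser:

```python
import re

def split_sections(markdown: str) -> dict[str, str]:
    """Split a Markdown string into {header: body} pairs (top-level '#' only)."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^#\s+(.*)", line)
        if match:
            # Start a new section at each top-level header
            current = match.group(1).strip()
            sections[current] = ""
        elif current is not None:
            # Accumulate body text under the most recent header
            sections[current] += line + "\n"
    return sections

doc = "# Install\npip install x\n# Usage\nrun it\n"
print(list(split_sections(doc)))  # → ['Install', 'Usage']
```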

Markdown Preservation

The original Markdown formatting is preserved in the extracted content, allowing you to:

- Maintain code block syntax
- Preserve link references
- Keep list structures
- Retain emphasis and formatting
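
As an illustration of why preservation matters, fenced code blocks and their language tags are mechanically recoverable from raw Markdown. This sketch uses a plain regex and is not docling-graph's implementation:

```python
import re

FENCE = "`" * 3  # a literal ``` fence marker
PATTERN = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)

def extract_code_blocks(markdown: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs for each fenced code block."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(markdown)]

sample = "Intro\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\n"
print(extract_code_blocks(sample))  # → [('python', "print('hi')\n")]
```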

Text-Only Pipeline

Markdown files skip:
❌ OCR (no visual processing needed)
❌ Page segmentation (single text stream)
✅ Direct LLM extraction
✅ Semantic chunking (if enabled)


Troubleshooting

🐛 VLM Backend Error

Error:

ExtractionError: VLM backend does not support text-only inputs

Solution:

# Always use LLM backend for Markdown
uv run docling-graph convert README.md \
    --template "templates.documentation.Documentation" \
    --backend llm  # Required

🐛 Empty File

Error:

ValidationError: Text input is empty

Solution:

# Ensure file has content
cat README.md  # Check file content
file README.md  # Verify file type

# If file is empty, add content first
echo "# Documentation" > README.md

🐛 Encoding Problems

Error:

ValidationError: Failed to read text file: encoding error

Solution:

# Convert file to UTF-8 first
with open("README.md", "r", encoding="latin-1") as f:
    content = f.read()

with open("README_utf8.md", "w", encoding="utf-8") as f:
    f.write(content)

# Then process
config = PipelineConfig(source="README_utf8.md", ...)


Best Practices

1. Use Descriptive Section Headers

✅ Good - Clear hierarchy
# Installation Guide
## Requirements
## Setup Steps

❌ Bad - Unclear structure
# Stuff
## Things

2. Include Code Language Tags

✅ Good - Language specified
```python
def hello():
    print("Hello")
```

❌ Bad - No language
```
def hello():
    print("Hello")
```

3. Structure Content Logically

✅ Good - Logical flow
# Overview
# Installation
# Usage
# Examples
# Troubleshooting

❌ Bad - Random order
# Examples
# Overview
# Troubleshooting
# Installation

4. Use Consistent Formatting

✅ Good - Consistent style
- Item 1
- Item 2
- Item 3

❌ Bad - Mixed styles
- Item 1
* Item 2
+ Item 3
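
Mixed bullet styles can be caught mechanically before processing. A minimal checker (illustrative only, using the standard library):

```python
import re
from collections import Counter

def bullet_styles(markdown: str) -> Counter:
    """Count the bullet markers (-, *, +) used, to flag mixed list styles."""
    return Counter(re.findall(r"^\s*([-*+])\s", markdown, re.MULTILINE))

mixed = "- Item 1\n* Item 2\n+ Item 3\n"
consistent = "- Item 1\n- Item 2\n- Item 3\n"
print(len(bullet_styles(mixed)))       # 3 distinct markers: mixed styles
print(len(bullet_styles(consistent)))  # 1 marker: consistent
```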

Advanced Usage

Processing Markdown from String

from docling_graph import PipelineConfig, run_pipeline
from templates.documentation import Documentation

# Markdown content as string
markdown_content = """
# My Project

## Overview
This is a sample project.

## Features
- Feature 1
- Feature 2
"""

# Process directly (API mode only)
config = PipelineConfig(
    source=markdown_content,
    template=Documentation,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one"
)

run_pipeline(config, mode="api")  # mode="api" required for string input

Combining Multiple Markdown Files

from pathlib import Path

# Combine multiple markdown files
md_files = ["intro.md", "guide.md", "reference.md"]
combined_content = "\n\n---\n\n".join(
    Path(f).read_text() for f in md_files
)

# Save combined file
Path("combined.md").write_text(combined_content)

# Process combined file
config = PipelineConfig(
    source="combined.md",
    template=Documentation,
    backend="llm",
    inference="remote"
)
run_pipeline(config)

Extracting Specific Sections

from pydantic import BaseModel, Field
from docling_graph import run_pipeline, PipelineConfig

class QuickStart(BaseModel):
    """Extract only quickstart information."""
    model_config = {'is_entity': True}

    installation: str = Field(description="Installation instructions")
    basic_usage: str = Field(description="Basic usage example")
    next_steps: list[str] = Field(description="Next steps")

# Process with focused template
config = PipelineConfig(
    source="README.md",
    template=QuickStart,
    backend="llm",
    inference="remote"
)

run_pipeline(config)

Comparison: Markdown vs PDF

| Feature | Markdown | PDF |
| --- | --- | --- |
| OCR Required | ❌ No | ✅ Yes |
| Processing Speed | ⚡ Fast | 🐢 Slower |
| Backend Support | LLM only | LLM + VLM |
| Structure Preservation | ✅ Excellent | ⚠️ Variable |
| Code Blocks | ✅ Native | ⚠️ Extracted |
| Best For | Documentation, Notes | Scanned docs, Forms |

Next Steps