Markdown Input Example¶
Overview¶
This example demonstrates how to process Markdown documents directly, extracting structured data from formatted text without requiring OCR or visual processing.
Time: 10 minutes
Use Case: Documentation Analysis¶
Extract structured information from project documentation, including sections, code examples, and metadata.
Document Source¶
File: README.md or DOCUMENTATION.md
Type: Markdown
Content: Project documentation with sections, code blocks, and structured information.
Template Definition¶
We'll create a template for documentation that captures sections, code examples, and metadata.
from pydantic import BaseModel, Field

from docling_graph.utils import edge

class CodeExample(BaseModel):
    """Code example component."""

    model_config = {'is_entity': False}

    language: str = Field(description="Programming language")
    code: str = Field(description="Code snippet")
    description: str = Field(description="What the code does")

class Section(BaseModel):
    """Documentation section entity."""

    model_config = {
        'is_entity': True,
        'graph_id_fields': ['title']
    }

    title: str = Field(description="Section title")
    content: str = Field(description="Section content")
    subsections: list[str] = Field(
        default_factory=list,
        description="Subsection titles"
    )

class Documentation(BaseModel):
    """Complete documentation structure."""

    model_config = {'is_entity': True}

    title: str = Field(description="Document title")
    description: str = Field(description="Project description")
    version: str | None = Field(
        default=None,
        description="Documentation version"
    )
    sections: list[Section] = edge(
        "HAS_SECTION",
        description="Documentation sections"
    )
    code_examples: list[CodeExample] = Field(
        default_factory=list,
        description="Code examples"
    )
    requirements: list[str] = Field(
        default_factory=list,
        description="Project requirements"
    )
Save as: templates/documentation.py
Processing with CLI¶
Basic Markdown Processing¶
# Process README.md
uv run docling-graph convert README.md \
--template "templates.documentation.Documentation" \
--backend llm \
--inference remote
Important: Markdown files require the LLM backend; the VLM backend doesn't support text-only inputs.
With Local LLM¶
# Use local Ollama
uv run docling-graph convert DOCUMENTATION.md \
--template "templates.documentation.Documentation" \
--backend llm \
--inference local \
--provider ollama \
--model llama3.1:8b
With Chunking¶
# Process large markdown with chunking
uv run docling-graph convert LARGE_DOC.md \
--template "templates.documentation.Documentation" \
--backend llm \
--inference remote \
--use-chunking \
--llm-consolidation
Processing with Python API¶
Basic Usage¶
from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation
# Configure pipeline for Markdown input
config = PipelineConfig(
    source="README.md",
    template=Documentation,
    backend="llm",  # Required for text inputs
    inference="remote",
    processing_mode="many-to-one"
)

# Run pipeline
run_pipeline(config)
Processing Multiple Markdown Files¶
from pathlib import Path
from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation
# Process all markdown files in a directory
docs_dir = Path("docs")
markdown_files = docs_dir.glob("**/*.md")
for md_file in markdown_files:
    print(f"Processing: {md_file}")

    config = PipelineConfig(
        source=str(md_file),
        template=Documentation,
        backend="llm",
        inference="remote",
        processing_mode="many-to-one"
    )

    try:
        run_pipeline(config)
        print(f"✅ Completed: {md_file}")
    except Exception as e:
        print(f"❌ Failed: {md_file} - {e}")
With Custom Provider¶
from docling_graph import run_pipeline, PipelineConfig
from templates.documentation import Documentation
# Use specific LLM provider
config = PipelineConfig(
    source="API_DOCS.md",
    template=Documentation,
    backend="llm",
    inference="remote",
    provider_override="openai",
    model_override="gpt-4-turbo",
    use_chunking=True
)
run_pipeline(config)
Expected Output¶
Graph Structure¶
Documentation (root node)
├── HAS_SECTION → Section (Installation)
│ ├── title: "Installation"
│ ├── content: "..."
│ └── subsections: ["Requirements", "Setup"]
├── HAS_SECTION → Section (Usage)
│ ├── title: "Usage"
│ └── content: "..."
├── code_examples (list)
│ ├── CodeExample 1: Python
│ └── CodeExample 2: Bash
└── requirements: ["Python 3.10+", "uv"]
CSV Export¶
nodes.csv (columns vary by node type):
node_id,node_type,title,description,version
doc_1,Documentation,"Project Name","Description...","1.0.0"
node_id,node_type,title,content
section_installation,Section,"Installation","Installation instructions..."
section_usage,Section,"Usage","Usage guide..."
edges.csv:
source_id,target_id,edge_type
doc_1,section_installation,HAS_SECTION
doc_1,section_usage,HAS_SECTION
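The exported edges can be inspected with the standard-library csv module; a minimal sketch using the sample rows shown above:

```python
import csv
import io

# Sample rows from the edges.csv export shown above
edges_csv = """source_id,target_id,edge_type
doc_1,section_installation,HAS_SECTION
doc_1,section_usage,HAS_SECTION
"""

# In practice you would open the exported edges.csv file instead
rows = list(csv.DictReader(io.StringIO(edges_csv)))
section_ids = [r["target_id"] for r in rows if r["edge_type"] == "HAS_SECTION"]
```

The same pattern works for nodes.csv, filtering on the `node_type` column instead.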
Markdown Processing Features¶
What Gets Processed¶
The pipeline extracts:
- Headers → Section titles
- Paragraphs → Content
- Code blocks → Code examples
- Lists → Requirements, features
- Links → References
- Tables → Structured data
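As a rough illustration of the header-to-section mapping (a regex sketch, not the pipeline's actual parser):

```python
import re

sample = """\
# My Project
Intro paragraph.

## Installation
Run the installer.

## Usage
Call the CLI.
"""

# Each "##" header would typically become a Section title in the extracted graph
headers = re.findall(r"^(#+)\s+(.*)$", sample, flags=re.MULTILINE)
section_titles = [title for hashes, title in headers if len(hashes) == 2]
```

The real extraction is done by the LLM against your template, so the mapping is semantic rather than purely structural.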
Markdown Preservation¶
The original Markdown formatting is preserved in the extracted content, allowing you to:
- Maintain code block syntax
- Preserve link references
- Keep list structures
- Retain emphasis and formatting
Text-Only Pipeline¶
Markdown input skips the visual stages and goes straight to text extraction:
❌ OCR (no visual processing needed)
❌ Page segmentation (single text stream)
✅ Direct LLM extraction
✅ Semantic chunking (if enabled)
Troubleshooting¶
🐛 VLM Backend Error¶
Error:
Solution:
# Always use LLM backend for Markdown
uv run docling-graph convert README.md \
--template "templates.documentation.Documentation" \
--backend llm # Required
🐛 Empty File¶
Error:
Solution:
# Ensure file has content
cat README.md # Check file content
file README.md # Verify file type
# If file is empty, add content first
echo "# Documentation" > README.md
🐛 Encoding Problems¶
Error:
Solution:
# Convert file to UTF-8 first
with open("README.md", "r", encoding="latin-1") as f:
    content = f.read()

with open("README_utf8.md", "w", encoding="utf-8") as f:
    f.write(content)

# Then process
config = PipelineConfig(source="README_utf8.md", ...)
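If the source encoding is unknown, a stdlib-only fallback loop can be used before processing (an illustrative sketch; `decode_any` is a hypothetical helper, not part of docling-graph):

```python
def decode_any(data: bytes) -> tuple[str, str]:
    """Try common encodings in order; latin-1 always succeeds as a last resort."""
    for enc in ("utf-8", "latin-1"):
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes any byte string")
```

Note that latin-1 maps every byte to a character, so this never raises; it may still produce mojibake for files in other encodings.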
Best Practices¶
1. Use Descriptive Section Headers¶
✅ Good - Clear hierarchy
# Installation Guide
## Requirements
## Setup Steps
❌ Bad - Unclear structure
# Stuff
## Things
2. Include Code Language Tags¶
✅ Good - Every code fence carries a language tag (e.g. `python`, `bash`)
❌ Bad - Bare fences with no language tag
3. Structure Content Logically¶
✅ Good - Logical flow
# Overview
# Installation
# Usage
# Examples
# Troubleshooting
❌ Bad - Random order
# Examples
# Overview
# Troubleshooting
# Installation
4. Use Consistent Formatting¶
✅ Good - Consistent style
- Item 1
- Item 2
- Item 3
❌ Bad - Mixed styles
- Item 1
* Item 2
+ Item 3
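A quick way to check marker consistency before processing is a small lint sketch (illustrative only; `list_markers` is a hypothetical helper, not part of docling-graph):

```python
import re

def list_markers(md: str) -> set[str]:
    """Collect the bullet markers (-, *, +) used in a markdown string."""
    return set(re.findall(r"^\s*([-*+])\s+", md, flags=re.MULTILINE))

mixed = "- Item 1\n* Item 2\n+ Item 3\n"
consistent = "- Item 1\n- Item 2\n- Item 3\n"
```

If `list_markers` returns more than one marker, normalize the file before extraction.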
Advanced Usage¶
Processing Markdown from String¶
from docling_graph import PipelineConfig, run_pipeline
from templates.documentation import Documentation
# Markdown content as string
markdown_content = """
# My Project
## Overview
This is a sample project.
## Features
- Feature 1
- Feature 2
"""
# Process directly (API mode only)
config = PipelineConfig(
    source=markdown_content,
    template=Documentation,
    backend="llm",
    inference="remote",
    processing_mode="many-to-one"
)
run_pipeline(config, mode="api") # mode="api" required for string input
Combining Multiple Markdown Files¶
from pathlib import Path
# Combine multiple markdown files
md_files = ["intro.md", "guide.md", "reference.md"]
combined_content = "\n\n---\n\n".join(
    Path(f).read_text() for f in md_files
)
# Save combined file
Path("combined.md").write_text(combined_content)
# Process combined file
config = PipelineConfig(
    source="combined.md",
    template=Documentation,
    backend="llm",
    inference="remote"
)
run_pipeline(config)
Extracting Specific Sections¶
from pydantic import BaseModel, Field

from docling_graph import PipelineConfig, run_pipeline

class QuickStart(BaseModel):
    """Extract only the quickstart-related sections."""

    model_config = {'is_entity': True}

    installation: str = Field(description="Installation instructions")
    basic_usage: str = Field(description="Basic usage example")
    next_steps: list[str] = Field(description="Next steps")

# Process with the focused template
config = PipelineConfig(
    source="README.md",
    template=QuickStart,
    backend="llm",
    inference="remote"
)

run_pipeline(config)
Comparison: Markdown vs PDF¶
| Feature | Markdown | PDF |
|---|---|---|
| OCR Required | ❌ No | ✅ Yes |
| Processing Speed | ⚡ Fast | 🐢 Slower |
| Backend Support | LLM only | LLM + VLM |
| Structure Preservation | ✅ Excellent | ⚠️ Variable |
| Code Blocks | ✅ Native | ⚠️ Extracted |
| Best For | Documentation, Notes | Scanned docs, Forms |
Next Steps¶
- DoclingDocument Input - Use pre-processed documents
- Input Formats Guide - Complete input format reference
- LLM Backend Configuration - Configure LLM settings