RAG Data Generation
Retrieval-augmented generation (RAG) training data requires conversations where the assistant synthesizes responses from retrieved documents rather than relying on parametric knowledge alone. DiGiT generates these conversations by treating the retriever as a tool: the same tool infrastructure that drives tool-calling data generation handles document retrieval, and the same conversation pipeline manages the multi-turn loop.
Retriever as tool
Rather than building a parallel retrieval-specific subsystem, DiGiT models retrieval as a special case of tool use. A SearchToolEngine is a ToolEngine that returns documents instead of arbitrary API responses. This means:
- The task YAML configures retrieval the same way it configures any other tool engine.
- Document samplers reuse the sampler abstraction already present for tool selection.
- The conversation pipeline, flow controller, and persona machinery are shared with non-RAG databuilders.
The RAG-specific behavior lives entirely in the stages: scenario initialization samples documents and grounds the scenario, and the assistant stage synthesizes from a fixed or live document context rather than calling a general-purpose tool.
Two modes
Static
Documents are sampled once at scenario initialization and injected into every subsequent stage as fixed context. The assistant never issues a retrieval call during the conversation; it synthesizes directly from the document set it was given.
Use static mode when:
- You want to train faithfulness and grounded synthesis without retrieval mechanics.
- Your corpus is small enough to pre-load into memory or a local file.
- You want reproducible conversations tied to a known document set.
Live
The assistant issues a retrieval tool call each turn. The engine executes the query against a live backend (Elasticsearch, or any registered SearchToolEngine) and returns results as ToolCallStep/ToolResultStep pairs in the conversation. The output contains the full retrieval trace.
Use live mode when:
- You want to train the model to formulate retrieval queries.
- You want the training data to reflect the actual retrieval behavior of a production system.
- Your corpus is too large to pre-load.
Components
DocumentSampler ──► SearchToolEngine
│
┌─────┴──────┐
▼ ▼
Static mode Live mode
(inject at init) (call per turn)
SearchToolEngine
A SearchToolEngine wraps a document corpus and exposes a search interface. Three backends are available:
| Type | When to use |
|---|---|
search/in_memory |
Small corpora loaded at startup; fastest, no external dependencies |
search/file |
JSONL corpus on disk; loaded lazily, suitable for medium-sized corpora |
search/elasticsearch |
Large corpora or production-mirroring; requires a running ES cluster |
All three implement the same interface. The projection field maps corpus field names to the internal Document schema (body, doc_id, title, domain).
DocumentSampler
A DocumentSampler selects a subset of documents from the corpus to ground a scenario. It runs during initialization, before any turns are generated.
| Type | Behavior |
|---|---|
search/random |
Uniform random sample, optionally grouped by a corpus field (e.g., domain) |
The group_by field stratifies sampling so each scenario is grounded in documents from a single domain or category, which produces more coherent conversations.
YAML configuration
Static mode (file corpus)
tools:
engines:
file_retriever:
type: search/file
path: ${DGT_DATA_DIR}/public/rag/static/my_corpus/documents.jsonl
format: jsonl
projection:
body: text
doc_id: id
limit: 3
initialization_stages:
- name: lm/scenario/rag
generator: generator
document_samplers:
- type: search/random
engine: file_retriever
group_by: domain
strategy: uniform
weight: 1.0
k: 3
iteration_stages:
- name: lm/flow_controller/rag
generator: generator
patterns: [...]
- name: lm/user/rag
generator: generator
- name: lm/assistant/rag/static
generator: generator
Static mode (in-memory corpus)
tools:
engines:
memory_retriever:
type: search/in_memory
projection:
body: text
doc_id: id
limit: 5
initialization_stages:
- name: lm/scenario/rag
generator: generator
document_samplers:
- type: search/random
engine: memory_retriever
strategy: uniform
weight: 1.0
k: 5
Documents are loaded into the engine at task startup via the corpus field or programmatically. Use this backend for small, static corpora where startup latency is acceptable.
Live mode (Elasticsearch)
tools:
engines:
es_retriever:
type: search/elasticsearch
hosts: ["https://localhost:9200"]
default_index: my_index
projection:
body: content
doc_id: _id
title: title
limit: 5
initialization_stages:
- name: lm/scenario/rag
generator: generator
document_samplers:
- type: search/random
engine: es_retriever
strategy: uniform
weight: 1.0
k: 5
iteration_stages:
- name: lm/flow_controller/rag
generator: generator
patterns: [...]
- name: lm/user/rag
generator: generator
- name: lm/assistant/rag/live
generator: generator
In live mode the assistant stage issues a search tool call each turn. The output includes ToolCallStep and ToolResultStep entries alongside the user and assistant turns.
Corpus format
Documents must be JSONL with at minimum a body field and a unique identifier. Additional fields (title, domain) are optional but improve sampler behavior when using group_by.
{"id": "doc_001", "title": "Vehicle Registration", "domain": "dmv", "text": "Your registration is the sticker placed on your windshield..."}
The projection block in the engine config maps your field names to the internal schema:
projection:
body: text # required — the document text
doc_id: id # required — unique identifier
title: title # optional
domain: domain # optional; used by group_by in samplers
Seed example format
Seed examples are complete conversations used as in-context learning examples by the scenario and stage LMs. Each record follows the ConversationDataPoint schema: a list of steps with typed roles (scenario, persona, flow_controller, user, assistant).
See data/public/rag/static/multi_doc2dial/dmv/seed_examples.jsonl for a reference.
Reading path
| I want to... | Go to |
|---|---|
| Run a working RAG example end to end | RAG: Multi-Doc2Dial |
| Understand the conversation pipeline | Conversation Databuilder |
| Use the tool subsystem directly | Tools |