Welcome to DGT

DGT (Data Generation and Transformation, pronounced "digit") is a framework for building synthetic data pipelines that generate training data for fine-tuning large language models.

Write a handful of seed examples. Point DGT at a model. Get a dataset. Typical runs take under five minutes on a laptop with Ollama.

What it does

High-quality, domain-specific training data is the biggest bottleneck in LLM fine-tuning. DGT addresses this by letting you:

  • Generate new examples from a small seed set using any LLM as a teacher model
  • Transform existing data (add chain-of-thought, score for difficulty, reformat, filter)
  • Compose generation and transformation stages into multi-step pipelines
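The generate-then-transform flow above amounts to ordinary function composition over a list of records. The sketch below is purely conceptual: the function names, record fields, and scoring logic are invented for illustration and are not DGT's actual API.

```python
# Conceptual sketch of a generate -> transform pipeline.
# NOTE: all names and fields here are invented for illustration;
# they are not DGT's real API.

def generate(seeds, n):
    """Stand-in for a teacher model: derive n new records from seeds."""
    pool = seeds * ((n // len(seeds)) + 1)
    return [{"question": f"{s['question']} (variant {i})", "answer": s["answer"]}
            for i, s in enumerate(pool)][:n]

def add_difficulty(records):
    """Stand-in transform stage: annotate each record with a score."""
    return [{**r, "difficulty": len(r["question"]) % 5} for r in records]

def pipeline(seeds, n, stages):
    """Compose a generation stage with any number of transform stages."""
    data = generate(seeds, n)
    for stage in stages:
        data = stage(data)
    return data

seeds = [{"question": "What is the capital of France?", "answer": "Paris"}]
out = pipeline(seeds, 3, [add_difficulty])
print(len(out))  # 3
```

The point is only the shape of the computation: generation expands a small seed set, and each transform stage maps the whole dataset to an augmented or filtered version of itself.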

Features

  • 6 LM engines out of the box: Ollama, OpenAI, Azure OpenAI, Anthropic, WatsonX, vLLM — switch with a one-line config change
  • Built-in quality controls: deduplicators, syntactic validators, LLM-as-a-Judge scoring
  • Concurrent execution: async batch requests across all providers for fast throughput
  • Local-first: runs entirely on your machine for sensitive data and air-gapped environments
  • Extensible: add a new databuilder in three files; plug into the same CLI and engine layer
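To make the "one-line config change" concrete, an engine swap might look like the fragment below. The key names here are hypothetical, chosen for illustration; consult the task configs shipped in the repository for the real schema.

```yaml
# Hypothetical task-config fragment -- key names are illustrative,
# not DGT's actual schema.
generator:
  lm: ollama            # swap to openai, azure-openai, anthropic, watsonx, or vllm
  model: granite4:3b    # local model pulled via `ollama pull`
```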

Get started in 5 minutes

# 1. Clone and install
git clone git@github.com:IBM/fms-dgt.git && cd fms-dgt
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[all]"

# 2. Pull a local model (no API key needed)
ollama pull granite4:3b

# 3. Generate 20 geography QA pairs
python -m fms_dgt.public \
  --task-paths ./tasks/public/examples/qa/task.yaml \
  --num-outputs-to-generate 20 \
  --restart

Output: output/public/examples/geography_qa/final_data.jsonl
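Each line of final_data.jsonl is one JSON record, so the result is easy to inspect with a few lines of Python. This is a generic JSON-Lines reader, not a DGT utility, and the record fields will depend on the task you ran.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON-Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

out = Path("output/public/examples/geography_qa/final_data.jsonl")
if out.exists():
    records = load_jsonl(out)
    print(f"{len(records)} records; first record keys: {sorted(records[0])}")
```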

Ready to do more? See the Quick Start for the rater example, cloud-provider setup, and the CLI reference.