Welcome to DGT

High-quality data is the backbone of modern AI development, but acquiring diverse, domain-specific, and scalable datasets remains a major bottleneck. Synthetic data generation addresses this challenge by enabling the creation of tailored datasets that are:

  • Cost-effective and privacy-preserving
  • Customizable for specific tasks and domains
  • Scalable to meet evolving model needs

DGT (Data Generation and Transformation), pronounced "digit", is a horizontal framework that streamlines and scales expert, domain-specific synthetic data generation by simplifying and standardizing its essential components.

Features

  • 🤖 Standardized interface for 5+ LM engines (WatsonX, OpenAI, Azure OpenAI, vLLM, Ollama, Anthropic, etc.) with retry/fallback logic
  • 💡 Support for several domain-specific pipelines for tool calling, time series, question answering, and more
  • 🧪 Growing list of syntactic validators, deduplicators, and LLMaJs (LLM-as-a-Judge)
  • 🔒 Local execution capabilities for sensitive data and air-gapped environments
  • 🤖 Plug-and-play integrations, incl. Docling
  • 💻 Simple and convenient CLI
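
To give a feel for what "retry/fallback logic" across LM engines can mean, here is a minimal, generic sketch in plain Python. It is not DGT's API: `generate_with_fallback`, `flaky_engine`, and `backup_engine` are hypothetical names invented for this illustration, and each engine is assumed to be a simple callable that takes a prompt and returns text.

```python
import time
from typing import Callable, List, Sequence

# Hypothetical type: an "engine" is any callable mapping a prompt to a completion.
Engine = Callable[[str], str]


def generate_with_fallback(
    prompt: str,
    engines: Sequence[Engine],
    max_retries: int = 2,
    base_delay: float = 0.0,
) -> str:
    """Try each engine in order, retrying failures before falling back to the next."""
    errors: List[Exception] = []
    for engine in engines:
        for attempt in range(max_retries + 1):
            try:
                return engine(prompt)
            except Exception as exc:  # a real client would catch narrower error types
                errors.append(exc)
                if attempt < max_retries:
                    # Exponential backoff between retries on the same engine.
                    time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"all engines failed after {len(errors)} attempts")


# Hypothetical stand-ins for real LM clients.
def flaky_engine(prompt: str) -> str:
    raise ConnectionError("primary engine unavailable")


def backup_engine(prompt: str) -> str:
    return f"backup: {prompt}"


result = generate_with_fallback("hello", [flaky_engine, backup_engine])
print(result)
```

In this sketch the primary engine is retried `max_retries` times before the backup is tried, and an error is raised only once every engine has been exhausted. How DGT itself orders engines or schedules retries is not described in this section, so treat the policy above as one plausible design rather than the framework's behavior.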