Quick Start

If you just cloned the repo and ran the geography QA example from the home page, you are already set up. This page covers what comes next: transforming data, switching to a cloud provider, and the full CLI reference.

Rate the generated data

DiGiT can also transform existing data. The rater example scores each geography QA pair for difficulty using an LLM:

python -m fms_dgt.public \
  --task-paths ./tasks/public/examples/rate/task.yaml \
  --restart

Output lands in output/public/examples/qa_ratings/final_data.jsonl. Each record adds a rating field to the original QA pair:

{
  "task_name": "public/examples/qa_ratings",
  "is_seed": false,
  "question": "What is the deepest lake in the world?",
  "answer": "Lake Baikal in Siberia, Russia, is the deepest lake in the world.",
  "rating": 2
}
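To get a quick feel for the scores, the output file can be summarized with a few lines of Python. This is a minimal sketch that assumes only what the record above shows: one JSON object per line with a rating field. The helper name rating_histogram is hypothetical and not part of DiGiT.

```python
import json
from collections import Counter
from pathlib import Path


def rating_histogram(path):
    """Count how many records carry each rating value in a JSONL file."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Records without a rating field are counted under None.
            counts[record.get("rating")] += 1
    return counts
```

Calling rating_histogram("output/public/examples/qa_ratings/final_data.jsonl") after the run returns a Counter mapping each rating to the number of QA pairs that received it.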

Using a cloud provider

Pass a --config-path to override the LM engine without editing any YAML files. For OpenAI:

export OPENAI_API_KEY=your-api-key

python -m fms_dgt.public \
  --task-paths ./tasks/public/examples/qa/task.yaml \
  --config-path ./configs/public/examples/openai_qa.yaml \
  --num-outputs-to-generate 20 \
  --restart

See Changing the Language Model Engine for WatsonX, Anthropic, and other providers.

CLI reference

python -m fms_dgt.public --help

Flags

| Flag | Description |
| --- | --- |
| --task-paths | Path(s) to task YAML files |
| --config-path | Override LM engine or model without editing the builder YAML |
| --num-outputs-to-generate | Number of synthetic examples to produce per generation task |
| --restart | Discard previous output and start fresh |
| --output-dir | Directory to write generated data (overrides DGT_OUTPUT_DIR) |
| --data-dir | Directory to load input data from (overrides DGT_DATA_DIR) |
| --include-namespaces | Additional databuilder namespaces to load (e.g. public) |
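When runs are launched from a script rather than a shell, the flags above can be assembled into an argument list. The sketch below only builds the command; the helper name build_command is hypothetical, and passing several task paths as separate values after a single --task-paths flag is an assumption to check against the --help output.

```python
import sys


def build_command(task_paths, config_path=None, num_outputs=None, restart=False):
    """Assemble an argv list for `python -m fms_dgt.public` from the documented flags."""
    # Assumption: multiple task paths follow one --task-paths flag as separate values.
    cmd = [sys.executable, "-m", "fms_dgt.public", "--task-paths", *task_paths]
    if config_path is not None:
        cmd += ["--config-path", str(config_path)]
    if num_outputs is not None:
        cmd += ["--num-outputs-to-generate", str(num_outputs)]
    if restart:
        cmd.append("--restart")
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), which avoids shell quoting issues with paths that contain spaces.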

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| DGT_DATA_DIR | data/ | Root directory for input data files referenced by ${DGT_DATA_DIR} in task YAMLs |
| DGT_OUTPUT_DIR | output/ | Root directory for all generated output, logs, and task cards |
| DGT_CACHE_DIR | .cache/ | Root directory for enrichment cache files (tool schemas, embeddings, neighbor graphs). See Tool Enrichment Cache |
| DGT_TELEMETRY_DIR | telemetry/ | Directory for events.jsonl and traces.jsonl telemetry files |
| DGT_TELEMETRY_DISABLE | (unset) | Set to any non-empty value to disable telemetry file writing entirely |
| DGT_TELEMETRY_RECORD_PAYLOADS | (unset) | Set to 1 to include prompts and completions in telemetry spans (see Observability) |
| DGT_TELEMETRY_PAYLOAD_MAX_CHARS | 4096 | Maximum characters per payload field when payload recording is enabled |
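The defaults listed above can be mirrored in a helper script, for example to locate output or cache files after a run. This is a sketch of how such lookups might be written, not DiGiT's own resolution code; the function names are hypothetical, and treating an empty value the same as unset for the directory variables is an assumption.

```python
import os
from pathlib import Path

# Defaults copied from the environment variable list above.
DIR_DEFAULTS = {
    "DGT_DATA_DIR": "data/",
    "DGT_OUTPUT_DIR": "output/",
    "DGT_CACHE_DIR": ".cache/",
    "DGT_TELEMETRY_DIR": "telemetry/",
}


def resolve_dirs(environ=None):
    """Map each directory variable to its configured path or documented default."""
    env = os.environ if environ is None else environ
    # Assumption: an empty string falls back to the default, like an unset variable.
    return {name: Path(env.get(name) or default) for name, default in DIR_DEFAULTS.items()}


def telemetry_enabled(environ=None):
    """Telemetry writing is on unless DGT_TELEMETRY_DISABLE holds a non-empty value."""
    env = os.environ if environ is None else environ
    return not env.get("DGT_TELEMETRY_DISABLE")


def payload_max_chars(environ=None):
    """Read the per-field payload limit, defaulting to 4096 characters."""
    env = os.environ if environ is None else environ
    return int(env.get("DGT_TELEMETRY_PAYLOAD_MAX_CHARS") or 4096)
```

Passing a plain dict as environ makes the helpers easy to exercise without touching the real process environment.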

Tool Enrichment Cache

When a task declares tool enrichments (output_parameters, embeddings, neighbors), DiGiT caches the results so subsequent runs do not re-pay the cost of LLM calls or embedding passes.

Cache files are written under:

{DGT_CACHE_DIR}/enrichments/{type}/{fingerprint}.json

The fingerprint is a SHA-256 hash of the tool set and enrichment config (model ID, keep_k, etc.). Two tasks that use the same tools and the same enrichment config hit the same cache file automatically, with no explicit sharing configuration. REST and MCP tools are cached identically to file-backed tools; the cache is keyed by qualified tool name, not by source URL or file path.
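To make the idea of a shared fingerprint concrete, here is a sketch of how a hash over tool names and config could be derived and turned into a cache path. The exact fields DiGiT hashes and how it serializes them are not documented here, so the canonical-JSON scheme and both function names are assumptions for illustration only.

```python
import hashlib
import json
from pathlib import Path


def enrichment_fingerprint(tool_names, config):
    """Hash the qualified tool names plus the enrichment config (illustrative scheme)."""
    # Sorting names and keys makes the hash independent of declaration order.
    payload = {"tools": sorted(tool_names), "config": config}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_path(cache_dir, enrichment_type, fingerprint):
    """Build {DGT_CACHE_DIR}/enrichments/{type}/{fingerprint}.json."""
    return Path(cache_dir) / "enrichments" / enrichment_type / f"{fingerprint}.json"
```

Under this scheme, two tasks that list the same tools in a different order still resolve to the same file, while changing the model ID or keep_k yields a different one.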

Delta-merge: if you add new tools to a registry, only the new tools are computed and appended. Existing cache entries are not discarded.
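The delta-merge behavior can be pictured as a dictionary update that only fills in missing keys. The sketch below assumes cache entries are keyed by qualified tool name, as described above; the function name delta_merge and the compute callback are hypothetical stand-ins for the real enrichment step.

```python
def delta_merge(cached, tool_names, compute):
    """Return cached entries plus freshly computed entries for tools not yet cached."""
    merged = dict(cached)  # existing entries are kept untouched
    newly_computed = []
    for name in tool_names:
        if name not in merged:
            merged[name] = compute(name)  # only new tools pay the enrichment cost
            newly_computed.append(name)
    return merged, newly_computed
```

With a cache holding ns/search and a registry that now also lists ns/weather, compute runs once, for ns/weather only.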

Force refresh: set force: true on an enrichment in the task YAML to bypass the cache load and recompute from scratch (useful when tool descriptions have changed without qualified names changing):

enrichments:
  - type: output_parameters
    force: true
    lm_config:
      type: ollama
      model_id_or_path: granite3.3:8b

Override the cache root by setting DGT_CACHE_DIR in your environment or .env file.

Next steps