Skip to content

Datastores

A Datastore is DiGiT's storage abstraction. It knows how to read data from a file or in-memory source, write data in a specified format, and report whether previously generated data already exists (for resuming interrupted runs).

Tasks create and manage their own datastores. You do not instantiate datastores directly in most cases. Instead, you configure them through the task YAML and the framework wires them up at runtime.

What gets stored

DiGiT creates the following stores for each task under <output_dir>/<task_name>/:

Store File Contains
data data.jsonl Records generated during the main loop, appended each iteration
final_data final_data.jsonl Post-processed data; identical to data if no postprocessors are defined
postproc_data_N postproc_data_1.jsonl, etc. Data between postprocessing epochs
formatted_data formatted_data.jsonl Output after a formatter block runs
task_card task_card/task_card.jsonl Run configuration and metadata
task_results task_results.jsonl Execution metadata: start time, record counts, custom metrics
logs logs.jsonl Structured log records for this task
Block stores blocks/<block_name>/*.jsonl Records rejected by validator blocks

The output directory defaults to output/ at the repository root. Set DGT_OUTPUT_DIR to change it, or override per task via runner_config.output_dir in the task YAML.

Configuring a datastore

The default datastore type handles the common cases. Override it in the task YAML under datastore:

datastore:
  type: default
  output_data_format: parquet # write final data as Parquet instead of JSONL

To configure where seed examples are loaded from, use seed_datastore:

seed_datastore:
  type: default
  data_path: ${DGT_DATA_DIR}/public/examples/qa/seed_examples.jsonl

${DGT_DATA_DIR} resolves to the data/ directory at the repository root by default. You can override it by setting the environment variable.

For transformation tasks, the input dataset is configured under data:

data:
  type: default
  data_path: ${DGT_DATA_DIR}/public/examples/fact_triples/paragraphs.jsonl

Default datastore

The default datastore (type: default) supports reading from and writing to files in the following formats:

Readable formats: .jsonl, .json, .yaml, .parquet, .csv, .txt, .md, and HuggingFace dataset directories.

Writable formats: jsonl (default), parquet.

Large files (JSONL, Parquet, CSV) are loaded lazily to avoid reading everything into memory at once.

Key configuration fields:

Field Default Description
data_path None File path or glob pattern to read from. Supports ${VAR} expansion.
output_data_format jsonl Format for writing output: jsonl or parquet.
data_formats None Filter loaded files by extension (e.g., [".jsonl", ".json"]).
data_split train HuggingFace dataset split name, when loading from a dataset directory.

Glob patterns are supported in data_path (for example, data/chunks/**.jsonl loads all matching files). Multiple files are concatenated in the order they are found.

Multi-target datastore

The multi_target datastore writes to multiple backends simultaneously. The primary store is the read source; secondary stores receive writes on a best-effort basis (failures are logged but do not stop the run).

formatted_datastore:
  type: multi_target
  primary:
    type: default
    output_data_format: jsonl
  additional:
    - type: default
      output_dir: /backup/outputs
      output_data_format: parquet

This is useful when you want to archive output to a secondary location (remote storage, a different format) without making it a hard dependency of the run.

Implementing a custom datastore

All datastores inherit from Datastore and must implement:

  • exists(): return True if previously persisted data is present
  • clear(): delete all stored data
  • save_data(data): write records to the backing store
  • load_iterators(): return a list of iterators over stored records
  • load_data(): return all stored records
  • close(): release any open handles

Register with @register_datastore("my_type") and reference by type: my_type in YAML.