Datastores
A Datastore is DiGiT's storage abstraction. It knows how to read data from a file or in-memory source, write data in a specified format, and report whether previously generated data already exists (for resuming interrupted runs).
Tasks create and manage their own datastores; in most cases you do not instantiate one directly. Instead, you configure datastores through the task YAML and the framework wires them up at runtime.
What gets stored
DiGiT creates the following stores for each task under <output_dir>/<task_name>/:
| Store | File | Contains |
|---|---|---|
| data | data.jsonl | Records generated during the main loop, appended each iteration |
| final_data | final_data.jsonl | Post-processed data; identical to data if no postprocessors are defined |
| postproc_data_N | postproc_data_1.jsonl, etc. | Data between postprocessing epochs |
| formatted_data | formatted_data.jsonl | Output after a formatter block runs |
| task_card | task_card/task_card.jsonl | Run configuration and metadata |
| task_results | task_results.jsonl | Execution metadata: start time, record counts, custom metrics |
| logs | logs.jsonl | Structured log records for this task |
| Block stores | blocks/<block_name>/*.jsonl | Records rejected by validator blocks |
The output directory defaults to output/ at the repository root. Set DGT_OUTPUT_DIR to change it, or override per task via runner_config.output_dir in the task YAML.
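As a sketch of the per-task override (the runner_config.output_dir field is described above; the path shown is illustrative):

```yaml
# Per-task override of the output directory; the path is an example.
runner_config:
  output_dir: /data/digit_runs
```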
Configuring a datastore
The default datastore type handles the common cases. Override it in the task YAML under datastore:
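For example, a minimal override that switches the main output to Parquet might look like this (a sketch using only the type and output_data_format fields documented below):

```yaml
datastore:
  type: default
  output_data_format: parquet
```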
To configure where seed examples are loaded from, use seed_datastore:
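A sketch, assuming seed examples live in a JSONL file under the data directory (the seeds/my_task_seeds.jsonl path is illustrative):

```yaml
seed_datastore:
  type: default
  data_path: ${DGT_DATA_DIR}/seeds/my_task_seeds.jsonl
```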
${DGT_DATA_DIR} resolves to the data/ directory at the repository root by default. You can override it by setting the environment variable.
For transformation tasks, the input dataset is configured under data:
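A sketch of an input block for a transformation task (the raw/input_records.jsonl path is illustrative):

```yaml
data:
  type: default
  data_path: ${DGT_DATA_DIR}/raw/input_records.jsonl
```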
Default datastore
The default datastore (type: default) supports reading from and writing to files in the following formats:
Readable formats: .jsonl, .json, .yaml, .parquet, .csv, .txt, .md, and HuggingFace dataset directories.
Writable formats: jsonl (default), parquet.
Large files (JSONL, Parquet, CSV) are loaded lazily to avoid reading everything into memory at once.
Key configuration fields:
| Field | Default | Description |
|---|---|---|
| data_path | None | File path or glob pattern to read from. Supports ${VAR} expansion. |
| output_data_format | jsonl | Format for writing output: jsonl or parquet. |
| data_formats | None | Filter loaded files by extension (e.g., [".jsonl", ".json"]). |
| data_split | train | HuggingFace dataset split name, when loading from a dataset directory. |
Glob patterns are supported in data_path (for example, data/chunks/**.jsonl loads all matching files). Multiple files are concatenated in the order they are found.
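Putting the fields together, a sketch that loads every JSONL chunk under a directory (paths illustrative):

```yaml
datastore:
  type: default
  data_path: ${DGT_DATA_DIR}/chunks/**.jsonl
  data_formats: [".jsonl"]
```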
Multi-target datastore
The multi_target datastore writes to multiple backends simultaneously. The primary store is the read source; secondary stores receive writes on a best-effort basis (failures are logged but do not stop the run).
```yaml
formatted_datastore:
  type: multi_target
  primary:
    type: default
    output_data_format: jsonl
  additional:
    - type: default
      output_dir: /backup/outputs
      output_data_format: parquet
```
This is useful when you want to archive output to a secondary location (remote storage, a different format) without making it a hard dependency of the run.
Implementing a custom datastore
All datastores inherit from Datastore and must implement:
- exists(): return True if previously persisted data is present
- clear(): delete all stored data
- save_data(data): write records to the backing store
- load_iterators(): return a list of iterators over stored records
- load_data(): return all stored records
- close(): release any open handles
Register with @register_datastore("my_type") and reference by type: my_type in YAML.
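A minimal sketch of the required interface, backed by a JSONL file. In DiGiT the class would inherit from Datastore and carry the @register_datastore("my_type") decorator; both are omitted here so the sketch runs standalone, and the class name is illustrative.

```python
# Sketch of a custom datastore implementing the methods listed above.
# In DiGiT this would inherit from Datastore and be registered with
# @register_datastore("my_type"); omitted so the sketch is self-contained.
import json
import os
from typing import Any, Dict, Iterator, List


class JsonlDatastore:
    """Minimal JSONL-backed datastore (illustrative, not the framework's)."""

    def __init__(self, path: str) -> None:
        self.path = path

    def exists(self) -> bool:
        # True if previously persisted data is present (non-empty file).
        return os.path.exists(self.path) and os.path.getsize(self.path) > 0

    def clear(self) -> None:
        # Delete all stored data.
        if os.path.exists(self.path):
            os.remove(self.path)

    def save_data(self, data: List[Dict[str, Any]]) -> None:
        # Append records to the backing store, one JSON object per line.
        with open(self.path, "a", encoding="utf-8") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")

    def load_iterators(self) -> List[Iterator[Dict[str, Any]]]:
        # Return a list of iterators over stored records.
        return [iter(self.load_data())]

    def load_data(self) -> List[Dict[str, Any]]:
        # Return all stored records.
        if not self.exists():
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def close(self) -> None:
        # Release any open handles (none are held open here).
        pass
```

A real implementation would add error handling and honour the framework's configuration fields, but the shape of the six methods is the contract the task runner relies on.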