Loading Seed Examples from a File

In synthetic data generation, seed examples (also called in-context learning examples) are provided in the prompt given to the teacher model. DiGiT's GenerationTask base class supports two ways to supply them: inline in the task YAML, or from an external file in .jsonl, .json, or .parquet format.

This tutorial continues with the misconceptions databuilder built in Building a Generation Databuilder. The seed examples are currently defined inline in the task YAML at tasks/public/examples/misconceptions/task.yaml:

tasks/public/examples/misconceptions/task.yaml

######################################################
#                   MANDATORY FIELDS
######################################################
task_name: public/examples/misconceptions
task_description: Generate misconception-correction pairs for training a model to identify and correct misinformation.
created_by: IBM

data_builder: public/examples/misconceptions

######################################################
#                   RESERVED FIELDS
######################################################
seed_examples:
  - misconception: Lightning never strikes the same place twice.
    correction: Lightning frequently strikes the same place multiple times. Tall structures like the Empire State Building are struck dozens of times per year.
  - misconception: Humans only use 10 percent of their brains.
    correction: Brain imaging studies show that virtually all regions of the brain are active at some point, and most are active almost all the time.
  - misconception: Swallowed chewing gum stays in your stomach for seven years.
    correction: While gum base is not digestible, it passes through the digestive system and is excreted within a few days, just like other indigestible matter.
  - misconception: Goldfish have a memory span of only three seconds.
    correction: Research has shown that goldfish can remember things for months and can be trained to navigate mazes and recognize their owners.
  - misconception: The Great Wall of China is visible from space with the naked eye.
    correction: The Great Wall is too narrow to be seen from low Earth orbit without aid. Astronauts have confirmed this repeatedly.

Keeping seed examples in the task YAML works well for small sets. For larger collections, or when you want to share seeds across multiple tasks, an external file is easier to manage.
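If the inline seeds are already available as a list of dictionaries (for example, after reading the task YAML with a YAML parser), moving them to an external file is a small serialization step. The following is a minimal sketch using only the Python standard library. The helper name to_jsonl is hypothetical and not part of DiGiT, and only two of the five seeds are shown.

```python
import json


def to_jsonl(examples):
    """Serialize a list of dicts to JSONL text: one JSON object per line."""
    return "".join(
        json.dumps(example, ensure_ascii=False) + "\n" for example in examples
    )


# Two of the inline seeds, written as the dicts a YAML parser would return.
inline_seeds = [
    {
        "misconception": "Lightning never strikes the same place twice.",
        "correction": "Lightning frequently strikes the same place multiple times. "
        "Tall structures like the Empire State Building are struck dozens of times per year.",
    },
    {
        "misconception": "Humans only use 10 percent of their brains.",
        "correction": "Brain imaging studies show that virtually all regions of the brain "
        "are active at some point, and most are active almost all the time.",
    },
]

jsonl_text = to_jsonl(inline_seeds)
print(jsonl_text, end="")
```

Writing jsonl_text to a file with a .jsonl extension gives a seed file in the one-object-per-line layout used in the rest of this tutorial.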

Step 1: Create the seed file

Save the following as data/public/examples/misconceptions/seed_examples.jsonl:

data/public/examples/misconceptions/seed_examples.jsonl

{"misconception": "Lightning never strikes the same place twice.", "correction": "Lightning frequently strikes the same place multiple times. Tall structures like the Empire State Building are struck dozens of times per year."}
{"misconception": "Humans only use 10 percent of their brains.", "correction": "Brain imaging studies show that virtually all regions of the brain are active at some point, and most are active almost all the time."}
{"misconception": "Swallowed chewing gum stays in your stomach for seven years.", "correction": "While gum base is not digestible, it passes through the digestive system and is excreted within a few days, just like other indigestible matter."}
{"misconception": "Goldfish have a memory span of only three seconds.", "correction": "Research has shown that goldfish can remember things for months and can be trained to navigate mazes and recognize their owners."}
{"misconception": "The Great Wall of China is visible from space with the naked eye.", "correction": "The Great Wall is too narrow to be seen from low Earth orbit without aid. Astronauts have confirmed this repeatedly."}

Each line is a JSON object with the same keys (misconception, correction) that the task's instantiate_input_example method expects.
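Because a malformed line or a missing key may only surface once generation starts, it can help to check the file before running the task. The following is a minimal sketch, not a DiGiT utility: check_seed_lines and REQUIRED_KEYS are hypothetical names, and the required keys are simply the two used in this tutorial.

```python
import json

# Keys used by this tutorial's seed file (an assumption for other tasks).
REQUIRED_KEYS = {"misconception", "correction"}


def check_seed_lines(lines):
    """Return (line_number, problem) tuples; an empty list means every line passed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            record = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
    return problems


# In-memory sample: the second line deliberately lacks the "correction" key.
sample = [
    '{"misconception": "Lightning never strikes the same place twice.", '
    '"correction": "Lightning frequently strikes the same place multiple times."}',
    '{"misconception": "Humans only use 10 percent of their brains."}',
]
for number, problem in check_seed_lines(sample):
    print(f"line {number}: {problem}")
```

To check the real file, pass an open file handle instead of the sample list, for example check_seed_lines(open("data/public/examples/misconceptions/seed_examples.jsonl", encoding="utf-8")).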

Step 2: Update the task YAML

Replace the seed_examples block with a seed_datastore reference:

tasks/public/examples/misconceptions/task.yaml

######################################################
#                   MANDATORY FIELDS
######################################################
task_name: public/examples/misconceptions
task_description: Generate misconception-correction pairs for training a model to identify and correct misinformation.
created_by: IBM

data_builder: public/examples/misconceptions

######################################################
#                   RESERVED FIELDS
######################################################
seed_datastore:
  type: default
  data_path: ${DGT_DATA_DIR}/public/examples/misconceptions/seed_examples.jsonl

${DGT_DATA_DIR} resolves to the data/ directory at the root of the repository by default. To point it at a different location, set the DGT_DATA_DIR environment variable before running the task.
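To make the substitution concrete, the sketch below mimics the behavior described above with the Python standard library. It is an illustration only and not DiGiT's actual resolver: resolve_data_path is a hypothetical helper, and the relative default "data" stands in for the repository's data/ directory.

```python
import os
from string import Template


def resolve_data_path(raw_path, environ=None, default_data_dir="data"):
    """Substitute ${DGT_DATA_DIR} in raw_path, using default_data_dir when it is unset."""
    env = os.environ if environ is None else environ
    data_dir = env.get("DGT_DATA_DIR", default_data_dir)
    return Template(raw_path).substitute(DGT_DATA_DIR=data_dir)


raw = "${DGT_DATA_DIR}/public/examples/misconceptions/seed_examples.jsonl"

# Variable unset: falls back to the default directory.
print(resolve_data_path(raw, environ={}))

# Variable set: the override wins.
print(resolve_data_path(raw, environ={"DGT_DATA_DIR": "/mnt/shared/seeds"}))
```

Passing environ explicitly keeps the example independent of the shell it runs in; omitting it reads the real process environment.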

Step 3: Run it

The run command is unchanged:

python -m fms_dgt.public \
  --task-paths ./tasks/public/examples/misconceptions/task.yaml \
  --num-outputs-to-generate 20 \
  --restart

DiGiT loads the seed examples from the file at startup and uses them exactly as it would inline examples.

Next steps

To switch to a different LM engine, see Changing the Language Model Engine.
To add a validator that filters low-quality outputs, see Creating a Validator.