InstructLab: Skills and Knowledge

DiGiT ships two databuilders that implement the LAB method for generating instruction-tuning data. Both are ready to run with the included example tasks.

Skills generates instruction-response pairs for a capability you want to teach, such as writing, editing, or reasoning.
Knowledge generates question-answer pairs grounded in a reference document, for teaching factual or domain-specific content.

Note

These databuilders are inspired by InstructLab's implementation of the LAB method and are intended for demonstration and experimentation. They may differ from InstructLab's production pipeline.

Skills: freeform debate generation

This example generates debate-style responses in which the model argues a topic from multiple perspectives.

Run it

python -m fms_dgt.public \
  --task-paths ./tasks/public/instructlab/skills/writing/freeform/debate/task.yaml \
  --num-outputs-to-generate 20 \
  --restart

Task specification

The task file is at tasks/public/instructlab/skills/writing/freeform/debate/task.yaml. Each seed example provides a question (the debate prompt) and an answer (a multi-perspective response):

task_name: instructlab/skills/writing/freeform/debate
task_description: To teach a language model to formulate debate points
created_by: IBM Research
data_builder: instructlab/skills

seed_examples:
  - question: >
      Debate the merits and drawbacks of implementing a universal basic income
      between an economist, a sociologist, and a policy maker.
    answer: >
      Economist: "Implementing a universal basic income (UBI) could significantly
      reduce poverty rates and provide a financial safety net..."

      Sociologist: "From a sociological perspective, UBI has the potential to
      address income inequality and promote social cohesion..."

      Policy Maker: "As a policy maker, I see the appeal of UBI in its potential
      to alleviate poverty and simplify the welfare system..."

How it works

The skills databuilder runs a multi-stage pipeline:

Generation: the model produces new question-answer pairs in the style of the seed examples.
Validation: an LM judge scores each pair for coherence and relevance, and low-scoring pairs are filtered out.
Tagging and deduplication: each output is tagged for difficulty and quality, and near-duplicates are removed.
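
The validation and deduplication stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DiGiT's implementation: the function names, the `score_fn` callback standing in for the LM judge, the threshold, and the use of `difflib` string similarity for near-duplicate detection are all hypothetical.

```python
from difflib import SequenceMatcher


def filter_by_score(pairs, score_fn, threshold):
    """Keep only the pairs whose judge score meets the threshold.

    score_fn is a hypothetical stand-in for an LM judge.
    """
    return [pair for pair in pairs if score_fn(pair) >= threshold]


def drop_near_duplicates(pairs, max_similarity=0.9):
    """Keep a pair only if its question is not too similar to one already kept.

    Assumption: similarity is measured with difflib on lowercased questions.
    """
    kept = []
    for pair in pairs:
        question = pair["question"].lower()
        is_new = all(
            SequenceMatcher(None, question, other["question"].lower()).ratio()
            < max_similarity
            for other in kept
        )
        if is_new:
            kept.append(pair)
    return kept
```

Running the filter before deduplication means that when two pairs are near-duplicates, a low-quality one cannot displace a good one.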

Sample output

Generated data lands in output/instructlab/skills/writing/freeform/debate/final_data.jsonl:

{
  "task_name": "instructlab/skills/writing/freeform/debate",
  "is_seed": false,
  "instruction": "Discuss the pros and cons of remote work from the perspective of an employee, a manager, and an HR professional.",
  "input": "",
  "output": "Employee: Remote work offers flexibility and eliminates commuting time, boosting productivity for self-motivated individuals...\n\nManager: Managing remote teams requires strong communication practices and clear goal-setting...\n\nHR Professional: From a talent acquisition standpoint, remote-first policies significantly expand the candidate pool..."
}
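
To inspect the generated file, a short loader like the one below may help. It is a minimal sketch that assumes the file holds one JSON object per line (the record above is pretty-printed for readability) and that seed examples, if present, carry `"is_seed": true`. The helper names `load_jsonl` and `generated_only` are illustrative and not part of DiGiT.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return one dict per non-empty line of a JSON Lines file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def generated_only(records):
    """Drop seed examples, keeping only model-generated records."""
    return [record for record in records if not record.get("is_seed", False)]


if __name__ == "__main__":
    path = Path("output/instructlab/skills/writing/freeform/debate/final_data.jsonl")
    if path.exists():
        rows = generated_only(load_jsonl(path))
        print(f"{len(rows)} generated records")
        for row in rows[:3]:
            # Show the first 80 characters of each instruction.
            print(row["instruction"][:80])
```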

Knowledge: photosynthesis QA generation

This example generates question-answer pairs grounded in a biology document on photosynthesis.

Prerequisites

The example requires the photosynthesis document, which ships with the repository at:

data/public/instructlab/knowledge/textbook/science/biology/photosynthesis/photosynthesis.md

No download required.

Run it

python -m fms_dgt.public \
  --task-paths ./tasks/public/instructlab/knowledge/textbook/science/biology/photosynthesis/task.yaml \
  --restart

Task specification

The task file is at tasks/public/instructlab/knowledge/textbook/science/biology/photosynthesis/task.yaml. Key fields:

task_name: instructlab/knowledge/textbook/science/biology/photosynthesis
task_description: To teach a language model about photosynthesis
created_by: IBM Research
data_builder: instructlab/knowledge

seed_examples:
  - question: What is respiration?
    answer: The word respiration is commonly used to describe the process of breathing in oxygen and breathing out carbon dioxide.
  - question: What is an ecosystem?
    answer: An ecosystem is a community of organisms and their physical environment interacting together.
  - question: What is metabolism?
    answer: Metabolism is the chemical reactions in the body's cells that change food into energy.

include:
  documents:
    photosynthesis: ${DGT_DATA_DIR}/public/instructlab/knowledge/textbook/science/biology/photosynthesis/photosynthesis.md

domain: biology
chunk_size: 800
question_style: FRQ
criteria:
  - faithfulness
  - relevancy
  - question_verification

The include.documents directive loads the reference document. DiGiT chunks it automatically according to chunk_size (in tokens) and uses each chunk as grounding context for question generation.
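
As a rough picture of what chunking by chunk_size means, the sketch below splits a document into consecutive pieces of at most chunk_size tokens. It is not DiGiT's chunker: the function name `chunk_text` is hypothetical, and it treats whitespace-separated words as tokens, which only approximates a real tokenizer.

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size tokens.

    Assumption: a "token" is a whitespace-separated word. A real tokenizer
    would usually produce more tokens than this for the same text.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e", 2):
        print(chunk)  # prints "a b", then "c d", then "e"
```

A smaller chunk_size yields more, narrower grounding contexts, while a larger one yields fewer contexts that each cover more of the document.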

How it works

The knowledge databuilder runs a multi-stage pipeline:

Generation: the model generates questions grounded in each document chunk, using the seed examples as style references.
Validation: an LM judge checks each QA pair against the document for faithfulness and relevancy, and verifies that the question is answerable.
Tagging and deduplication: outputs are tagged and near-duplicates are removed.
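
One way to picture criteria-based validation is as a set of named checks that a QA pair must all pass. The sketch below is illustrative only: the two judge functions are toy heuristics standing in for LM judges, and the names `failed_criteria`, `toy_faithfulness`, and `toy_question_verification` are hypothetical rather than DiGiT APIs.

```python
def failed_criteria(pair, context, judges):
    """Return the names of the criteria that the pair does not satisfy."""
    return [name for name, judge in judges.items() if not judge(pair, context)]


# Toy stand-ins for LM judges (hypothetical; a real judge would prompt a model).
def toy_faithfulness(pair, context):
    # Every word of the answer must appear somewhere in the context.
    context_words = set(context.lower().split())
    return all(word in context_words for word in pair["answer"].lower().split())


def toy_question_verification(pair, context):
    # The question must end with a question mark.
    return pair["question"].strip().endswith("?")


JUDGES = {
    "faithfulness": toy_faithfulness,
    "question_verification": toy_question_verification,
}


if __name__ == "__main__":
    context = "plants convert light into chemical energy"
    pair = {"question": "What do plants convert?", "answer": "light into chemical energy"}
    # A pair is kept only when no criterion fails.
    print(failed_criteria(pair, context, JUDGES))
```

Returning the failed criterion names, rather than a single pass/fail flag, makes it easier to see why a pair was rejected.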

Sample output

Generated data lands in output/instructlab/knowledge/textbook/science/biology/photosynthesis/final_data.jsonl:

{
  "task_name": "instructlab/knowledge/textbook/science/biology/photosynthesis",
  "is_seed": false,
  "question": "What role does photosynthesis play in the global carbon cycle?",
  "answer": "Photosynthesis removes carbon dioxide from the atmosphere and converts it into carbohydrates stored in plant tissue. This process counteracts the carbon dioxide released by burning fossil fuels, making photosynthesis a critical regulator of atmospheric carbon and therefore of global climate.",
  "domain": "biology",
  "context": "..."
}

Using a different model or provider

Both databuilders default to granite4:3b via Ollama. To switch to a cloud provider, pass a config file:

python -m fms_dgt.public \
  --task-paths ./tasks/public/instructlab/skills/writing/freeform/debate/task.yaml \
  --config-path ./configs/public/instructlab/watsonx_skills.yaml \
  --num-outputs-to-generate 20 \
  --restart

See Changing the LM Engine for details on all supported providers.

Next steps

Add your own documents to the knowledge databuilder by creating a new task YAML under tasks/public/instructlab/knowledge/.
Add your own skills by creating a new task YAML under tasks/public/instructlab/skills/.
Read the Skills README and Knowledge README for the full list of task parameters.