Instruction Following¶
This notebook demonstrates how to build a benchmark that uses steering to improve instruction following on the Split-IFEval dataset. It compares four steering pipelines: the unsteered (baseline) model, PASTA, DeAL, and ThinkingIntervention.
Building the use-case¶
The instruction following use-case has already been constructed and is available at aisteer360/evaluation/use_cases/instruction_following/use_case.py. For details on how to construct use cases, please see the tutorial. The instruction following use case is initialized as follows:
from aisteer360.evaluation.use_cases.instruction_following import InstructionFollowing
from aisteer360.evaluation.metrics.custom.instruction_following.strict_instruction import StrictInstruction
from datasets import load_dataset
import json
# load the dataset and look at one example
dataset = load_dataset("ibm-research/Split-IFEval", split="train")
evaluation_data = dataset.to_list()
print(json.dumps(evaluation_data[0], indent=2))
# Define the instruction following use-case with the evaluation data and the metric to be used.
instruction_following = InstructionFollowing(
    evaluation_data=evaluation_data,
    evaluation_metrics=[StrictInstruction()],
    num_samples=50,  # optional: number of evaluation samples to use
)
{
"key": 1000,
"prompt": "Write a summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\".\n\nYour response should follow the instructions below:\n- Write 300+ words\n- Do not use any commas\n- Highlight at least 3 sections that have titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*",
"instruction_id_list": [
"punctuation:no_comma",
"detectable_format:number_highlighted_sections",
"length_constraints:number_words"
],
"kwargs": [
{
"num_bullets": null,
"num_highlights": null,
"relation": null,
"num_words": null,
"capital_relation": null,
"capital_frequency": null,
"num_sentences": null,
"end_phrase": null,
"keyword": null,
"frequency": null,
"prompt_to_repeat": null,
"first_word": null,
"num_paragraphs": null,
"nth_paragraph": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"section_spliter": null,
"num_sections": null,
"postscript_marker": null,
"forbidden_words": null,
"num_placeholders": null,
"language": null,
"keywords": null
},
{
"num_bullets": null,
"num_highlights": 3,
"relation": null,
"num_words": null,
"capital_relation": null,
"capital_frequency": null,
"num_sentences": null,
"end_phrase": null,
"keyword": null,
"frequency": null,
"prompt_to_repeat": null,
"first_word": null,
"num_paragraphs": null,
"nth_paragraph": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"section_spliter": null,
"num_sections": null,
"postscript_marker": null,
"forbidden_words": null,
"num_placeholders": null,
"language": null,
"keywords": null
},
{
"num_bullets": null,
"num_highlights": null,
"relation": "at least",
"num_words": 300,
"capital_relation": null,
"capital_frequency": null,
"num_sentences": null,
"end_phrase": null,
"keyword": null,
"frequency": null,
"prompt_to_repeat": null,
"first_word": null,
"num_paragraphs": null,
"nth_paragraph": null,
"let_relation": null,
"letter": null,
"let_frequency": null,
"section_spliter": null,
"num_sections": null,
"postscript_marker": null,
"forbidden_words": null,
"num_placeholders": null,
"language": null,
"keywords": null
}
],
"separated_prompt": "Write a summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\".",
"instructions": [
"- Write 300+ words",
"- Do not use any commas",
"- Highlight at least 3 sections that have titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*"
],
"original_prompt": "Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*."
}
The goal of this use case is to evaluate the ability of LLMs to follow instructions provided in the prompt. In the example above, the task "Write a 300+ word summary of the wikipedia page..." contains several instructions for the model: a response of at least 300 words, with no commas, and with at least 3 highlighted sections.
Note: The original IFEval dataset embeds the instructions within the prompt/task itself, making it difficult to evaluate certain steering methods (e.g., PASTA, which requires steering attention over only the instruction tokens). We have therefore modified the dataset to extract the instructions from each prompt/task (see https://huggingface.co/datasets/ibm-research/Split-IFEval for the modified dataset).
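The split fields are plain columns on each row, so the example above can be inspected directly:

# Inspect the split fields of the example row: the bare task and the extracted instructions.
example = evaluation_data[0]
print(example["separated_prompt"])  # task without the instructions
print(example["instructions"])      # list of extracted instruction strings
print(example["original_prompt"])   # the original (unsplit) IFEval prompt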
We use the IFEval metric, StrictInstruction, which returns two values (an example call is sketched below the list):
- prompt-level accuracy: the fraction of prompts for which all instructions were followed, and
- instruction-level accuracy: the fraction of individual instructions, across all prompts, that were followed.
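As a quick sanity check, the metric can also be called directly on a single prediction. The sketch below reuses the StrictInstruction import and evaluation_data from above, and assumes the same prediction fields and compute() signature used by the DeAL reward function defined later in this notebook; the response string is a placeholder.

# Sketch: score a placeholder response for the example row and read back both accuracy values.
example = evaluation_data[0]
metric = StrictInstruction()
scores = metric.compute(
    predictions=[{
        "prompt": example["prompt"],
        "response": "*Early life* ...",  # placeholder model response
        "instruction_id_list": example["instruction_id_list"],
        "instructions": example["instructions"],
        "kwargs": example["kwargs"],
    }]
)
print(scores)  # {"strict_prompt_accuracy": ..., "strict_instruction_accuracy": ...}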
Defining the controls¶
This demonstration compares the baseline (unsteered) model against three steering methods: PASTA, DeAL, and ThinkingIntervention.
Defining the PASTA control¶
For PASTA, we apply the attention steering bias to layers 8 and 9 and set the scaling coefficient to alpha=0.01. We also set scale_position="exclude" so that the scaling is applied to the tokens outside the target substrings, shifting attention toward the instructions.
from aisteer360.algorithms.state_control.pasta.control import PASTA

pasta = PASTA(
    head_config=[8, 9],        # apply attention steering in layers 8 and 9
    alpha=0.01,                # scaling coefficient for the attention bias
    scale_position="exclude",  # scale the tokens outside the target substrings
)
Defining the DeAL control¶
With DeAL, we define the number of lookahead tokens and beams. Starting from init_beams=8 candidate beams with a lookahead of 50 tokens, at each step we keep the top-2 beams (based on their reward scores) and repeat for at most max_iterations=20 iterations.
We use the StrictInstruction metric as the reward function, which rewards beams that best satisfy the instructions.
from aisteer360.evaluation.metrics.custom.instruction_following.strict_instruction import StrictInstruction
def strict_reward(prompt: str, responses: list[str], params: dict) -> list[float]:
    """DeAL reward function based on IFEval's strict instruction metric.

    Used for beam selection in the instruction following benchmark use-case.

    Args:
        prompt (str): Input + generation until current step
        responses (list[str]): Beam responses for the given prompt
        params (dict): DeAL parameters and input dataset fields

    Returns:
        list[float]: Reward scores for each beam
    """
    metric = StrictInstruction()
    assert all(
        key in params for key in ["instructions", "instruction_id_list", "kwargs"]
    ), "Missing parameters for evaluation"

    # Score each beam independently; its strict instruction accuracy is the reward.
    accuracies = [
        metric.compute(
            predictions=[
                {
                    "prompt": prompt,
                    "response": response,
                    "instruction_id_list": params["instruction_id_list"],
                    "instructions": params["instructions"],
                    "kwargs": params["kwargs"],
                }
            ]
        )["strict_instruction_accuracy"]
        for response in responses
    ]
    return accuracies
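For illustration, the reward function can be invoked on hypothetical beams. The prompt, responses, and params below are made up for this example; with a single "no comma" instruction, the comma-free beam should receive the higher reward.

# Illustrative call with made-up inputs: the comma-free beam should score higher.
rewards = strict_reward(
    prompt="Describe the ocean. Do not use any commas.",
    responses=[
        "The ocean is vast, deep, and full of life.",
        "The ocean is vast and deep and full of life.",
    ],
    params={
        "instructions": ["- Do not use any commas"],
        "instruction_id_list": ["punctuation:no_comma"],
        "kwargs": [{}],
    },
)
print(rewards)  # e.g. [0.0, 1.0]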
from aisteer360.algorithms.output_control.deal.control import DeAL

deal = DeAL(
    lookahead=50,       # tokens generated per beam before scoring
    init_beams=8,       # candidate beams generated at each step
    topk=2,             # beams kept after scoring
    max_iterations=20,  # maximum number of lookahead/selection rounds
    reward_func=strict_reward,
)
Defining the Thinking Intervention control¶
The ThinkingIntervention control adds a task-specific string (an intervention) to the model response just before generation. For this use case, we have defined the following intervention function to remind the model to follow the specified instructions.
def instruction_following_intervention(prompt: str, params: dict) -> str:
    """Intervention that restates the instructions at the start of the model's thinking."""
    intervention = (
        "I will first think using the <think> and </think> tags and then provide the final answer after that.\n"
        "<think> I should ensure that the answer follows these instructions. "
    )
    # Strip the leading dashes from the instruction bullets and join them into one sentence.
    modified_instr = [instr.replace("-", "") for instr in params["instructions"]]
    intervention += " and".join(modified_instr)
    return prompt + intervention + "\n"
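To see what the intervention produces, the function can be applied to the example row from earlier. This is purely for inspection (the benchmark performs this step automatically); here we pass the separated prompt for illustration.

# Inspect the intervention text produced for the example row.
example = evaluation_data[0]
print(instruction_following_intervention(
    example["separated_prompt"],
    {"instructions": example["instructions"]},
))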
Now we can define the thinking intervention control:
from aisteer360.algorithms.output_control.thinking_intervention.control import ThinkingIntervention
thinking_intervention = ThinkingIntervention(
intervention=instruction_following_intervention
)
Instantiating (and running) the benchmark¶
The benchmark is instantiated with the instruction_following use case and the three controls (above) plus a baseline (empty pipeline). The base model is Qwen/Qwen2.5-1.5B-Instruct.
Note: PASTA, DeAL, and ThinkingIntervention each require specific arguments at runtime. PASTA needs to know which substrings to emphasize (here, the instructions from the dataset), DeAL requires the dataset fields needed to compute the reward, and the ThinkingIntervention helper requires the instructions via its params argument. These arguments, termed runtime_kwargs, are populated from the Split-IFEval dataset via the runtime_overrides parameter, which tells the benchmark how to fill the runtime_kwargs from the columns of the evaluation dataset.
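To make the mapping concrete, for the example row shown earlier the runtime_overrides below would yield roughly the following per-example runtime_kwargs. This is a sketch of the resulting dictionaries for illustration, not an API call.

# Sketch (illustration only): per-example runtime_kwargs derived from one Split-IFEval row
# via the runtime_overrides mapping passed to the benchmark below.
example = evaluation_data[0]
pasta_runtime_kwargs = {"substrings": example["instructions"]}
deal_runtime_kwargs = {
    "reward_params": {
        "instruction_id_list": example["instruction_id_list"],
        "instructions": example["instructions"],
        "kwargs": example["kwargs"],
    }
}
thinking_runtime_kwargs = {"params": {"instructions": example["instructions"]}}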
from aisteer360.evaluation.benchmark import Benchmark
from transformers import logging
logging.set_verbosity_error()
benchmark = Benchmark(
use_case=instruction_following,
base_model_name_or_path="Qwen/Qwen2.5-1.5B-Instruct",
steering_pipelines={
"baseline": [], # no steering
"pasta": [pasta],
"deal": [deal],
"thinking_intervention": [thinking_intervention]
},
runtime_overrides={
"PASTA": {"substrings": "instructions"},
"DeAL": {
"reward_params": {
"instruction_id_list": "instruction_id_list",
"instructions": "instructions",
"kwargs": "kwargs",
}
},
"ThinkingIntervention": {"params": {"instructions": "instructions"}},
},
gen_kwargs={
"max_new_tokens": 1024,
"do_sample": False,
"output_attentions": True, # mandatory for PASTA
},
hf_model_kwargs={"attn_implementation": "eager"}, # mandatory for PASTA
)
profiles = benchmark.run()
benchmark.export(profiles=profiles, save_dir="./results")
Running pipeline: baseline... done.
Running pipeline: pasta... done.
Running pipeline: deal... done.
Running pipeline: thinking_intervention... done.
Inspecting the profiles¶
Each steering pipeline in the benchmark yields an evaluation profile. Each profile contains the metric values computed by the metrics passed to the use case, in this case StrictInstruction.
print(json.dumps(profiles['baseline']['evaluations'], indent=2))
{
"StrictInstruction": {
"strict_prompt_accuracy": 0.5,
"strict_instruction_accuracy": 0.5131578947368421
}
}
print(json.dumps(profiles['pasta']['evaluations'], indent=2))
{
"StrictInstruction": {
"strict_prompt_accuracy": 0.2,
"strict_instruction_accuracy": 0.27631578947368424
}
}
print(json.dumps(profiles['deal']['evaluations'], indent=2))
{
"StrictInstruction": {
"strict_prompt_accuracy": 0.2,
"strict_instruction_accuracy": 0.3157894736842105
}
}
print(json.dumps(profiles['thinking_intervention']['evaluations'], indent=2))
{
"StrictInstruction": {
"strict_prompt_accuracy": 0.26,
"strict_instruction_accuracy": 0.4473684210526316
}
}
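For a side-by-side view, the scores of all pipelines can be collected into a small comparison. This is a convenience snippet that only reads the profiles structure shown above, assuming profiles behaves like a dictionary keyed by pipeline name (as the indexing above suggests).

# Print a compact comparison of the two accuracy values for every pipeline.
for name, profile in profiles.items():
    scores = profile["evaluations"]["StrictInstruction"]
    print(
        f"{name:>22}  prompt: {scores['strict_prompt_accuracy']:.2f}  "
        f"instruction: {scores['strict_instruction_accuracy']:.2f}"
    )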