Instruction Following

In this notebook, we study a model's instruction-following ability across a range of instruction types. We also examine whether steering the model to follow instructions more closely affects its overall response quality.

Runtime Estimate

Estimated Time: 20-25 minutes
Device: NVIDIA H100 GPU (80GB VRAM)

Times are approximate and vary based on dataset size, number of sweeps, and model configuration. Adjust parameters in the cells below to modify runtime.

Setup

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from pathlib import Path
from datasets import load_dataset
from transformers import logging as hf_logging

from aisteer360.algorithms.state_control.pasta.control import PASTA
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.evaluation.use_cases.instruction_following import InstructionFollowing
from aisteer360.evaluation.metrics.custom.instruction_following.strict_instruction import StrictInstruction
from aisteer360.evaluation.metrics.generic.reward_score import RewardScore
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils.data_utils import (
    flatten_profiles,
    summarize_by_config,
    get_param_values,
    build_per_example_df,
    to_jsonable,
)
from aisteer360.evaluation.utils.viz_utils import (
    plot_metric_heatmap,
    plot_sensitivity,
    plot_tradeoff,
)

hf_logging.set_verbosity_error()

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"

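# directory for saving figures (local to this notebook)
# __file__ is undefined in a notebook kernel, so fall back to a path relative to the working directory
NOTEBOOK_DIR = Path(__file__).parent if "__file__" in dir() else Path.cwd() / "examples/notebooks/benchmark_instruction_following"
FIGURE_DIR = NOTEBOOK_DIR / "figures"
# parents=True also creates missing intermediate directories
FIGURE_DIR.mkdir(parents=True, exist_ok=True)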

Data preparation

A model can be prompted with countless types of instructions. To better understand a model's instruction-following ability, we examine its behavior across a specific set of instruction types, as organized by the IFEval dataset. For this study we use our modified version of IFEval, termed Split-IFEval, in which the instructions are explicitly extracted from the prompt; this makes it easier to build interventions that act directly on the instruction tokens.

In [2]:

ifeval = load_dataset("ibm-research/Split-IFEval")
ifeval_df = ifeval["train"].to_pandas()

cols = ["instructions", "instruction_id_list", "kwargs"]
for col in cols:
    ifeval_df[col] = ifeval_df[col].apply(
        lambda x: x.tolist() if isinstance(x, np.ndarray) else x
    )

ifeval_df

Out[2]:

	key	prompt	instruction_id_list	kwargs	separated_prompt	instructions	original_prompt
0	1000	Write a summary of the wikipedia page "https:/...	[punctuation:no_comma, detectable_format:numbe...	[{'num_bullets': None, 'num_highlights': None,...	Write a summary of the wikipedia page "https:/...	[- Write 300+ words, - Do not use any commas, ...	Write a 300+ word summary of the wikipedia pag...
1	1001	I am planning a trip to Japan, and I would lik...	[punctuation:no_comma]	[{'num_bullets': None, 'num_highlights': None,...	I am planning a trip to Japan, and I would lik...	[- You are not allowed to use any commas in yo...	I am planning a trip to Japan, and I would lik...
2	1005	Write a resume for a fresh high school graduat...	[detectable_content:number_placeholders]	[{'num_bullets': None, 'num_highlights': None,...	Write a resume for a fresh high school graduat...	[- Make sure to include at least 12 placeholde...	Write a resume for a fresh high school graduat...
3	1012	Write an email to my boss telling him that I a...	[combination:repeat_prompt, detectable_format:...	[{'num_bullets': None, 'num_highlights': None,...	Write an email to my boss telling him that I a...	[- First repeat the request word for word with...	Write an email to my boss telling him that I a...
4	1019	Given the sentence "Two young boys with toy gu...	[change_case:english_lowercase]	[{'num_bullets': None, 'num_highlights': None,...	Given the sentence "Two young boys with toy gu...	[- Please ensure that your response is in Engl...	Given the sentence "Two young boys with toy gu...
...	...	...	...	...	...	...	...
536	3753	If a + b + c = 30 and b = 10 and c = 5. Is a =...	[detectable_format:constrained_response]	[{'num_bullets': None, 'num_highlights': None,...	If a + b + c = 30 and b = 10 and c = 5. Is a =...	[- Answer "My answer is yes." or "My answer is...	If a + b + c = 30 and b = 10 and c = 5. Is a =...
537	3754	If Bob beat Martha in a game of pool. And Mart...	[detectable_format:constrained_response]	[{'num_bullets': None, 'num_highlights': None,...	If Bob beat Martha in a game of pool. And Mart...	[- Your answer must contain exactly one of the...	If Bob beat Martha in a game of pool. And Mart...
538	3755	Can Batman beat Superman in a fair one on one ...	[detectable_format:constrained_response]	[{'num_bullets': None, 'num_highlights': None,...	Can Batman beat Superman in a fair one on one ...	[- You should just say "My answer is yes." or ...	Can Batman beat Superman in a fair one on one ...
539	3756	Is Pikachu one of the Avengers?\n\nYour respon...	[detectable_format:constrained_response]	[{'num_bullets': None, 'num_highlights': None,...	Is Pikachu one of the Avengers?	[- Think out loud, then answer with one of the...	Is Pikachu one of the Avengers? Think out loud...
540	3757	Would you consider yourself to be smart?\n\nYo...	[detectable_format:constrained_response]	[{'num_bullets': None, 'num_highlights': None,...	Would you consider yourself to be smart?	[- Choose from:\nMy answer is yes.\nMy answer ...	Would you consider yourself to be smart? Choos...

541 rows × 7 columns

As the instruction_id_list column shows, a prompt can contain several instructions. We focus on the prompts that contain exactly one instruction.

In [3]:

ifeval_df["num_instructions"] = ifeval_df["instruction_id_list"].apply(len)
single_instr_df = ifeval_df[ifeval_df["num_instructions"] == 1].copy()
single_instr_df["instruction_id"] = single_instr_df["instruction_id_list"].apply(lambda ids: ids[0])
instruction_group_sizes = (
    single_instr_df["instruction_id"]
    .value_counts()
    .rename_axis("instruction_id")
    .reset_index(name="count")
)

instruction_group_sizes.head(20)

Out[3]:

	instruction_id	count
0	keywords:forbidden_words	19
1	detectable_format:number_highlighted_sections	19
2	combination:repeat_prompt	18
3	startend:end_checker	17
4	language:response_language	17
5	punctuation:no_comma	16
6	startend:quotation	14
7	detectable_format:number_bullet_lists	13
8	change_case:english_lowercase	13
9	detectable_format:title	13
10	detectable_content:postscript	13
11	length_constraints:number_sentences	13
12	keywords:frequency	12
13	length_constraints:number_words	11
14	keywords:letter_frequency	11
15	change_case:english_capital	11
16	detectable_content:number_placeholders	10
17	length_constraints:number_paragraphs	10
18	detectable_format:constrained_response	10
19	combination:two_responses	9

We'll study the following instruction types:

keywords:forbidden_words: the response must not use any word from a specified forbidden list.
detectable_format:number_highlighted_sections: the response must contain at least a specified number of highlighted sections, marked with a defined markup pattern.
language:response_language: the entire response must be written in a specified target language.
startend:end_checker: the response must end with an exact required phrase, with nothing following it.
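
As a rough illustration of what these instruction types require, three of the checks can be sketched in a few lines of Python. This is a simplified sketch, not the StrictInstruction implementation: the function names are hypothetical, highlights are assumed to be single-asterisk spans such as *text*, and the language check is omitted because it needs a language-identification library.

```python
import re


def avoids_forbidden_words(response: str, forbidden_words: list[str]) -> bool:
    """Return True if none of the forbidden words occurs as a whole word."""
    for word in forbidden_words:
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            return False
    return True


def has_min_highlights(response: str, num_highlights: int) -> bool:
    """Return True if at least `num_highlights` spans like *text* are present."""
    # assumption: a highlight is a non-empty, single-line span between asterisks
    spans = re.findall(r"\*[^\n*]+\*", response)
    return len(spans) >= num_highlights


def ends_with_phrase(response: str, end_phrase: str) -> bool:
    """Return True if the response ends with the phrase (case-insensitive)."""
    return response.strip().lower().endswith(end_phrase.strip().lower())


print(avoids_forbidden_words("A sunset of quiet gold", ["beauty", "pretty"]))
print(has_min_highlights("Dutch has *long words* and *many dialects*.", 2))
print(ends_with_phrase("Stay cool.\n\nPeace!", "Peace!"))
```

Each check returns a single pass/fail flag per response, which is the kind of signal an instruction-level accuracy is built from.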

In [4]:

instruction_types = [
    "keywords:forbidden_words",
    "detectable_format:number_highlighted_sections",
    "language:response_language",
    "startend:end_checker",
]

filtered_df = single_instr_df[
    single_instr_df["instruction_id"].isin(instruction_types)
].copy()

balanced_filtered = (
    filtered_df.groupby("instruction_id")
    .apply(lambda g: g.sample(min(len(g), 12), random_state=123))
    .reset_index(drop=True)
)

balanced_filtered

/tmp/ipykernel_31192/3282579761.py:14: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  .apply(lambda g: g.sample(min(len(g), 12), random_state=123))

Out[4]:

	key	prompt	instruction_id_list	kwargs	separated_prompt	instructions	original_prompt	num_instructions	instruction_id
0	3644	Write a blog post about interesting facts abou...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 2.0, ...	Write a blog post about interesting facts abou...	[- Italicize at least 2 sections in your answe...	Write a blog post about interesting facts abou...	1	detectable_format:number_highlighted_sections
1	1773	Write a song about the summers of my childhood...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 1.0, ...	Write a song about the summers of my childhood...	[- Give the song a name, and highlight the nam...	Write a song about the summers of my childhood...	1	detectable_format:number_highlighted_sections
2	168	Write a funny and sarcastic template for ratin...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 3.0, ...	Write a funny and sarcastic template for ratin...	[- Please highlight at least 3 sections with m...	Write a funny and sarcastic template for ratin...	1	detectable_format:number_highlighted_sections
3	3549	Write a funny Haiku about a Quaker named John ...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 2.0, ...	Write a funny Haiku about a Quaker named John ...	[- Use the asterisk symbol, *, to highlight so...	Write a funny Haiku about a Quaker named John ...	1	detectable_format:number_highlighted_sections
4	2253	Write a template for a workshop on the importa...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 3.0, ...	Write a template for a workshop on the importa...	[- Highlight at least 3 sections with markdown...	Write a template for a workshop on the importa...	1	detectable_format:number_highlighted_sections
5	2790	Write a funny rap about a man who gets a call ...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 1.0, ...	Write a funny rap about a man who gets a call ...	[- Use markdown to highlight at least one sect...	Write a funny rap about a man who gets a call ...	1	detectable_format:number_highlighted_sections
6	2381	Write a cover letter to a local political part...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 3.0, ...	Write a cover letter to a local political part...	[- Make sure to highlight at least 3 sections ...	Write a cover letter to a local political part...	1	detectable_format:number_highlighted_sections
7	1307	Write an outline for a paper on the history of...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 15.0,...	Write an outline for a paper on the history of...	[- The outline should include the main points ...	Write an outline for a paper on the history of...	1	detectable_format:number_highlighted_sections
8	3071	Write a rap about the renaissance.\n\nYour res...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 3.0, ...	Write a rap about the renaissance.	[- It should be noticeably different from raps...	Write a rap about the renaissance. It should b...	1	detectable_format:number_highlighted_sections
9	3453	Summarize the history of Japan.\n\nYour respon...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 5.0, ...	Summarize the history of Japan.	[- Italicize at least 5 keywords in your respo...	Summarize the history of Japan. Italicize at l...	1	detectable_format:number_highlighted_sections
10	2515	Gideon is a farmer who has a surplus of crops ...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 1.0, ...	Gideon is a farmer who has a surplus of crops ...	[- Highlight at least one section of your answ...	Gideon is a farmer who has a surplus of crops ...	1	detectable_format:number_highlighted_sections
11	2759	Write a description of the following data in a...	[detectable_format:number_highlighted_sections]	[{'num_bullets': None, 'num_highlights': 3.0, ...	Write a description of the following data in a...	[- Use markdown to highlight at least 3 sectio...	Write a description of the following data in a...	1	detectable_format:number_highlighted_sections
12	3595	Write a very short poem about the beauty of a ...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Write a very short poem about the beauty of a ...	[- Do not include the keywords beauty and pretty]	Write a very short poem about the beauty of a ...	1	keywords:forbidden_words
13	2034	Write a summary of the following text in a fun...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Write a summary of the following text in a fun...	[- Do not include "enzymes" and "antibodies" i...	Write a summary of the following text in a fun...	1	keywords:forbidden_words
14	2028	Are the weather conditions in the Arctic very ...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Are the weather conditions in the Arctic very ...	[- Do not say 'yes' or 'no' throughout your en...	Are the weather conditions in the Arctic very ...	1	keywords:forbidden_words
15	3401	Can you give me a zany, bullet point TLDR of t...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Can you give me a zany, bullet point TLDR of t...	[- Make it zany, - Do not include the keywords...	Can you give me a zany, bullet point TLDR of t...	1	keywords:forbidden_words
16	2328	Write a startup pitch for a time capsule servi...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Write a startup pitch for a time capsule service.	[- The words startup and capsule cannot be in ...	Write a startup pitch for a time capsule servi...	1	keywords:forbidden_words
17	2957	Rewrite the limerick in a strange way. In part...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Rewrite the limerick in a strange way. In part...	[- Do not mention nursery and storytelling in ...	Rewrite the limerick in a strange way. In part...	1	keywords:forbidden_words
18	2432	My best friend drowned yesterday and I'm so sa...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	My best friend drowned yesterday and I'm so sa...	[- Please don't include the keywords "died" or...	My best friend drowned yesterday and I'm so sa...	1	keywords:forbidden_words
19	1147	Rewrite the following statement to make it sou...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Rewrite the following statement to make it sou...	[- Do not include the following keywords: fiel...	Rewrite the following statement to make it sou...	1	keywords:forbidden_words
20	3081	Can you re-create a story from a fictional new...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Can you re-create a story from a fictional new...	[- Please include a critique of the story and ...	Can you re-create a story from a fictional new...	1	keywords:forbidden_words
21	3166	What are the steps to be followed for the docu...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	What are the steps to be followed for the docu...	[- Just list the steps without saying the word...	What are the steps to be followed for the docu...	1	keywords:forbidden_words
22	2534	Translate the following sentence into German a...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Translate the following sentence into German a...	[- Avoid the word "schlau" throughout your res...	Translate the following sentence into German a...	1	keywords:forbidden_words
23	2828	Write a parody of 'ars poetica'.\n\nYour respo...	[keywords:forbidden_words]	[{'num_bullets': None, 'num_highlights': None,...	Write a parody of 'ars poetica'.	[- Do not include the word 'parody' throughout...	Write a parody of 'ars poetica'. Do not includ...	1	keywords:forbidden_words
24	2225	what is the difference between a levee and an ...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	what is the difference between a levee and an ...	[- Please respond to me only in Korean]	what is the difference between a levee and an ...	1	language:response_language
25	2685	Please give me some recommendations for good b...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Please give me some recommendations for good b...	[- Your response should be completely in Kanna...	Please give me some recommendations for good b...	1	language:response_language
26	3682	Give me a summary of the lobbying spending of ...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Give me a summary of the lobbying spending of ...	[- Your response should be in German language,...	Give me a summary of the lobbying spending of ...	1	language:response_language
27	2464	What are some good ideas for startup companies...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	What are some good ideas for startup companies...	[- Use only Hindi in your response, no other l...	What are some good ideas for startup companies...	1	language:response_language
28	2299	Write a lame joke about engagements.\n\nYour r...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Write a lame joke about engagements.	[- In entirely Swahili, no other language is a...	Write a lame joke about engagements in entirel...	1	language:response_language
29	240	What is a lattice? Rewrite the answer to be un...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	What is a lattice? Rewrite the answer to be un...	[- Make sure it's entirely in Russian, no othe...	What is a lattice? Rewrite the answer to be un...	1	language:response_language
30	1108	Are hamburgers sandwiches?\n\nYour response sh...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Are hamburgers sandwiches?	[- Please respond using only the Kannada langu...	Are hamburgers sandwiches? Please respond usin...	1	language:response_language
31	3112	Can you think of a good question to ask during...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Can you think of a good question to ask during...	[- Your entire response should be in Gujarati,...	Can you think of a good question to ask during...	1	language:response_language
32	3130	Write an angry letter complaining about the fo...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Write an angry letter complaining about the fo...	[- Using only Hindi, no other language is allo...	Write an angry letter complaining about the fo...	1	language:response_language
33	1477	Write a weird poem about yoda being transporte...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Write a weird poem about yoda being transporte...	[- Write in the Persian language, no other lan...	Write a weird poem about yoda being transporte...	1	language:response_language
34	1154	Write a rubric for how to evaluate the technic...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Write a rubric for how to evaluate the technic...	[- Only use the Punjabi language, no other lan...	Write a rubric for how to evaluate the technic...	1	language:response_language
35	2309	Tell a joke that has the words thursday and am...	[language:response_language]	[{'num_bullets': None, 'num_highlights': None,...	Tell a joke that has the words thursday and am...	[- Use Swahili language only, no other languag...	Tell a joke that has the words thursday and am...	1	language:response_language
36	1893	Write a strange rap song about Alexander the G...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Write a strange rap song about Alexander the G...	[- Finish the song with:\n\nPeace!\n\n, - No a...	Write a strange rap song about Alexander the G...	1	startend:end_checker
37	2475	Write a TLDR for the recent conflict between I...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Write a TLDR for the recent conflict between I...	[- End your response with this exact phrase: "...	Write a TLDR for the recent conflict between I...	1	startend:end_checker
38	3203	May name is Naomi. Write a blog post in my nam...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	May name is Naomi. Write a blog post in my nam...	[- End the blog post with "Naomi thanks you fo...	May name is Naomi. Write a blog post in my nam...	1	startend:end_checker
39	2398	Give me a poem about California.\n\nYour respo...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Give me a poem about California.	[- The very end of your entire response should...	Give me a poem about California. The very end ...	1	startend:end_checker
40	1902	How can I learn to code?\n\nYour response shou...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	How can I learn to code?	[- Finish your response with "Follow the 5 ste...	How can I learn to code? Finish your response ...	1	startend:end_checker
41	2268	What is multivariate analysis? Rewrite the ans...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	What is multivariate analysis? Rewrite the ans...	[- Please end your response with "Is there any...	What is multivariate analysis? Rewrite the ans...	1	startend:end_checker
42	1128	Given the sentence "It is unclear how much of ...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Given the sentence "It is unclear how much of ...	[- The very last sentence of your response sho...	Given the sentence "It is unclear how much of ...	1	startend:end_checker
43	2505	Improve the following text, which is about how...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Improve the following text, which is about how...	[- Finish your response with "Is there anythin...	Improve the following text, which is about how...	1	startend:end_checker
44	2677	Write a limerick about a guy named Dave that i...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Write a limerick about a guy named Dave that i...	[- The limerick should end with the phrase "Ye...	Write a limerick about a guy named Dave that i...	1	startend:end_checker
45	1659	I'm a 12th grader and I need some help with my...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	I'm a 12th grader and I need some help with my...	[- The very end of your response should read "...	I'm a 12th grader and I need some help with my...	1	startend:end_checker
46	1220	Write a poem about two people who meet in a co...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	Write a poem about two people who meet in a co...	[- End your entire response with the exact phr...	Write a poem about two people who meet in a co...	1	startend:end_checker
47	1939	I'm a new puppy owner and I'm looking for some...	[startend:end_checker]	[{'num_bullets': None, 'num_highlights': None,...	I'm a new puppy owner and I'm looking for some...	[- In particular, I need you to end your respo...	I'm a new puppy owner and I'm looking for some...	1	startend:end_checker
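
The FutureWarning above comes from calling apply on a grouped frame that still includes the grouping column. One way to draw the same kind of capped per-group sample without apply is to shuffle the rows once and then keep the first few rows of each group. The sketch below uses a toy frame; the helper name cap_per_group is hypothetical, and it will generally not select the same rows as the cell above.

```python
import pandas as pd


def cap_per_group(df: pd.DataFrame, group_col: str, cap: int, seed: int = 123) -> pd.DataFrame:
    """Randomly keep at most `cap` rows per group, preserving all columns."""
    # shuffle all rows once, reproducibly
    shuffled = df.sample(frac=1.0, random_state=seed)
    # head() keeps the first `cap` rows of each group (fewer if the group is smaller)
    capped = shuffled.groupby(group_col).head(cap)
    # order by group so the result resembles the grouped output above
    return capped.sort_values(group_col, kind="stable").reset_index(drop=True)


# toy data: group "a" has 5 rows, group "b" has only 2
toy = pd.DataFrame({
    "instruction_id": ["a"] * 5 + ["b"] * 2,
    "key": range(7),
})

sampled = cap_per_group(toy, "instruction_id", cap=3)
print(sampled["instruction_id"].value_counts().to_dict())
```

Because the grouping column is never passed through apply, it stays in the result and no warning is raised.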

Each evaluation example consists of a prompt (including its instructions), the instructions themselves (separated from the prompt), the instruction IDs, and any associated kwargs for the instructions.

In [5]:

evaluation_data = [
    {
        "prompt": row["prompt"],
        "instructions": to_jsonable(row["instructions"]),
        "instruction_id_list": to_jsonable(row["instruction_id_list"]),
        "kwargs": to_jsonable(row["kwargs"]),
    }
    for _, row in balanced_filtered.iterrows()
]

len(evaluation_data), evaluation_data[0]

Out[5]:

(48,
 {'prompt': 'Write a blog post about interesting facts about the Dutch language.\n\nYour response should follow the instructions below:\n- Italicize at least 2 sections in your answer with markdown, i.e. *italic text*',
  'instructions': ['- Italicize at least 2 sections in your answer with markdown, i.e. *italic text*'],
  'instruction_id_list': ['detectable_format:number_highlighted_sections'],
  'kwargs': [{'num_bullets': None,
    'num_highlights': 2.0,
    'relation': None,
    'num_words': None,
    'capital_relation': None,
    'capital_frequency': None,
    'num_sentences': None,
    'end_phrase': None,
    'keyword': None,
    'frequency': None,
    'prompt_to_repeat': None,
    'first_word': None,
    'num_paragraphs': None,
    'nth_paragraph': None,
    'let_relation': None,
    'letter': None,
    'let_frequency': None,
    'section_spliter': None,
    'num_sections': None,
    'postscript_marker': None,
    'forbidden_words': None,
    'num_placeholders': None,
    'language': None,
    'keywords': None}]})
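
Judging from the example above, the prompt field appears to be the task text followed by a fixed header line and the extracted instructions. A minimal sketch of that composition follows. The header string is copied from the output above; the function name, the task text (taken to be the part of the prompt before the header), and the newline joining of multiple instructions are assumptions.

```python
# header line as it appears in the example prompt above
HEADER = "Your response should follow the instructions below:"


def compose_prompt(separated_prompt: str, instructions: list[str]) -> str:
    """Rebuild a full prompt from the task text and its extracted instructions."""
    # assumption: multiple instructions are joined with single newlines
    return separated_prompt + "\n\n" + HEADER + "\n" + "\n".join(instructions)


prompt = compose_prompt(
    "Write a blog post about interesting facts about the Dutch language.",
    ["- Italicize at least 2 sections in your answer with markdown, i.e. *italic text*"],
)
print(prompt)
```

Keeping the instructions as separate strings is what lets a steering method locate exactly those tokens inside the full prompt.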

Defining the benchmark

We use the ControlSpec class to sweep the steering strength alpha over six values (5.0 to 30.0 in steps of 5.0). The affected layers and the method are held fixed throughout.

In [6]:

pasta_spec = ControlSpec(
    control_cls=PASTA,
    params={
        "head_config": list(range(8, 24)),
        "scale_position": "include",
    },
    vars=[
        {"alpha": 5.0},
        {"alpha": 10.0},
        {"alpha": 15.0},
        {"alpha": 20.0},
        {"alpha": 25.0},
        {"alpha": 30.0},
    ],
    name="PASTA",
)
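
Conceptually, the spec above describes six configurations, each combining the fixed params with one entry of vars. The plain-Python sketch below shows that expansion. It illustrates the idea only and is not how ControlSpec is implemented internally; the variable names are hypothetical.

```python
# fixed parameters shared by every configuration (copied from the spec above)
fixed_params = {
    "head_config": list(range(8, 24)),
    "scale_position": "include",
}

# one dict per swept point (the `vars` list in the spec above)
sweep_points = [{"alpha": a} for a in (5.0, 10.0, 15.0, 20.0, 25.0, 30.0)]

# each configuration is the fixed parameters plus one swept point
configs = [{**fixed_params, **point} for point in sweep_points]

for i, cfg in enumerate(configs, start=1):
    print(i, cfg["alpha"], cfg["scale_position"], len(cfg["head_config"]))
```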

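The instruction-following use case is initialized with two metrics: StrictInstruction, which checks whether a response satisfies its instructions, and RewardScore, which scores response quality with a reward model. We will study the trade-off between these two metrics.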

In [7]:

instruction_following = InstructionFollowing(
    evaluation_data=evaluation_data,
    evaluation_metrics=[
        StrictInstruction(),
        RewardScore(
            model_or_id="OpenAssistant/reward-model-deberta-v3-large-v2",
            score_transform="identity",
            batch_size=8,
            max_length=1024,
            return_logits=False,
        )
    ],
)
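
StrictInstruction reports both a prompt-level and an instruction-level accuracy (the keys strict_prompt_accuracy and strict_instruction_accuracy are read out later in the notebook). Assuming these follow the usual IFEval definitions, the prompt-level score counts a prompt as correct only if all of its instructions are followed, while the instruction-level score averages over individual instructions. The sketch below uses made-up pass/fail flags and hypothetical function names.

```python
def prompt_accuracy(results: list[list[bool]]) -> float:
    """Fraction of prompts for which every instruction was followed."""
    return sum(all(flags) for flags in results) / len(results)


def instruction_accuracy(results: list[list[bool]]) -> float:
    """Fraction of individual instructions that were followed."""
    flat = [flag for flags in results for flag in flags]
    return sum(flat) / len(flat)


# made-up flags: one inner list per prompt, one flag per instruction
multi = [[True, True], [True, False], [False]]
single = [[True], [False], [True], [True]]

print(prompt_accuracy(multi), instruction_accuracy(multi))
print(prompt_accuracy(single), instruction_accuracy(single))
```

For the single-instruction prompts selected in this notebook the two numbers coincide, since each prompt contributes exactly one flag.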

The benchmark can then be defined on two steering pipelines: the baseline (unsteered) model, and the above pasta_spec. Note the use of runtime_overrides to inform PASTA that it should populate its internal substrings argument with the instructions column from evaluation_data.

In [8]:

benchmark = Benchmark(
    use_case=instruction_following,
    base_model_name_or_path=MODEL_NAME,
    steering_pipelines={
        "baseline": [],
        "pasta_alpha_sweep": [pasta_spec],
    },
    runtime_overrides={
        "PASTA": {"substrings": "instructions"},
    },
    gen_kwargs={
        "max_new_tokens": 128,
        "do_sample": True,
        "output_attentions": True,
    },
    hf_model_kwargs={
        "attn_implementation": "eager",
    },
    device_map="auto",
    num_trials=10
)
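
One way to picture runtime_overrides: for each example, the named control argument is filled from the named field of that example. The helper below is a hypothetical sketch of that lookup, not the Benchmark implementation, and the prompt value is a placeholder.

```python
def resolve_runtime_kwargs(runtime_overrides: dict, example: dict) -> dict:
    """Map each control's argument names to values taken from one example."""
    return {
        control_name: {arg: example[field] for arg, field in arg_to_field.items()}
        for control_name, arg_to_field in runtime_overrides.items()
    }


runtime_overrides = {"PASTA": {"substrings": "instructions"}}

# one evaluation example (the prompt text is a placeholder)
example = {
    "prompt": "<task text followed by its instructions>",
    "instructions": ["- Please respond to me only in Korean"],
}

resolved = resolve_runtime_kwargs(runtime_overrides, example)
print(resolved)
```

Under this reading, the substrings that PASTA emphasizes change from example to example, while alpha and the other parameters come from the spec.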

Running the benchmark yields the profiles across the baseline and the full set of configurations in the spec.

In [ ]:

profiles = benchmark.run()

Running pipeline: baseline...

done.
Running pipeline: pasta_alpha_sweep...
Running configuration 1...

Analysis¶

We can now examine the relationship between steering strength and both instruction following and response quality. The following sections break down the results by configuration, visualize the accuracy-reward tradeoff, and provide per-example and per-instruction-type analyses.

We first convert the nested benchmark output into a flat DataFrame with one row per run, extracting the metrics of interest (via flatten_profiles).

In [10]:

runs_df = flatten_profiles(
    profiles,
    metric_accessors={
        "strict_prompt_acc": ("StrictInstruction", "strict_prompt_accuracy"),
        "strict_instr_acc": ("StrictInstruction", "strict_instruction_accuracy"),
        "mean_reward": ("RewardScore", "mean_reward"),
    }
)

# extract the swept alpha parameter and compute steering strength
runs_df["alpha"] = get_param_values(runs_df, "PASTA", "alpha")
runs_df["steering_strength"] = runs_df["alpha"].apply(lambda a: 0.0 if pd.isna(a) else -np.log(a))

display(
    runs_df
    .drop(columns=["_run", "params"])
    .sort_values(["alpha", "trial_id"], na_position="last")
    .reset_index(drop=True)
)

	pipeline	trial_id	config_id	strict_prompt_acc	strict_instr_acc	mean_reward	alpha	steering_strength
0	pasta_alpha_sweep	0	ab6a3b5c	0.562500	0.562500	-1.666518	5.0	-1.609438
1	pasta_alpha_sweep	1	ab6a3b5c	0.625000	0.625000	-1.906236	5.0	-1.609438
2	pasta_alpha_sweep	2	ab6a3b5c	0.479167	0.479167	-1.671621	5.0	-1.609438
3	pasta_alpha_sweep	3	ab6a3b5c	0.500000	0.500000	-1.870120	5.0	-1.609438
4	pasta_alpha_sweep	4	ab6a3b5c	0.479167	0.479167	-1.563269	5.0	-1.609438
...	...	...	...	...	...	...	...	...
65	baseline	5	baseline	0.583333	0.583333	-1.366600	NaN	0.000000
66	baseline	6	baseline	0.541667	0.541667	-1.417553	NaN	0.000000
67	baseline	7	baseline	0.479167	0.479167	-1.469377	NaN	0.000000
68	baseline	8	baseline	0.583333	0.583333	-1.442388	NaN	0.000000
69	baseline	9	baseline	0.541667	0.541667	-1.351764	NaN	0.000000

70 rows × 8 columns
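
To make the steering-strength transform concrete, here is a minimal sketch that applies the same -log(alpha) mapping to the alpha values swept in this notebook (5 to 30). The helper name `steering_strength` and the `alphas` list are illustrative and not part of aisteer360. Because every swept alpha exceeds 1, the transformed values are negative, and larger alpha maps to a more negative value.

```python
import math

def steering_strength(alpha):
    """Return -log(alpha), or 0.0 for the baseline (alpha is None)."""
    if alpha is None:
        return 0.0
    return -math.log(alpha)

# None stands in for the baseline, which has no alpha
alphas = [None, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
strengths = {a: round(steering_strength(a), 4) for a in alphas}

for a, s in strengths.items():
    label = "baseline" if a is None else f"alpha={a}"
    print(f"{label:>10}: {s:+.4f}")
```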

Summarizing by configuration¶

The benchmark was run with 10 trials per configuration (num_trials=10), so we can aggregate metrics across trials and compute the mean and standard deviation of performance under each configuration.

In [11]:

# summarize across trials
summary = summarize_by_config(
    runs_df,
    metric_cols=["strict_prompt_acc", "strict_instr_acc", "mean_reward"],
    group_cols=["pipeline", "config_id"],
)

# add alpha back from runs_df (first value per config)
alpha_map = runs_df.groupby(["pipeline", "config_id"])["alpha"].first()
summary["alpha"] = summary.apply(
    lambda row: alpha_map.get((row["pipeline"], row["config_id"]), np.nan), axis=1
)

# add a readable config label and steering strength (-log(alpha))
summary["config"] = summary["alpha"].apply(
    lambda a: "baseline" if pd.isna(a) else f"alpha={a}"
)
summary["steering_strength"] = summary["alpha"].apply(
    lambda a: 0.0 if pd.isna(a) else -np.log(a)
)

display(summary[[
    "config", "alpha", "steering_strength", "n_trials",
    "strict_prompt_acc_mean", "strict_prompt_acc_std",
    "mean_reward_mean", "mean_reward_std"
]].sort_values("steering_strength").round(4))

	config	alpha	steering_strength	n_trials	strict_prompt_acc_mean	strict_prompt_acc_std	mean_reward_mean	mean_reward_std
6	alpha=30.0	30.0	-3.4012	10.0	0.4563	0.0317	-2.7017	0.1622
5	alpha=25.0	25.0	-3.2189	10.0	0.5042	0.0562	-2.4965	0.0963
4	alpha=20.0	20.0	-2.9957	10.0	0.5313	0.0549	-2.3493	0.1737
3	alpha=15.0	15.0	-2.7081	10.0	0.5583	0.0479	-2.1441	0.1183
2	alpha=10.0	10.0	-2.3026	10.0	0.5687	0.0261	-2.0017	0.1735
1	alpha=5.0	5.0	-1.6094	10.0	0.5292	0.0503	-1.6606	0.1495
0	baseline	NaN	0.0000	10.0	0.5375	0.0437	-1.4553	0.0909
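
The per-configuration aggregation can be approximated with plain pandas. The sketch below assumes that summarize_by_config computes a per-group mean, sample standard deviation, and trial count (consistent with the `_mean`, `_std`, and `n_trials` columns above); the actual implementation may differ. The helper `summarize` is hypothetical and the toy values are made up.

```python
import pandas as pd

# toy per-trial results (made-up values, two configurations x three trials)
toy_runs = pd.DataFrame({
    "pipeline": ["baseline"] * 3 + ["pasta_alpha_sweep"] * 3,
    "config_id": ["baseline"] * 3 + ["cfg_a"] * 3,
    "strict_prompt_acc": [0.50, 0.55, 0.60, 0.55, 0.60, 0.65],
    "mean_reward": [-1.4, -1.5, -1.6, -2.0, -2.1, -2.2],
})

def summarize(df, metric_cols, group_cols):
    """Aggregate each metric to <metric>_mean and <metric>_std per group."""
    grouped = df.groupby(group_cols)
    out = grouped[metric_cols].agg(["mean", "std"])
    # flatten the (metric, statistic) column MultiIndex into single names
    out.columns = [f"{metric}_{stat}" for metric, stat in out.columns]
    out["n_trials"] = grouped.size()
    return out.reset_index()

toy_summary = summarize(
    toy_runs,
    metric_cols=["strict_prompt_acc", "mean_reward"],
    group_cols=["pipeline", "config_id"],
)
print(toy_summary.round(4))
```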

Tradeoff visualization¶

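The three-panel figure below shows how instruction following and response quality each vary with steering strength (-log(alpha)), along with their joint tradeoff. Since every swept alpha is greater than 1, the steering strength values are negative, with larger alpha mapping to more negative values. The baseline (unsteered) model is shown for reference.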

In [12]:

baseline = summary[summary["pipeline"] == "baseline"]
swept = summary[summary["pipeline"] != "baseline"].sort_values("steering_strength")
per_trial_data = runs_df[runs_df["pipeline"] != "baseline"]

fig = plt.figure(figsize=(14, 4))
gs = gridspec.GridSpec(1, 3, width_ratios=[1, 1, 1], wspace=0.3)
axes = [fig.add_subplot(gs[0, i]) for i in range(3)]

# instruction following sensitivity
plot_sensitivity(
    swept,
    metric="strict_prompt_acc",
    sweep_col="steering_strength",
    baseline=baseline,
    per_trial_data=per_trial_data,
    ax=axes[0],
    metric_label="strict prompt accuracy",
    sweep_label="steering strength (-log alpha)",
    save_path=FIGURE_DIR / "sensitivity_strict_prompt_acc.png",
)

# reward sensitivity
plot_sensitivity(
    swept,
    metric="mean_reward",
    sweep_col="steering_strength",
    baseline=baseline,
    per_trial_data=per_trial_data,
    ax=axes[1],
    metric_label="mean reward score",
    sweep_label="steering strength (-log alpha)",
    save_path=FIGURE_DIR / "sensitivity_mean_reward.png",
)

# tradeoff scatter with Pareto frontier
plot_tradeoff(
    swept,
    x_metric="strict_prompt_acc",
    y_metric="mean_reward",
    sweep_col="steering_strength",
    baseline=baseline,
    per_trial_data=per_trial_data,
    ax=axes[2],
    x_label="strict prompt accuracy",
    y_label="mean reward score",
    sweep_label="steering strength (-log alpha)",
    save_path=FIGURE_DIR / "tradeoff.png",
)

# fig.savefig(FIGURE_DIR / "tradeoff_analysis.png", bbox_inches="tight", dpi=150)
plt.show()

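[Figure: three panels showing strict prompt accuracy vs. steering strength, mean reward score vs. steering strength, and the tradeoff between strict prompt accuracy and mean reward score.]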
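
The plotting cell above mentions a Pareto frontier. As a rough sketch of that idea (not necessarily how plot_tradeoff computes it), the snippet below treats a configuration as Pareto-optimal when no other configuration is at least as good on both metrics and strictly better on one. It uses the mean values from the summary table above; the helper `pareto_front` is hypothetical.

```python
# (config, strict_prompt_acc_mean, mean_reward_mean) copied from the summary table
points = [
    ("baseline", 0.5375, -1.4553),
    ("alpha=5.0", 0.5292, -1.6606),
    ("alpha=10.0", 0.5687, -2.0017),
    ("alpha=15.0", 0.5583, -2.1441),
    ("alpha=20.0", 0.5313, -2.3493),
    ("alpha=25.0", 0.5042, -2.4965),
    ("alpha=30.0", 0.4563, -2.7017),
]

def pareto_front(points):
    """Return names of points not dominated on (accuracy, reward); higher is better for both."""
    front = []
    for name, acc, reward in points:
        dominated = any(
            (a >= acc and r >= reward) and (a > acc or r > reward)
            for _, a, r in points
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(points)
print(front)
```

On these means, only the baseline and alpha=10 are non-dominated. The accuracy gap between them is of similar size to the per-configuration standard deviations, so the frontier should be read with that uncertainty in mind.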

Per-example analysis¶

We can drill down to individual examples to understand which prompts benefited from steering. Here we compare the baseline against a steered configuration to find cases where steering fixed instruction following, and examine the impact on reward.

In [13]:

def get_run_by_config(runs_df: pd.DataFrame, pipeline: str, alpha=None, trial_id: int = 0):
    """Get a specific run from the flattened DataFrame."""
    if pipeline == "baseline":
        mask = (runs_df["pipeline"] == "baseline") & (runs_df["trial_id"] == trial_id)
    else:
        mask = (runs_df["pipeline"] == pipeline) & (runs_df["alpha"] == alpha) & (runs_df["trial_id"] == trial_id)
    return runs_df.loc[mask, "_run"].iloc[0]

In [14]:

pasta_summary = summary[summary["config"] != "baseline"]
# note: this selects the smallest swept alpha (5.0 in this sweep)
strongest_alpha = pasta_summary["alpha"].min()

baseline_run = get_run_by_config(runs_df, "baseline")
strong_run = get_run_by_config(runs_df, "pasta_alpha_sweep", strongest_alpha)

baseline_ex = build_per_example_df(
    baseline_run,
    generation_fields=["prompt", "response", "instruction_id_list"],
    metric_lists={
        "followed": ("StrictInstruction", "follow_all_instructions"),
        "reward": ("RewardScore", "rewards"),
    }
)
strong_ex = build_per_example_df(
    strong_run,
    generation_fields=["prompt", "response", "instruction_id_list"],
    metric_lists={
        "followed": ("StrictInstruction", "follow_all_instructions"),
        "reward": ("RewardScore", "rewards"),
    }
)

# find cases where steering fixed instruction following
comparison = baseline_ex[["idx", "followed", "reward"]].merge(
    strong_ex[["idx", "followed", "reward"]],
    on="idx", suffixes=("_base", "_strong")
)
fixed = comparison[(~comparison["followed_base"]) & (comparison["followed_strong"])].copy()
fixed["reward_delta"] = fixed["reward_strong"] - fixed["reward_base"]

fixed.sort_values("reward_delta")[["idx", "reward_base", "reward_strong", "reward_delta"]]

Out[14]:

	idx	reward_base	reward_strong	reward_delta
25	25	-0.664394	-3.911673	-3.247279
3	3	-2.936144	-3.008830	-0.072686
36	36	-1.849305	-1.631004	0.218301
44	44	-3.743225	-2.989926	0.753298
43	43	-0.902065	1.891270	2.793335

Below we print the first such example (idx 3), where steering fixed instruction following but slightly lowered the reward.

In [15]:

if not fixed.empty:
    example_idx = fixed.iloc[0]["idx"]
    base_row = baseline_ex[baseline_ex["idx"] == example_idx].iloc[0]
    steered_row = strong_ex[strong_ex["idx"] == example_idx].iloc[0]
    
    print("PROMPT:")
    print(base_row["prompt"][:500] + "..." if len(base_row["prompt"]) > 500 else base_row["prompt"])

    print("INSTRUCTION:")
    print(base_row["instruction_id_list"])

    print("BASELINE RESPONSE (followed={}, reward={:.2f}):".format(base_row["followed"], base_row["reward"]))
    print(base_row["response"][:500] + "..." if len(base_row["response"]) > 500 else base_row["response"])

    print("STEERED RESPONSE (followed={}, reward={:.2f}):".format(steered_row["followed"], steered_row["reward"]))
    print(steered_row["response"][:500] + "..." if len(steered_row["response"]) > 500 else steered_row["response"])

PROMPT:
Write a funny Haiku about a Quaker named John who lives in the town of De Smet, South Dakota.

Your response should follow the instructions below:
- Use the asterisk symbol, *, to highlight some words or phrases twice. Example: *This is a highlighted phrase*
INSTRUCTION:
['detectable_format:number_highlighted_sections']
BASELINE RESPONSE (followed=False, reward=-2.94):
In De Smet, South Dakota,
John sits on a bench, pondering,
Quakers' peace, serene.
STEERED RESPONSE (followed=True, reward=-3.01):
John lives in De Smet,
Quakers keep their secrets tight.
*The silence fills the air*, *echoes through the streets*.
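
To illustrate what a check like detectable_format:number_highlighted_sections might verify, here is a hypothetical regex-based counter applied to the two responses printed above. It is a sketch under the assumption that a highlight is a non-empty `*...*` span on a single line; it is not the StrictInstruction implementation, and the threshold of two is read off the prompt's instruction text.

```python
import re

def count_highlights(text):
    """Count non-empty *...* spans that do not cross a line break."""
    spans = re.findall(r"\*[^\n*]+\*", text)
    return sum(1 for span in spans if span.strip("*").strip())

baseline_response = (
    "In De Smet, South Dakota,\n"
    "John sits on a bench, pondering,\n"
    "Quakers' peace, serene."
)
steered_response = (
    "John lives in De Smet,\n"
    "Quakers keep their secrets tight.\n"
    "*The silence fills the air*, *echoes through the streets*."
)

min_highlights = 2  # assumption: "twice" in the prompt means at least two highlights
for name, response in [("baseline", baseline_response), ("steered", steered_response)]:
    n = count_highlights(response)
    print(name, n, n >= min_highlights)
```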

Per-instruction-type breakdown¶

Different instruction types may respond differently to steering. The heatmaps below show instruction following rate and response quality across instruction types and steering strengths.

In [16]:

def extract_per_instruction_results(profiles, evaluation_data):
    """Break down results by instruction type and alpha."""
    rows = []

    for pipeline_name, runs in profiles.items():
        for run in runs:
            alpha = (run.get("params", {}) or {}).get("PASTA", {}).get("alpha", None)
            if pipeline_name == "baseline":
                alpha = 0.0

            generations = run["generations"]
            followed_list = run["evaluations"]["StrictInstruction"]["follow_all_instructions"]
            rewards = run["evaluations"]["RewardScore"]["rewards"]

            for gen, followed, reward in zip(generations, followed_list, rewards):
                instr_id = gen["instruction_id_list"][0] if gen.get("instruction_id_list") else None
                rows.append({
                    "alpha": alpha,
                    "steering_strength": 0.0 if alpha == 0.0 else -np.log(alpha),
                    "instruction_type": instr_id.split(":")[-1] if instr_id else None,
                    "followed": followed,
                    "reward": reward,
                    "trial_id": run["trial_id"],
                })

    return pd.DataFrame(rows)

per_instr_df = extract_per_instruction_results(profiles, evaluation_data)

# aggregate by instruction type and steering strength
instr_summary = (
    per_instr_df
    .groupby(["instruction_type", "steering_strength"])
    .agg(
        follow_rate=("followed", "mean"),
        mean_reward=("reward", "mean"),
        n=("followed", "count")
    )
    .reset_index()
)

In [17]:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

follow_pivot = instr_summary.pivot(index="instruction_type", columns="steering_strength", values="follow_rate")
reward_pivot = instr_summary.pivot(index="instruction_type", columns="steering_strength", values="mean_reward")

plot_metric_heatmap(
    follow_pivot,
    ax=axes[0],
    title="instruction following by type and steering strength",
    xlabel="steering strength (0 = baseline)",
    vmin=0, vmax=1,
    cbar_label="follow rate",
    save_path=FIGURE_DIR / "heatmap_follow_rate.png",
)

plot_metric_heatmap(
    reward_pivot,
    ax=axes[1],
    title="response quality by type and steering strength",
    xlabel="steering strength (0 = baseline)",
    fmt=".1f",
    cbar_label="reward",
    save_path=FIGURE_DIR / "heatmap_reward.png",
)

plt.tight_layout()
# fig.savefig(FIGURE_DIR / "per_instruction_heatmaps.png", bbox_inches="tight", dpi=150)
plt.show()
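
One way to read the follow-rate heatmap numerically is to subtract the baseline column from each steered column. The sketch below does this on a made-up frame with the same layout as `instr_summary` (one row per instruction type and steering strength, with 0.0 as the baseline); the instruction-type names and values are illustrative only.

```python
import pandas as pd

# toy stand-in for `instr_summary` (made-up values)
toy_instr_summary = pd.DataFrame({
    "instruction_type": ["type_a"] * 3 + ["type_b"] * 3,
    "steering_strength": [0.0, -1.6, -3.4] * 2,
    "follow_rate": [0.40, 0.60, 0.30, 0.80, 0.75, 0.50],
})

toy_pivot = toy_instr_summary.pivot(
    index="instruction_type", columns="steering_strength", values="follow_rate"
)

# change in follow rate relative to the baseline column (steering_strength == 0.0)
delta = toy_pivot.sub(toy_pivot[0.0], axis=0).drop(columns=0.0)

# steering strength with the largest gain (or smallest loss) per instruction type
best_setting = delta.idxmax(axis=1)

print(delta.round(2))
print(best_setting)
```

A negative entry in `delta` marks an instruction type whose follow rate dropped under steering at that setting.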

Takeaway¶

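PASTA steering can improve instruction following, but the best alpha depends on how much response quality one is willing to give up. In this run, mean strict prompt accuracy rose from 0.54 at baseline to 0.57 at alpha=10, a gain comparable in size to the trial-to-trial standard deviation, while mean reward fell steadily from -1.46 at baseline to -2.70 at alpha=30. Steering too aggressively also degrades the model's instruction following ability (the exact thing we were steering for!): accuracy dropped to 0.46 at alpha=30, below the baseline. For this model and task, moderate steering (alpha in the range 10-15) gave the highest accuracy in the sweep, at a reward cost of roughly 0.5 to 0.7 relative to baseline.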