Commonsense MCQA¶
Multiple choice question answering is a common format for evaluating a model's reasoning ability. This notebook benchmarks few-shot prompting against a LoRA adapter (trained with DPO) on the CommonsenseQA dataset. We sweep over the number of (positive) few-shot examples and study how accuracy scales relative to the fine-tuned baseline across two models.
Runtime Estimate¶
Estimated Time: 2-3 hours (fine-tuning two models on ~39k preference pairs)
Device: NVIDIA H100 GPU (80GB VRAM)
Times are approximate and vary based on dataset size, number of sweeps, and model configuration. Adjust parameters in the cells below to modify runtime.
Setup¶
import json
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import transformers
from datasets import Dataset, load_dataset
from peft import PeftType
from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils.data_utils import flatten_profiles, get_param_values, summarize_by_config
from aisteer360.evaluation.utils.viz_utils import plot_sensitivity, plot_tradeoff
transformers.logging.set_verbosity_error()
MODELS = [
"Qwen/Qwen2.5-0.5B-Instruct",
"Qwen/Qwen2.5-1.5B-Instruct",
]
NOTEBOOK_DIR = Path(__file__).parent if "__file__" in dir() else Path.cwd() / "examples/notebooks/benchmark_commonsense_mcqa"
FIGURE_DIR = NOTEBOOK_DIR / "figures"
FIGURE_DIR.mkdir(exist_ok=True)
LETTERS = "ABCDE"
Loading the data¶
We load the CommonsenseQA dataset from Hugging Face. The validation split is used for evaluation and the train split is used for steering (few-shot example pools and DPO training data).
csqa = load_dataset("tau/commonsense_qa")
eval_split = csqa["validation"]
steer_split = csqa["train"]
print(f"Evaluation split: {len(eval_split)} questions")
print(f"Steering split: {len(steer_split)} questions")
Evaluation split: 1221 questions Steering split: 9741 questions
The CommonsenseMCQA use case expects each evaluation record to contain the question text, the correct answer text, and the full list of choices (so that it can shuffle them across runs to measure positional bias). We build these records directly from the validation split.
eval_records = []
for row in eval_split:
correct_idx = LETTERS.index(row["answerKey"])
choices = row["choices"]["text"]
eval_records.append({
"id": row["id"],
"question": row["question"],
"answer": choices[correct_idx],
"choices": choices,
})
len(eval_records), eval_records[0]
(1221,
{'id': '1afa02df02c908a558b4036e80242fac',
'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?',
'answer': 'bank',
'choices': ['bank', 'library', 'department store', 'mall', 'new york']})
commonsense_mcqa = CommonsenseMCQA(
evaluation_data=eval_records,
evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
num_samples=50,
num_shuffling_runs=20
)
Two custom metrics have been created for the use case:
- MCQAAccuracy: measures the accuracy statistics of each question (across trials)
- MCQAPositionalBias: measures the positional bias (via deviation from the uniform distribution across runs)
To facilitate computation of these statistics, the use case accepts a keyword argument num_shuffling_runs dictating
how many times each question should be presented to the (steered) model under a randomized ordering of the choices.
We restrict the number of evaluation datapoints to num_samples=50 for speed.
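The positional-bias idea can be sketched in a few lines. This is an illustrative simplification, not the toolkit's MCQAPositionalBias implementation (the helper name and the exact deviation measure are assumptions): record which answer slot the model picks on each shuffling run, then measure how far the empirical slot distribution deviates from uniform.

```python
from collections import Counter

def positional_bias(picked_positions: list[int], num_choices: int = 5) -> float:
    """Mean absolute deviation of the empirical position distribution from uniform.

    `picked_positions` holds the slot index (0..num_choices-1) the model selected
    on each shuffling run. With no positional bias, every slot is picked with
    probability 1/num_choices, and the deviation is zero.
    """
    counts = Counter(picked_positions)
    n = len(picked_positions)
    uniform = 1 / num_choices
    return sum(abs(counts.get(i, 0) / n - uniform) for i in range(num_choices)) / num_choices

# a model that always picks the first slot is maximally biased
print(positional_bias([0] * 20))             # prints ~0.32
# a model spreading its picks evenly has zero bias
print(positional_bias([0, 1, 2, 3, 4] * 4))  # prints 0.0
```

Shuffling the choices across `num_shuffling_runs` presentations is what makes this measurable: a model that genuinely knows the answer tracks the content as it moves, while a biased model keeps picking the same slot.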
Preparing the steering data¶
Both steering methods draw from the train (steer) split and share a common MCQA prompt format: a question with lettered choices and a single-letter response. We define this format once and reuse it for both the few-shot example pools and the DPO preference pairs.
def format_mcqa_prompt(question: str, choices: list[str]) -> str:
lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
lines.append(f"\nQuestion: {question}\n")
for i, choice in enumerate(choices):
lines.append(f"{LETTERS[i]}. {choice}")
lines.append("\nPlease only print the letter corresponding to your choice.")
lines.append("\nAnswer:")
return "\n".join(lines)
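As a quick sanity check, here is the formatter applied to a toy question (the function body is reproduced from the cell above; the question itself is made up):

```python
LETTERS = "ABCDE"

def format_mcqa_prompt(question: str, choices: list[str]) -> str:
    # same formatter as defined above
    lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
    lines.append(f"\nQuestion: {question}\n")
    for i, choice in enumerate(choices):
        lines.append(f"{LETTERS[i]}. {choice}")
    lines.append("\nPlease only print the letter corresponding to your choice.")
    lines.append("\nAnswer:")
    return "\n".join(lines)

print(format_mcqa_prompt("What do bees produce?", ["honey", "milk", "paper"]))
```

The prompt always ends with "Answer:", so a well-behaved model's next token is the letter itself, which is what both the few-shot pools and the DPO completions target.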
Few-shot example pools¶
For FewShot, we build positive and negative example pools from the training split. Each example is a formatted MCQA prompt paired with a letter answer, matching the format the model will see at evaluation time.
positive_pool = []
negative_pool = []
for row in steer_split:
choices = row["choices"]["text"]
correct_idx = LETTERS.index(row["answerKey"])
prompt = format_mcqa_prompt(row["question"], choices)
wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
positive_pool.append({"prompt": prompt, "answer": LETTERS[correct_idx]})
negative_pool.append({"prompt": prompt, "answer": LETTERS[wrong_indices[0]]})  # first wrong answer serves as the negative
print(f"Few-shot pools: {len(positive_pool)} positive, {len(negative_pool)} negative")
print(f"\nExample positive:")
print(f" Prompt: {positive_pool[0]['prompt']}...")
print(f" Answer: {positive_pool[0]['answer']}")
Few-shot pools: 9741 positive, 9741 negative Example positive: Prompt: You will be given a multiple-choice question and asked to select from a set of choices. Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change? A. ignore B. enforce C. authoritarian D. yell at E. avoid Please only print the letter corresponding to your choice. Answer:... Answer: A
DPO preference pairs¶
For DPO, we create preference pairs using the same prompt format. Each pair contrasts the correct letter against an incorrect one. To increase training diversity, we create up to four pairs per question by contrasting the correct answer against each wrong answer.
dpo_pairs = []
for row in steer_split:
choices = row["choices"]["text"]
correct_idx = LETTERS.index(row["answerKey"])
prompt = format_mcqa_prompt(row["question"], choices)
wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
for wrong_idx in wrong_indices[:4]:
dpo_pairs.append({
"prompt": prompt,
"chosen": LETTERS[correct_idx],
"rejected": LETTERS[wrong_idx],
})
train_ds = Dataset.from_list(dpo_pairs)
print(f"Created {len(train_ds)} DPO preference pairs from {len(steer_split)} questions")
print(f"\nExample pair:")
print(f" Prompt: {train_ds[0]['prompt'][:200]}...")
print(f" Chosen: {train_ds[0]['chosen']}")
print(f" Rejected: {train_ds[0]['rejected']}")
Created 38964 DPO preference pairs from 9741 questions Example pair: Prompt: You will be given a multiple-choice question and asked to select from a set of choices. Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the schoo... Chosen: A Rejected: B
Defining the controls¶
FewShot with ControlSpec¶
One goal of this notebook is to explore how the number of (in-context) examples impacts model behavior. We use the toolkit's ControlSpec class to sweep over different values of k_positive for the FewShot control, fixing k_negative=0 (pinned in the params block of the spec) to isolate the effect of positive examples.
few_shot_spec = ControlSpec(
control_cls=FewShot,
params={
"selector_name": "random",
"positive_example_pool": positive_pool,
"negative_example_pool": negative_pool,
"k_negative": 0,
},
vars=[{"k_positive": k} for k in [1, 5, 10, 25, 50, 100]],
name="FewShot",
)
DPO with LoRA¶
The DPO-LoRA control fine-tunes a LoRA adapter on the preference pairs (dpo_pairs) created above. The two models have slightly different training requirements, so we define a convenience function that builds each control from a per-model config.
DPO_CONFIGS = {
"Qwen/Qwen2.5-0.5B-Instruct": {
"learning_rate": 5e-5,
"num_train_epochs": 5,
},
"Qwen/Qwen2.5-1.5B-Instruct": {
"learning_rate": 2e-5,
"num_train_epochs": 3,
}
}
def create_dpo_control(model_name: str) -> DPO:
"""Create a DPO control with model-specific hyperparameters."""
short_name = model_name.split("/")[-1]
config = DPO_CONFIGS.get(model_name, DPO_CONFIGS["Qwen/Qwen2.5-0.5B-Instruct"])
return DPO(
train_dataset=train_ds,
# DPO / TRL config
output_dir=NOTEBOOK_DIR / f"trl_models/{short_name}-DPO-Lora-Steer",
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
num_train_epochs=config["num_train_epochs"],
learning_rate=config["learning_rate"],
beta=0.1,
loss_type="sigmoid",
max_length=512,
max_prompt_length=450,
disable_dropout=True,
logging_steps=200,
save_strategy="no",
report_to="none",
seed=123,
# LoRA config
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
adapter_name="dpo",
merge_lora_after_train=False,
)
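For reference, the sigmoid DPO loss configured above (loss_type="sigmoid", beta=0.1) rewards pairs where the policy shifts probability toward the chosen letter relative to the reference model. A minimal numeric sketch of the per-pair loss (the helper and the log-probabilities are illustrative, not part of the toolkit or TRL):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp      # how much the policy boosts the chosen answer
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # how much it boosts the rejected one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-logits)))

# policy that favors the chosen letter over the reference -> loss below log(2)
print(dpo_sigmoid_loss(-0.5, -3.0, -1.0, -1.0))
# indifferent policy (equal ratios) -> loss = log(2), about 0.693
print(dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0))
```

Because completions here are single letters, the log-probabilities reduce to one token each, which keeps the preference signal clean; beta scales how sharply the loss penalizes disagreement with the preference.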
Running the benchmark¶
The benchmark compares three steering approaches across multiple model sizes:
- "baseline": Unsteered model
- "few_shot_sweep": FewShot with varying
k_positive(1, 5, 10, 25, 50) - "dpo_lora": DPO-trained LoRA adapter
We run the benchmark with num_trials=5 to capture statistical variability across generation runs (at the cost of slower execution).
all_profiles = {}
for model_name in MODELS:
short_name = model_name.split("/")[-1]
print(f"Running benchmark for {short_name}")
dpo_lora = create_dpo_control(model_name)
benchmark = Benchmark(
use_case=commonsense_mcqa,
base_model_name_or_path=model_name,
steering_pipelines={
"baseline": [],
"few_shot_sweep": [few_shot_spec],
"dpo_lora": [dpo_lora],
},
gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
device_map="auto",
num_trials=5,
save_dir=NOTEBOOK_DIR / f"profiles_{short_name}",
)
profiles = benchmark.run()
all_profiles[short_name] = profiles
benchmark.export(profiles, save_dir=f"./profiles/{short_name}/")
Running benchmark for Qwen2.5-0.5B-Instruct
Resumed from checkpoint: 40 run(s) across 3 pipeline(s). Running pipeline: baseline... done. Running pipeline: few_shot_sweep... Skipping config db660184; restored 5 run(s) from checkpoint. Skipping config 3fa35b3e; restored 5 run(s) from checkpoint. Skipping config 7d3665fe; restored 5 run(s) from checkpoint. Skipping config 453046c6; restored 5 run(s) from checkpoint. Skipping config 8c3d0483; restored 5 run(s) from checkpoint. Skipping config 4d3eafee; restored 5 run(s) from checkpoint. done. Running pipeline: dpo_lora... done. Running benchmark for Qwen2.5-1.5B-Instruct Resumed from checkpoint: 30 run(s) across 2 pipeline(s). Running pipeline: baseline... done. Running pipeline: few_shot_sweep... Skipping config db660184; restored 5 run(s) from checkpoint. Skipping config 3fa35b3e; restored 5 run(s) from checkpoint. Skipping config 7d3665fe; restored 5 run(s) from checkpoint. Skipping config 453046c6; restored 5 run(s) from checkpoint. Skipping config 8c3d0483; restored 5 run(s) from checkpoint. Running configuration 6...
Analysis¶
We now analyze the benchmark results across both models.
First, we flatten the nested profiles into a single DataFrame with one row per trial (using the toolkit's utility flatten_profiles), then aggregate across trials to get mean and standard deviation.
dfs = []
for model_name, profiles in all_profiles.items():
df = flatten_profiles(
profiles,
metric_accessors={
"accuracy": ("MCQAAccuracy", "question_mean"),
"positional_bias": ("MCQAPositionalBias", "mean"),
}
)
df["model"] = model_name
df["k_positive"] = get_param_values(df, "FewShot", "k_positive")
dfs.append(df)
runs_df = pd.concat(dfs, ignore_index=True)
runs_df[["model", "pipeline", "trial_id", "k_positive", "accuracy", "positional_bias"]]
| | model | pipeline | trial_id | k_positive | accuracy | positional_bias |
|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | 0 | NaN | 0.40 | 0.075600 |
| 1 | Qwen2.5-0.5B-Instruct | baseline | 1 | NaN | 0.46 | 0.073200 |
| 2 | Qwen2.5-0.5B-Instruct | baseline | 2 | NaN | 0.42 | 0.075200 |
| 3 | Qwen2.5-0.5B-Instruct | baseline | 3 | NaN | 0.42 | 0.074800 |
| 4 | Qwen2.5-0.5B-Instruct | baseline | 4 | NaN | 0.46 | 0.075200 |
| ... | ... | ... | ... | ... | ... | ... |
| 75 | Qwen2.5-1.5B-Instruct | dpo_lora | 0 | NaN | 0.86 | 0.020000 |
| 76 | Qwen2.5-1.5B-Instruct | dpo_lora | 1 | NaN | 0.80 | 0.055222 |
| 77 | Qwen2.5-1.5B-Instruct | dpo_lora | 2 | NaN | 0.88 | 0.016400 |
| 78 | Qwen2.5-1.5B-Instruct | dpo_lora | 3 | NaN | 0.88 | 0.012800 |
| 79 | Qwen2.5-1.5B-Instruct | dpo_lora | 4 | NaN | 0.84 | 0.026000 |
80 rows × 6 columns
Next, we summarize by configuration (aggregating across trials) then attach the corresponding k_positive value for each of the few-shot rows.
summary_df = summarize_by_config(
runs_df,
metric_cols=["accuracy", "positional_bias"],
group_cols=["model", "pipeline", "config_id"]
)
k_map = runs_df.groupby(["model", "pipeline", "config_id"])["k_positive"].first()
summary_df["k_positive"] = summary_df.apply(
lambda row: k_map.get((row["model"], row["pipeline"], row["config_id"]), np.nan), axis=1
)
summary_df[["model", "pipeline", "k_positive", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].round(3)
| | model | pipeline | k_positive | n_trials | accuracy_mean | accuracy_std | positional_bias_mean | positional_bias_std |
|---|---|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | NaN | 5.0 | 0.432 | 0.027 | 0.075 | 0.001 |
| 1 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.384 | 0.043 | 0.128 | 0.001 |
| 2 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.428 | 0.011 | 0.117 | 0.007 |
| 3 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.468 | 0.023 | 0.113 | 0.003 |
| 4 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 25.0 | 5.0 | 0.520 | 0.028 | 0.093 | 0.009 |
| 5 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 50.0 | 5.0 | 0.528 | 0.050 | 0.087 | 0.003 |
| 6 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 100.0 | 5.0 | 0.480 | 0.037 | 0.083 | 0.003 |
| 7 | Qwen2.5-0.5B-Instruct | dpo_lora | NaN | 5.0 | 0.624 | 0.017 | 0.093 | 0.001 |
| 8 | Qwen2.5-1.5B-Instruct | baseline | NaN | 5.0 | 0.760 | 0.037 | 0.015 | 0.004 |
| 9 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.736 | 0.026 | 0.025 | 0.007 |
| 10 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.764 | 0.026 | 0.023 | 0.006 |
| 11 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.768 | 0.023 | 0.023 | 0.007 |
| 12 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 25.0 | 5.0 | 0.788 | 0.023 | 0.026 | 0.003 |
| 13 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 50.0 | 5.0 | 0.784 | 0.033 | 0.029 | 0.002 |
| 14 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 100.0 | 5.0 | 0.800 | 0.014 | 0.025 | 0.002 |
| 15 | Qwen2.5-1.5B-Instruct | dpo_lora | NaN | 5.0 | 0.852 | 0.033 | 0.026 | 0.017 |
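The aggregation performed by summarize_by_config can be approximated with plain pandas; the sketch below uses toy data and is only an approximation (the toolkit helper may carry extra bookkeeping such as config metadata):

```python
import pandas as pd

# toy per-trial rows mimicking runs_df (values are illustrative)
runs = pd.DataFrame({
    "model": ["m"] * 4,
    "pipeline": ["few_shot_sweep"] * 4,
    "config_id": ["a", "a", "b", "b"],
    "accuracy": [0.40, 0.44, 0.50, 0.54],
})

# group by configuration and aggregate across trials
summary = (
    runs.groupby(["model", "pipeline", "config_id"])["accuracy"]
        .agg(accuracy_mean="mean", accuracy_std="std", n_trials="count")
        .reset_index()
)
print(summary)
```

One row per (model, pipeline, config_id) triple, with the mean and standard deviation computed over the trials in that group, mirrors the summary table above.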
DPO vs. FewShot¶
We now examine how the DPO-LoRA control performs in comparison to FewShot, particularly as we scale the number of (positive) examples. Both the DPO-LoRA control and baseline (unsteered) pipelines are shown as horizontal reference lines passed in using the compare_to_pipelines argument.
few_shot_df = summary_df[summary_df["pipeline"] == "few_shot_sweep"].copy()
few_shot_df = few_shot_df.sort_values(["model", "k_positive"])
# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
ylim_accuracy = (max(0, all_accuracy.min() - 0.1), min(1, all_accuracy.max() + 0.1))
n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 4))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)
for idx, model_name in enumerate(MODELS):
short_name = model_name.split("/")[-1]
ax = fig.add_subplot(gs[0, idx])
# extract data under each pipeline
model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
# individual trial data (for scatter overlay)
model_trials = runs_df[(runs_df["model"] == short_name) & (runs_df["pipeline"] == "few_shot_sweep")]
plot_sensitivity(
swept=model_swept,
metric="accuracy",
sweep_col="k_positive",
per_trial_data=model_trials,
compare_to_pipelines=[
("baseline", model_baseline),
("DPO-LoRA", model_dpo),
],
ax=ax,
metric_label="accuracy",
sweep_label="k_positive",
title=short_name,
ylim=ylim_accuracy,
)
fig.savefig(FIGURE_DIR / "sensitivity_accuracy.png", bbox_inches="tight", dpi=150)
plt.show()
Fine-tuning (via DPO-LoRA) produces a clear jump in performance for both models. For the FewShot control, adding a single example actually degrades accuracy relative to the baseline. Increasing the number of examples generally improves performance, although accuracy declines beyond 50 examples for the 0.5B model (and appears to saturate for the 1.5B model).
Accuracy vs positional bias tradeoff¶
We now examine whether there is a tradeoff between accuracy and positional bias across methods. The FewShot configurations are colored by k_positive, with the baseline shown as a black X marker and DPO-LoRA as a red square. The Pareto frontier indicates configurations that are not dominated by any other.
# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
all_bias = runs_df["positional_bias"].dropna()
xlim_tradeoff = (max(0, all_accuracy.min() - 0.05), min(1, all_accuracy.max() + 0.05))
ylim_tradeoff = (max(0, all_bias.min() - 0.02), all_bias.max() + 0.02)
n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 5))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)
for idx, model_name in enumerate(MODELS):
short_name = model_name.split("/")[-1]
ax = fig.add_subplot(gs[0, idx])
model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
plot_tradeoff(
swept=model_swept,
x_metric="accuracy",
y_metric="positional_bias",
sweep_col="k_positive",
compare_to_pipelines=[
("baseline", model_baseline),
("DPO-LoRA", model_dpo),
],
ax=ax,
x_label="accuracy",
y_label="positional bias",
sweep_label="k_positive",
title=short_name,
show_pareto=True,
maximize_x=True,
maximize_y=False,
xlim=xlim_tradeoff,
ylim=ylim_tradeoff,
)
fig.savefig(FIGURE_DIR / "tradeoff.png", bbox_inches="tight", dpi=150)
plt.show()
Generally, few-shot steering with a small-to-moderate number of examples causes positional bias to jump even when accuracy improves. Interestingly, for the 0.5B model, as the number of examples increases (to 25-100), positional bias starts to fall while accuracy continues to improve. The DPO-trained model achieves the highest accuracy with a slightly higher positional bias than the best few-shot case (50 examples). The same pattern appears, though less pronounced, in the 1.5B model.
Summary table¶
The table below summarizes all configurations across methods and models.
method_order = ["baseline", "FewShot (k=1)", "FewShot (k=5)", "FewShot (k=10)", "FewShot (k=25)", "FewShot (k=50)", "FewShot (k=100)", "DPO-LoRA"]
summary_table = summary_df.copy()
summary_table["method"] = summary_table.apply(
lambda row: "baseline" if row["pipeline"] == "baseline"
else "DPO-LoRA" if row["pipeline"] == "dpo_lora"
else f"FewShot (k={int(row['k_positive'])})",
axis=1
)
model_order = [m.split("/")[-1] for m in MODELS]
summary_table["model_order"] = summary_table["model"].apply(lambda m: model_order.index(m) if m in model_order else len(model_order))
summary_table["method_order"] = summary_table["method"].apply(lambda m: method_order.index(m) if m in method_order else len(method_order))
display_df = summary_table.sort_values(["model_order", "method_order"])[
["model", "method", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]
].copy()
display_df.columns = ["model", "method", "trials", "accuracy (mean)", "accuracy (std)", "pos bias (mean)", "pos bias (std)"]
display_df.style.format({
"accuracy (mean)": "{:.1%}",
"accuracy (std)": "{:.1%}",
"pos bias (mean)": "{:.3f}",
"pos bias (std)": "{:.3f}",
}).background_gradient(subset=["accuracy (mean)"], cmap="RdYlGn")
| | model | method | trials | accuracy (mean) | accuracy (std) | pos bias (mean) | pos bias (std) |
|---|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | 5.000000 | 43.2% | 2.7% | 0.075 | 0.001 |
| 1 | Qwen2.5-0.5B-Instruct | FewShot (k=1) | 5.000000 | 38.4% | 4.3% | 0.128 | 0.001 |
| 2 | Qwen2.5-0.5B-Instruct | FewShot (k=5) | 5.000000 | 42.8% | 1.1% | 0.117 | 0.007 |
| 3 | Qwen2.5-0.5B-Instruct | FewShot (k=10) | 5.000000 | 46.8% | 2.3% | 0.113 | 0.003 |
| 4 | Qwen2.5-0.5B-Instruct | FewShot (k=25) | 5.000000 | 52.0% | 2.8% | 0.093 | 0.009 |
| 5 | Qwen2.5-0.5B-Instruct | FewShot (k=50) | 5.000000 | 52.8% | 5.0% | 0.087 | 0.003 |
| 6 | Qwen2.5-0.5B-Instruct | FewShot (k=100) | 5.000000 | 48.0% | 3.7% | 0.083 | 0.003 |
| 7 | Qwen2.5-0.5B-Instruct | DPO-LoRA | 5.000000 | 62.4% | 1.7% | 0.093 | 0.001 |
| 8 | Qwen2.5-1.5B-Instruct | baseline | 5.000000 | 76.0% | 3.7% | 0.015 | 0.004 |
| 9 | Qwen2.5-1.5B-Instruct | FewShot (k=1) | 5.000000 | 73.6% | 2.6% | 0.025 | 0.007 |
| 10 | Qwen2.5-1.5B-Instruct | FewShot (k=5) | 5.000000 | 76.4% | 2.6% | 0.023 | 0.006 |
| 11 | Qwen2.5-1.5B-Instruct | FewShot (k=10) | 5.000000 | 76.8% | 2.3% | 0.023 | 0.007 |
| 12 | Qwen2.5-1.5B-Instruct | FewShot (k=25) | 5.000000 | 78.8% | 2.3% | 0.026 | 0.003 |
| 13 | Qwen2.5-1.5B-Instruct | FewShot (k=50) | 5.000000 | 78.4% | 3.3% | 0.029 | 0.002 |
| 14 | Qwen2.5-1.5B-Instruct | FewShot (k=100) | 5.000000 | 80.0% | 1.4% | 0.025 | 0.002 |
| 15 | Qwen2.5-1.5B-Instruct | DPO-LoRA | 5.000000 | 85.2% | 3.3% | 0.026 | 0.017 |
Takeaways¶
This notebook compared few-shot prompting against a DPO-trained LoRA adapter on a commonsense MCQA task. Under the models studied (Qwen/Qwen2.5-0.5B-Instruct and Qwen/Qwen2.5-1.5B-Instruct), fine-tuning outperforms FewShot for both models. A single in-context example degrades performance relative to the baseline (in both models). Positional bias increases under a small-to-moderate number of examples but falls as examples increase further (25-100). Accuracy gains under few-shot prompting appear to saturate, or even degrade, as the number of examples grows, and generally reach roughly half the gains of the DPO-trained models.