Commonsense MCQA¶
Multiple choice question answering is a common format for evaluating a model's reasoning ability. This notebook benchmarks few-shot prompting against a LoRA adapter (trained with DPO) on the CommonsenseQA dataset. We sweep over the number of (positive) few-shot examples and study how accuracy scales relative to the fine-tuned baseline across two models.
Runtime Estimate¶
Estimated Time: 2-3 hours (fine-tuning two models on ~39k preference pairs)
Device: NVIDIA H100 GPU (80GB VRAM)
Times are approximate and vary based on dataset size, number of sweeps, and model configuration. Adjust parameters in the cells below to modify runtime.
Setup¶
import json
from pathlib import Path
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
import pandas as pd
import transformers
from datasets import Dataset, load_dataset
from peft import PeftType
from aisteer360.algorithms.input_control.few_shot.control import FewShot
from aisteer360.algorithms.core.specs import ControlSpec
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
from aisteer360.evaluation.use_cases.commonsense_mcqa.use_case import CommonsenseMCQA
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_accuracy import MCQAAccuracy
from aisteer360.evaluation.metrics.custom.commonsense_mcqa.mcqa_positional_bias import MCQAPositionalBias
from aisteer360.evaluation.benchmark import Benchmark
from aisteer360.evaluation.utils.data_utils import flatten_profiles, get_param_values, summarize_by_config
from aisteer360.evaluation.utils.viz_utils import plot_sensitivity, plot_tradeoff
transformers.logging.set_verbosity_error()
MODELS = [
"Qwen/Qwen2.5-0.5B-Instruct",
"Qwen/Qwen2.5-1.5B-Instruct",
]
NOTEBOOK_DIR = Path(__file__).parent if "__file__" in dir() else Path.cwd() / "examples/notebooks/benchmark_commonsense_mcqa"
FIGURE_DIR = NOTEBOOK_DIR / "figures"
FIGURE_DIR.mkdir(exist_ok=True)
LETTERS = "ABCDE"
Loading the data¶
We load the CommonsenseQA dataset from Hugging Face. The validation split is used for evaluation and the train split is used for steering (few-shot example pools and DPO training data).
csqa = load_dataset("tau/commonsense_qa")
eval_split = csqa["validation"]
steer_split = csqa["train"]
print(f"Evaluation split: {len(eval_split)} questions")
print(f"Steering split: {len(steer_split)} questions")
Evaluation split: 1221 questions Steering split: 9741 questions
The CommonsenseMCQA use case expects each evaluation record to contain the question text, the correct answer text, and the full list of choices (so that it can shuffle them across runs to measure positional bias). We build these records directly from the validation split.
eval_records = []
for row in eval_split:
correct_idx = LETTERS.index(row["answerKey"])
choices = row["choices"]["text"]
eval_records.append({
"id": row["id"],
"question": row["question"],
"answer": choices[correct_idx],
"choices": choices,
})
len(eval_records), eval_records[0]
(1221,
{'id': '1afa02df02c908a558b4036e80242fac',
'question': 'A revolving door is convenient for two direction travel, but it also serves as a security measure at a what?',
'answer': 'bank',
'choices': ['bank', 'library', 'department store', 'mall', 'new york']})
commonsense_mcqa = CommonsenseMCQA(
evaluation_data=eval_records,
evaluation_metrics=[MCQAAccuracy(), MCQAPositionalBias()],
num_samples=50,
num_shuffling_runs=20
)
Two custom metrics have been created for the use case:
- MCQAAccuracy: measures the accuracy statistics of each question (across trials)
- MCQAPositionalBias: measures the positional bias (via deviation from the uniform distribution across runs)
To facilitate computation of these statistics, the use case accepts a keyword argument num_shuffling_runs dictating
how many times each question should be presented to the (steered) model under a randomized ordering of the choices.
We restrict the number of evaluation datapoints to num_samples=50 for speed.
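The positional-bias idea can be sketched in a few lines. This is an illustrative simplification, not the toolkit's MCQAPositionalBias implementation (the helper name and the exact deviation measure are assumptions): record which answer slot the model picks on each shuffling run, then measure how far the empirical slot distribution deviates from uniform.

```python
from collections import Counter

def positional_bias(picked_positions: list[int], num_choices: int = 5) -> float:
    """Mean absolute deviation of the empirical position distribution from uniform.

    `picked_positions` holds the slot index (0..num_choices-1) the model selected
    on each shuffling run. With no positional bias, every slot is picked with
    probability 1/num_choices, and the deviation is zero.
    """
    counts = Counter(picked_positions)
    n = len(picked_positions)
    uniform = 1 / num_choices
    return sum(abs(counts.get(i, 0) / n - uniform) for i in range(num_choices)) / num_choices

# a model that always picks the first slot is maximally biased
print(positional_bias([0] * 20))             # prints ~0.32
# a model spreading its picks evenly has zero bias
print(positional_bias([0, 1, 2, 3, 4] * 4))  # prints 0.0
```

Shuffling the choices across `num_shuffling_runs` presentations is what makes this measurable: a model that genuinely knows the answer tracks the content as it moves, while a biased model keeps picking the same slot.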
Preparing the steering data¶
Both steering methods draw from the train (steer) split and share a common MCQA prompt format: a question with lettered choices and a single-letter response. We define this format once and reuse it for both the few-shot example pools and the DPO preference pairs.
def format_mcqa_prompt(question: str, choices: list[str]) -> str:
lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
lines.append(f"\nQuestion: {question}\n")
for i, choice in enumerate(choices):
lines.append(f"{LETTERS[i]}. {choice}")
lines.append("\nPlease only print the letter corresponding to your choice.")
lines.append("\nAnswer:")
return "\n".join(lines)
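As a quick sanity check, here is the formatter applied to a toy question (the function body is reproduced from the cell above; the question itself is made up):

```python
LETTERS = "ABCDE"

def format_mcqa_prompt(question: str, choices: list[str]) -> str:
    # same formatter as defined above
    lines = ["You will be given a multiple-choice question and asked to select from a set of choices."]
    lines.append(f"\nQuestion: {question}\n")
    for i, choice in enumerate(choices):
        lines.append(f"{LETTERS[i]}. {choice}")
    lines.append("\nPlease only print the letter corresponding to your choice.")
    lines.append("\nAnswer:")
    return "\n".join(lines)

print(format_mcqa_prompt("What do bees produce?", ["honey", "milk", "paper"]))
```

The prompt always ends with "Answer:", so a well-behaved model's next token is the letter itself, which is what both the few-shot pools and the DPO completions target.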
Few-shot example pools¶
For FewShot, we build positive and negative example pools from the training split. Each example is a formatted MCQA prompt paired with a letter answer, matching the format the model will see at evaluation time.
positive_pool = []
negative_pool = []
for row in steer_split:
choices = row["choices"]["text"]
correct_idx = LETTERS.index(row["answerKey"])
prompt = format_mcqa_prompt(row["question"], choices)
wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
positive_pool.append({"prompt": prompt, "answer": LETTERS[correct_idx]})
negative_pool.append({"prompt": prompt, "answer": LETTERS[wrong_indices[0]]})  # first wrong answer serves as the negative
print(f"Few-shot pools: {len(positive_pool)} positive, {len(negative_pool)} negative")
print(f"\nExample positive:")
print(f" Prompt: {positive_pool[0]['prompt']}...")
print(f" Answer: {positive_pool[0]['answer']}")
Few-shot pools: 9741 positive, 9741 negative Example positive: Prompt: You will be given a multiple-choice question and asked to select from a set of choices. Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change? A. ignore B. enforce C. authoritarian D. yell at E. avoid Please only print the letter corresponding to your choice. Answer:... Answer: A
DPO preference pairs¶
For DPO, we create preference pairs using the same prompt format. Each pair contrasts the correct letter against an incorrect one. To increase training diversity, we create up to four pairs per question by contrasting the correct answer against each wrong answer.
dpo_pairs = []
for row in steer_split:
choices = row["choices"]["text"]
correct_idx = LETTERS.index(row["answerKey"])
prompt = format_mcqa_prompt(row["question"], choices)
wrong_indices = [i for i in range(len(choices)) if i != correct_idx]
for wrong_idx in wrong_indices[:4]:
dpo_pairs.append({
"prompt": prompt,
"chosen": LETTERS[correct_idx],
"rejected": LETTERS[wrong_idx],
})
train_ds = Dataset.from_list(dpo_pairs)
print(f"Created {len(train_ds)} DPO preference pairs from {len(steer_split)} questions")
print(f"\nExample pair:")
print(f" Prompt: {train_ds[0]['prompt'][:200]}...")
print(f" Chosen: {train_ds[0]['chosen']}")
print(f" Rejected: {train_ds[0]['rejected']}")
Created 38964 DPO preference pairs from 9741 questions Example pair: Prompt: You will be given a multiple-choice question and asked to select from a set of choices. Question: The sanctions against the school were a punishing blow, and they seemed to what the efforts the schoo... Chosen: A Rejected: B
Defining the controls¶
FewShot with ControlSpec¶
One goal of this notebook is to explore how the number of (in-context) examples impacts model behavior. We use the toolkit's ControlSpec class to sweep over different values of k_positive for the FewShot control, fixing k_negative=0 (pinned in the params block of the spec) to isolate the effect of positive examples.
few_shot_spec = ControlSpec(
control_cls=FewShot,
params={
"selector_name": "random",
"positive_example_pool": positive_pool,
"negative_example_pool": negative_pool,
"k_negative": 0,
},
vars=[{"k_positive": k} for k in [1, 5, 10, 25, 50, 100]],
name="FewShot",
)
DPO with LoRA¶
The DPO-LoRA control fine-tunes a LoRA adapter on the preference pairs (dpo_pairs) created above. The two models have slightly different training requirements, so we define a convenience function that builds each control from a per-model config.
DPO_CONFIGS = {
"Qwen/Qwen2.5-0.5B-Instruct": {
"learning_rate": 5e-5,
"num_train_epochs": 5,
},
"Qwen/Qwen2.5-1.5B-Instruct": {
"learning_rate": 2e-5,
"num_train_epochs": 3,
}
}
def create_dpo_control(model_name: str) -> DPO:
"""Create a DPO control with model-specific hyperparameters."""
short_name = model_name.split("/")[-1]
config = DPO_CONFIGS.get(model_name, DPO_CONFIGS["Qwen/Qwen2.5-0.5B-Instruct"])
return DPO(
train_dataset=train_ds,
# DPO / TRL config
output_dir=NOTEBOOK_DIR / f"trl_models/{short_name}-DPO-Lora-Steer",
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
num_train_epochs=config["num_train_epochs"],
learning_rate=config["learning_rate"],
beta=0.1,
loss_type="sigmoid",
max_length=512,
max_prompt_length=450,
disable_dropout=True,
logging_steps=200,
save_strategy="no",
report_to="none",
seed=123,
# LoRA config
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
adapter_name="dpo",
merge_lora_after_train=False,
)
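For reference, the sigmoid DPO loss configured above (loss_type="sigmoid", beta=0.1) rewards pairs where the policy shifts probability toward the chosen letter relative to the reference model. A minimal numeric sketch of the per-pair loss (the helper and the log-probabilities are illustrative, not part of the toolkit or TRL):

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp: float, policy_rejected_logp: float,
                     ref_chosen_logp: float, ref_rejected_logp: float,
                     beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp      # how much the policy boosts the chosen answer
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # how much it boosts the rejected one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1 / (1 + math.exp(-logits)))

# policy that favors the chosen letter over the reference -> loss below log(2)
print(dpo_sigmoid_loss(-0.5, -3.0, -1.0, -1.0))
# indifferent policy (equal ratios) -> loss = log(2), about 0.693
print(dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0))
```

Because completions here are single letters, the log-probabilities reduce to one token each, which keeps the preference signal clean; beta scales how sharply the loss penalizes disagreement with the preference.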
Running the benchmark¶
The benchmark compares three steering approaches across multiple model sizes:
- "baseline": Unsteered model
- "few_shot_sweep": FewShot with varying
k_positive(1, 5, 10, 25, 50) - "dpo_lora": DPO-trained LoRA adapter
We run the benchmark with num_trials=5 to capture statistical variability across generation runs (at the cost of slower execution).
all_profiles = {}
for model_name in MODELS:
short_name = model_name.split("/")[-1]
print(f"Running benchmark for {short_name}")
dpo_lora = create_dpo_control(model_name)
benchmark = Benchmark(
use_case=commonsense_mcqa,
base_model_name_or_path=model_name,
steering_pipelines={
"baseline": [],
"few_shot_sweep": [few_shot_spec],
"dpo_lora": [dpo_lora],
},
gen_kwargs={"max_new_tokens": 300, "do_sample": True, "temperature": 0.7},
device_map="auto",
num_trials=5,
save_dir=NOTEBOOK_DIR / f"profiles_{short_name}",
)
profiles = benchmark.run()
all_profiles[short_name] = profiles
benchmark.export(profiles, save_dir=f"./profiles/{short_name}/")
Running benchmark for Qwen2.5-0.5B-Instruct
Resumed from checkpoint: 40 run(s) across 3 pipeline(s). Running pipeline: baseline... done. Running pipeline: few_shot_sweep... Skipping config db660184; restored 5 run(s) from checkpoint. Skipping config 3fa35b3e; restored 5 run(s) from checkpoint. Skipping config 7d3665fe; restored 5 run(s) from checkpoint. Skipping config 453046c6; restored 5 run(s) from checkpoint. Skipping config 8c3d0483; restored 5 run(s) from checkpoint. Skipping config 4d3eafee; restored 5 run(s) from checkpoint. done. Running pipeline: dpo_lora... done. Running benchmark for Qwen2.5-1.5B-Instruct Resumed from checkpoint: 30 run(s) across 2 pipeline(s). Running pipeline: baseline... done. Running pipeline: few_shot_sweep... Skipping config db660184; restored 5 run(s) from checkpoint. Skipping config 3fa35b3e; restored 5 run(s) from checkpoint. Skipping config 7d3665fe; restored 5 run(s) from checkpoint. Skipping config 453046c6; restored 5 run(s) from checkpoint. Skipping config 8c3d0483; restored 5 run(s) from checkpoint. Running configuration 6...
Analysis¶
We now analyze the benchmark results across both models.
First, we flatten the nested profiles into a single DataFrame with one row per trial (using the toolkit's utility flatten_profiles), then aggregate across trials to get mean and standard deviation.
dfs = []
for model_name, profiles in all_profiles.items():
df = flatten_profiles(
profiles,
metric_accessors={
"accuracy": ("MCQAAccuracy", "question_mean"),
"positional_bias": ("MCQAPositionalBias", "mean"),
}
)
df["model"] = model_name
df["k_positive"] = get_param_values(df, "FewShot", "k_positive")
dfs.append(df)
runs_df = pd.concat(dfs, ignore_index=True)
runs_df[["model", "pipeline", "trial_id", "k_positive", "accuracy", "positional_bias"]]
| | model | pipeline | trial_id | k_positive | accuracy | positional_bias |
|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | 0 | NaN | 0.40 | 0.075600 |
| 1 | Qwen2.5-0.5B-Instruct | baseline | 1 | NaN | 0.46 | 0.073200 |
| 2 | Qwen2.5-0.5B-Instruct | baseline | 2 | NaN | 0.42 | 0.075200 |
| 3 | Qwen2.5-0.5B-Instruct | baseline | 3 | NaN | 0.42 | 0.074800 |
| 4 | Qwen2.5-0.5B-Instruct | baseline | 4 | NaN | 0.46 | 0.075200 |
| ... | ... | ... | ... | ... | ... | ... |
| 75 | Qwen2.5-1.5B-Instruct | dpo_lora | 0 | NaN | 0.86 | 0.020000 |
| 76 | Qwen2.5-1.5B-Instruct | dpo_lora | 1 | NaN | 0.80 | 0.055222 |
| 77 | Qwen2.5-1.5B-Instruct | dpo_lora | 2 | NaN | 0.88 | 0.016400 |
| 78 | Qwen2.5-1.5B-Instruct | dpo_lora | 3 | NaN | 0.88 | 0.012800 |
| 79 | Qwen2.5-1.5B-Instruct | dpo_lora | 4 | NaN | 0.84 | 0.026000 |
80 rows × 6 columns
Next, we summarize by configuration (aggregating across trials) then attach the corresponding k_positive value for each of the few-shot rows.
summary_df = summarize_by_config(
runs_df,
metric_cols=["accuracy", "positional_bias"],
group_cols=["model", "pipeline", "config_id"]
)
k_map = runs_df.groupby(["model", "pipeline", "config_id"])["k_positive"].first()
summary_df["k_positive"] = summary_df.apply(
lambda row: k_map.get((row["model"], row["pipeline"], row["config_id"]), np.nan), axis=1
)
summary_df[["model", "pipeline", "k_positive", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]].round(3)
| | model | pipeline | k_positive | n_trials | accuracy_mean | accuracy_std | positional_bias_mean | positional_bias_std |
|---|---|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | NaN | 5.0 | 0.432 | 0.027 | 0.075 | 0.001 |
| 1 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.384 | 0.043 | 0.128 | 0.001 |
| 2 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.428 | 0.011 | 0.117 | 0.007 |
| 3 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.468 | 0.023 | 0.113 | 0.003 |
| 4 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 25.0 | 5.0 | 0.520 | 0.028 | 0.093 | 0.009 |
| 5 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 50.0 | 5.0 | 0.528 | 0.050 | 0.087 | 0.003 |
| 6 | Qwen2.5-0.5B-Instruct | few_shot_sweep | 100.0 | 5.0 | 0.480 | 0.037 | 0.083 | 0.003 |
| 7 | Qwen2.5-0.5B-Instruct | dpo_lora | NaN | 5.0 | 0.624 | 0.017 | 0.093 | 0.001 |
| 8 | Qwen2.5-1.5B-Instruct | baseline | NaN | 5.0 | 0.760 | 0.037 | 0.015 | 0.004 |
| 9 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 1.0 | 5.0 | 0.736 | 0.026 | 0.025 | 0.007 |
| 10 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 5.0 | 5.0 | 0.764 | 0.026 | 0.023 | 0.006 |
| 11 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 10.0 | 5.0 | 0.768 | 0.023 | 0.023 | 0.007 |
| 12 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 25.0 | 5.0 | 0.788 | 0.023 | 0.026 | 0.003 |
| 13 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 50.0 | 5.0 | 0.784 | 0.033 | 0.029 | 0.002 |
| 14 | Qwen2.5-1.5B-Instruct | few_shot_sweep | 100.0 | 5.0 | 0.800 | 0.014 | 0.025 | 0.002 |
| 15 | Qwen2.5-1.5B-Instruct | dpo_lora | NaN | 5.0 | 0.852 | 0.033 | 0.026 | 0.017 |
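The aggregation performed by summarize_by_config can be approximated with plain pandas; the sketch below uses toy data and is only an approximation (the toolkit helper may carry extra bookkeeping such as config metadata):

```python
import pandas as pd

# toy per-trial rows mimicking runs_df (values are illustrative)
runs = pd.DataFrame({
    "model": ["m"] * 4,
    "pipeline": ["few_shot_sweep"] * 4,
    "config_id": ["a", "a", "b", "b"],
    "accuracy": [0.40, 0.44, 0.50, 0.54],
})

# group by configuration and aggregate across trials
summary = (
    runs.groupby(["model", "pipeline", "config_id"])["accuracy"]
        .agg(accuracy_mean="mean", accuracy_std="std", n_trials="count")
        .reset_index()
)
print(summary)
```

One row per (model, pipeline, config_id) triple, with the mean and standard deviation computed over the trials in that group, mirrors the summary table above.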
DPO vs. FewShot¶
We now examine how the DPO-LoRA control performs in comparison to FewShot, particularly as we scale the number of (positive) examples. Both the DPO-LoRA control and baseline (unsteered) pipelines are shown as horizontal reference lines passed in using the compare_to_pipelines argument.
few_shot_df = summary_df[summary_df["pipeline"] == "few_shot_sweep"].copy()
few_shot_df = few_shot_df.sort_values(["model", "k_positive"])
# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
ylim_accuracy = (max(0, all_accuracy.min() - 0.1), min(1, all_accuracy.max() + 0.1))
n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 4))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)
for idx, model_name in enumerate(MODELS):
short_name = model_name.split("/")[-1]
ax = fig.add_subplot(gs[0, idx])
# extract data under each pipeline
model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
# individual trial data (for scatter overlay)
model_trials = runs_df[(runs_df["model"] == short_name) & (runs_df["pipeline"] == "few_shot_sweep")]
plot_sensitivity(
swept=model_swept,
metric="accuracy",
sweep_col="k_positive",
per_trial_data=model_trials,
compare_to_pipelines=[
("baseline", model_baseline),
("DPO-LoRA", model_dpo),
],
ax=ax,
metric_label="accuracy",
sweep_label="k_positive",
title=short_name,
ylim=ylim_accuracy,
)
fig.savefig(FIGURE_DIR / "sensitivity_accuracy.png", bbox_inches="tight", dpi=150)
plt.show()
Fine-tuning (via DPO-LoRA) produces a clear jump in performance for both models. For the FewShot control, adding a single example actually degrades accuracy relative to the baseline. Increasing the number of examples generally improves performance, although accuracy declines beyond 50 examples for the 0.5B model (and appears to saturate for the 1.5B model).
Accuracy vs positional bias tradeoff¶
We now examine whether there is a tradeoff between accuracy and positional bias across methods. The FewShot configurations are colored by k_positive, with the baseline shown as a black X marker and DPO-LoRA as a red square. The Pareto frontier indicates configurations that are not dominated by any other.
# common axis limits
all_accuracy = runs_df["accuracy"].dropna()
all_bias = runs_df["positional_bias"].dropna()
xlim_tradeoff = (max(0, all_accuracy.min() - 0.05), min(1, all_accuracy.max() + 0.05))
ylim_tradeoff = (max(0, all_bias.min() - 0.02), all_bias.max() + 0.02)
n_models = len(MODELS)
fig = plt.figure(figsize=(5 * n_models, 5))
gs = gridspec.GridSpec(1, n_models, wspace=0.3)
for idx, model_name in enumerate(MODELS):
short_name = model_name.split("/")[-1]
ax = fig.add_subplot(gs[0, idx])
model_swept = few_shot_df[few_shot_df["model"] == short_name].copy()
model_baseline = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "baseline")]
model_dpo = summary_df[(summary_df["model"] == short_name) & (summary_df["pipeline"] == "dpo_lora")]
plot_tradeoff(
swept=model_swept,
x_metric="accuracy",
y_metric="positional_bias",
sweep_col="k_positive",
compare_to_pipelines=[
("baseline", model_baseline),
("DPO-LoRA", model_dpo),
],
ax=ax,
x_label="accuracy",
y_label="positional bias",
sweep_label="k_positive",
title=short_name,
show_pareto=True,
maximize_x=True,
maximize_y=False,
xlim=xlim_tradeoff,
ylim=ylim_tradeoff,
)
fig.savefig(FIGURE_DIR / "tradeoff.png", bbox_inches="tight", dpi=150)
plt.show()
Generally, few-shot steering with a small-to-moderate number of examples causes positional bias to jump even when accuracy improves. Interestingly, for the 0.5B model, as the number of examples increases (to 25-100), positional bias starts to fall while accuracy continues to improve. The DPO-trained model achieves the highest accuracy with a slightly higher positional bias than the best few-shot case (50 examples). The same pattern appears, though less pronounced, in the 1.5B model.
Summary table¶
The table below summarizes all configurations across methods and models.
method_order = ["baseline", "FewShot (k=1)", "FewShot (k=5)", "FewShot (k=10)", "FewShot (k=25)", "FewShot (k=50)", "FewShot (k=100)", "DPO-LoRA"]
summary_table = summary_df.copy()
summary_table["method"] = summary_table.apply(
lambda row: "baseline" if row["pipeline"] == "baseline"
else "DPO-LoRA" if row["pipeline"] == "dpo_lora"
else f"FewShot (k={int(row['k_positive'])})",
axis=1
)
model_order = [m.split("/")[-1] for m in MODELS]
summary_table["model_order"] = summary_table["model"].apply(lambda m: model_order.index(m) if m in model_order else len(model_order))
summary_table["method_order"] = summary_table["method"].apply(lambda m: method_order.index(m) if m in method_order else len(method_order))
display_df = summary_table.sort_values(["model_order", "method_order"])[
["model", "method", "n_trials", "accuracy_mean", "accuracy_std", "positional_bias_mean", "positional_bias_std"]
].copy()
display_df.columns = ["model", "method", "trials", "accuracy (mean)", "accuracy (std)", "pos bias (mean)", "pos bias (std)"]
display_df.style.format({
"accuracy (mean)": "{:.1%}",
"accuracy (std)": "{:.1%}",
"pos bias (mean)": "{:.3f}",
"pos bias (std)": "{:.3f}",
}).background_gradient(subset=["accuracy (mean)"], cmap="RdYlGn")
| | model | method | trials | accuracy (mean) | accuracy (std) | pos bias (mean) | pos bias (std) |
|---|---|---|---|---|---|---|---|
| 0 | Qwen2.5-0.5B-Instruct | baseline | 5.000000 | 43.2% | 2.7% | 0.075 | 0.001 |
| 1 | Qwen2.5-0.5B-Instruct | FewShot (k=1) | 5.000000 | 38.4% | 4.3% | 0.128 | 0.001 |
| 2 | Qwen2.5-0.5B-Instruct | FewShot (k=5) | 5.000000 | 42.8% | 1.1% | 0.117 | 0.007 |
| 3 | Qwen2.5-0.5B-Instruct | FewShot (k=10) | 5.000000 | 46.8% | 2.3% | 0.113 | 0.003 |
| 4 | Qwen2.5-0.5B-Instruct | FewShot (k=25) | 5.000000 | 52.0% | 2.8% | 0.093 | 0.009 |
| 5 | Qwen2.5-0.5B-Instruct | FewShot (k=50) | 5.000000 | 52.8% | 5.0% | 0.087 | 0.003 |
| 6 | Qwen2.5-0.5B-Instruct | FewShot (k=100) | 5.000000 | 48.0% | 3.7% | 0.083 | 0.003 |
| 7 | Qwen2.5-0.5B-Instruct | DPO-LoRA | 5.000000 | 62.4% | 1.7% | 0.093 | 0.001 |
| 8 | Qwen2.5-1.5B-Instruct | baseline | 5.000000 | 76.0% | 3.7% | 0.015 | 0.004 |
| 9 | Qwen2.5-1.5B-Instruct | FewShot (k=1) | 5.000000 | 73.6% | 2.6% | 0.025 | 0.007 |
| 10 | Qwen2.5-1.5B-Instruct | FewShot (k=5) | 5.000000 | 76.4% | 2.6% | 0.023 | 0.006 |
| 11 | Qwen2.5-1.5B-Instruct | FewShot (k=10) | 5.000000 | 76.8% | 2.3% | 0.023 | 0.007 |
| 12 | Qwen2.5-1.5B-Instruct | FewShot (k=25) | 5.000000 | 78.8% | 2.3% | 0.026 | 0.003 |
| 13 | Qwen2.5-1.5B-Instruct | FewShot (k=50) | 5.000000 | 78.4% | 3.3% | 0.029 | 0.002 |
| 14 | Qwen2.5-1.5B-Instruct | FewShot (k=100) | 5.000000 | 80.0% | 1.4% | 0.025 | 0.002 |
| 15 | Qwen2.5-1.5B-Instruct | DPO-LoRA | 5.000000 | 85.2% | 3.3% | 0.026 | 0.017 |
Takeaways¶
This notebook compared few-shot prompting against a DPO-trained LoRA adapter on a commonsense MCQA task. Under the models studied (Qwen/Qwen2.5-0.5B-Instruct and Qwen/Qwen2.5-1.5B-Instruct), fine-tuning outperforms FewShot for both models. A single in-context example degrades performance relative to the baseline (in both models). Positional bias increases under a small-to-moderate number of examples but falls as examples increase further (25-100). Accuracy gains under few-shot prompting appear to saturate, or even degrade, as the number of examples grows, and generally reach roughly half the gains of the DPO-trained models.