Running TRL methods¶
The toolkit implements a number of TRL methods via a StructuralControl wrapper. This guide shows how to run the following methods:
- SFT (supervised fine-tuning)
- DPO (direct preference optimization)
- APO (anchored preference optimization)
- SPPO (self-play preference optimization)
Note that while SPPO is not part of TRL, it follows many of the same abstractions, so we include it as part of our TRL wrapper.
Setup¶
If you are running this from a Google Colab notebook, uncomment the following cell to install the toolkit. This step is not necessary if you are running the notebook from a virtual environment where the package has already been installed.
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360
The following authentication steps may be necessary to access gated models (after access has been granted on Hugging Face). Uncomment the cell below if you need to log in to the Hugging Face Hub:
# !pip install python-dotenv
# !pip install ipywidgets
# from dotenv import load_dotenv
# import os
# load_dotenv()
# token = os.getenv("HUGGINGFACE_TOKEN")
# from huggingface_hub import login
# login(token=token)
Next, we import the SteeringPipeline class (used throughout) and specify the base model, in this case a small Qwen model.
import torch
from datasets import load_dataset
from peft import PeftType
from transformers import AutoTokenizer
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
Using device: cuda
Data preparation¶
The controls throughout this notebook are trained using a common dataset, ultrafeedback_binarized, since it contains preference data for each prompt (which is necessary for DPO-based controls). We load each of the splits below.
raw_train = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
raw_test = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")
len(raw_train), raw_train[0].keys()
(61135, dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected']))
Different trainers expect different data formats (tokenized tensors for SFT, raw text triples for DPO), so we define two helper functions, one for SFT and one for DPO, to process the data into the form each trainer expects.
def sft_preprocess(example, tokenizer, max_length=1024):
    text = f"Question: {example['prompt']}\n\nAnswer: {example['chosen']}"
    encoding = tokenizer(text, truncation=True, padding="max_length", max_length=max_length)
    labels = [
        token_id if mask == 1 else -100  # label pads as -100 so they don't contribute to the loss
        for token_id, mask in zip(encoding["input_ids"], encoding["attention_mask"])
    ]
    encoding["labels"] = labels
    return encoding
def dpo_filter(example, max_prompt_chars=4000):
    prompt = example["prompt"]
    if len(prompt) > max_prompt_chars:
        prompt = prompt[:max_prompt_chars]
    return {"prompt": prompt, "chosen": example["chosen"], "rejected": example["rejected"]}
subset_size = 500
sft_train = raw_train.select(range(subset_size)).map(
lambda example: sft_preprocess(example, tokenizer, max_length=1024),
remove_columns=raw_train.column_names
)
dpo_train = raw_train.select(range(subset_size)).map(dpo_filter, remove_columns=[])
dpo_train[0].keys()
dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'])
SFT control¶
We now show how to fine-tune with SFT using LoRA. We also merge the trained adapter back into the model (via the merge_lora_after_train argument). Note the use_peft=True argument, which indicates that we are not running a full fine-tune (the example near the end of this notebook illustrates a full fine-tuning run).
from aisteer360.algorithms.structural_control.wrappers.trl.sfttrainer.control import SFT
sft = SFT(
# data
train_dataset=sft_train,
eval_dataset=None,
# data_collator=None,  # optional; can be omitted since the dataset already provides labels
# TRL / Trainer config (forwarded into SFTConfig)
output_dir="./tmp/sft_lora",
max_seq_length=1024,
per_device_train_batch_size=4,
num_train_epochs=1,
learning_rate=2e-5,
logging_steps=50,
report_to="none",
seed=42,
# PEFT (LoRA)
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"],
adapter_name="sft",
# optionally merge LoRA into base weights after training
merge_lora_after_train=True,
merged_output_dir="./tmp/sft_lora_merged",
)
We create a steering pipeline using the above control. (When a structural control such as sft builds and returns its own model, the pipeline can be constructed with lazy_init=True to avoid loading the base weights twice.) Steering the pipeline invokes the training procedure.
sft_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
device_map=None,
hf_model_kwargs={"trust_remote_code": True},
controls=[sft],
)
sft_pipeline.steer()
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 50 | 2.533000 |
| 100 | 1.106900 |
The above SFT-trained pipeline is now ready for inference.
prompt_text = "Question: What makes the sky look blue?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
text = sft_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=64
)
print(text)
[' The sky looks blue because of the scattering of light by tiny dust particles in the atmosphere. These particles are small and light, so they scatter the light that hits them, causing it to bend around them and spread out into a colorless, milky cloud-like appearance known as the "blue" part of the sky.']
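Because merge_lora_after_train=True was set, the merged weights can also be loaded as an ordinary Hugging Face checkpoint, independent of the pipeline. A minimal sketch, assuming the merged model was saved to the merged_output_dir configured above:
from transformers import AutoModelForCausalLM

merged_model = AutoModelForCausalLM.from_pretrained(
    "./tmp/sft_lora_merged",  # merged_output_dir from the SFT control above
    trust_remote_code=True,
)
enc = tokenizer("Question: What makes the sky look blue?\n\nAnswer:", return_tensors="pt")
out = merged_model.generate(**enc, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))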
DPO control¶
DPO is instantiated in a similar fashion. The primary differences are that the training data now consists of triples (prompt, chosen, rejected), the trainer must keep a frozen reference policy alongside the trainable policy, and the loss is a pairwise, KL-regularized contrastive objective rather than SFT's token-level cross-entropy loss.
Note: By default, the trainer clones the base weights and freezes them. When LoRA is enabled, the wrapper automatically passes ref_model=None, letting TRL re-create a frozen reference that shares the same LoRA adapters. If you are running a full fine-tune, you can still supply your own ref_model via pipeline.steer(ref_model=my_frozen_model).
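For intuition, the baseline sigmoid DPO objective can be sketched on per-sequence log-probabilities as follows (a standalone illustration of the loss, not the wrapper's internal code):
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log-ratios of policy vs. reference for the chosen and rejected responses
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # reward preferring chosen over rejected more strongly than the reference does
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()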
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
dpo = DPO(
train_dataset=dpo_train,
# DPO / TRL config (forwarded into DPOConfig)
output_dir="./tmp/dpo_lora",
per_device_train_batch_size=2, # often smaller than SFT
num_train_epochs=1,
learning_rate=1e-6,
beta=0.1,
loss_type="sigmoid", # baseline DPO loss
max_prompt_length=512,
max_length=1024,
precompute_ref_log_probs=True, # forwarded if supported by your TRL version
disable_dropout=True,
logging_steps=50,
report_to="none",
seed=123,
# LoRA
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
adapter_name="dpo",
merge_lora_after_train=False,
)
As before, we create the pipeline using the control, steer the pipeline, and run inference on the steered pipeline.
dpo_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[dpo]
)
dpo_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [01:55<00:00, 2.16it/s]
| Step | Training Loss |
|---|---|
| 50 | 1.690400 |
| 100 | 1.618300 |
| 150 | 1.799400 |
| 200 | 1.499900 |
| 250 | 1.037700 |
prompt_text = "Question: Is it ever helpful to be blunt with feedback?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(dpo_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=150,
))
[' Yes, it is always helpful to be blunt with feedback. Blunt feedback can help you identify areas of improvement and provide a clear path for change. It also helps to build trust between the person being evaluated and the person giving the feedback.\n\nFor example, if someone gives you feedback that says "You need to improve your writing skills," you could respond by saying "I agree, but I think we should focus on improving our research methods instead." This response provides constructive criticism without sounding accusatory or dismissive.\n\nBlunt feedback can also help to motivate people to take action towards their goals. If someone gives you feedback that says "You need to work harder on this project," you could say "Thank you for your input, but I think we can']
APO control¶
APO lives in the same trainer family as DPO and uses the same DPOTrainer class; it is activated simply by choosing a different loss_type. In contrast to DPO, which pushes the policy away from the reference by a relative, KL-scaled margin, APO pushes the policy toward a fixed "anchor" score. For the same beta, APO generally keeps the policy closer to the reference, reducing the risk of over-optimization.
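For comparison with the DPO loss sketched earlier, the apo_zero variant can be sketched roughly as follows (our reading of the anchored objective; TRL's implementation may differ in details):
import torch

def apo_zero_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # anchor at the reference: push the chosen log-ratio above zero ...
    losses_chosen = 1 - torch.sigmoid(beta * chosen_logratios)
    # ... and push the rejected log-ratio below zero, independently of the chosen one
    losses_rejected = torch.sigmoid(beta * rejected_logratios)
    return (losses_chosen + losses_rejected).mean()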
from aisteer360.algorithms.structural_control.wrappers.trl.apotrainer.control import APO
apo = APO(
# data
train_dataset=dpo_train,
# APO / TRL config
output_dir="./tmp/apo_lora",
per_device_train_batch_size=2,
num_train_epochs=1,
learning_rate=1e-6,
beta=0.1,
loss_type="apo_zero", # APO-specific loss
max_prompt_length=512,
max_length=1024,
logging_steps=50,
report_to="none",
seed=99,
# LoRA
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
adapter_name="apo",
merge_lora_after_train=False,
)
Steering and inference proceed as before.
apo_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[apo]
)
apo_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [01:54<00:00, 2.19it/s]
| Step | Training Loss |
|---|---|
| 50 | 1.002600 |
| 100 | 0.997600 |
| 150 | 0.997200 |
| 200 | 0.986300 |
| 250 | 1.004500 |
prompt_text = "Question: Explain why kindness can be strategic.\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(apo_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=64,
))
[' Kindness is a powerful tool that can be used strategically in various situations. It allows us to connect with others, build trust and relationships, and promote positive change. By being kind, we can create a positive impact on the world and help others in need. Additionally, kindness can be used as a way to set an']
SPPO control¶
SPPO, or self-play preference optimization, can be thought of as extending the offline DPO setting into an on-policy, self-improving loop. Training starts from a prompt corpus only (no human-written answers are required). During training, the policy generates two candidate answers itself; a preference model (or a heuristic judge) then ranks the two self-generated candidates, and the resulting chosen/rejected pair is fed through a DPO-style loss.
Because the answers are sampled from the current policy, the optimization is on-policy: the model produces new pairs every few steps, so it continuously trains on its own mistakes. A reference model is still necessary to stabilize learning.
SPPO is implemented via SPPOTrainer and uses the same DPOTrainerMixin.
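Schematically, one SPPO iteration looks roughly like the following (a pseudocode-style sketch; generate_fn, rank_fn, and update_fn are placeholder callables, not the wrapper's actual internals):
def sppo_iteration(prompts, generate_fn, rank_fn, update_fn, num_candidates=2):
    pairs = []
    for prompt in prompts:
        # 1. the current policy samples its own candidate answers
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        # 2. a preference model (e.g., PairRM) ranks the candidates
        ranked = rank_fn(prompt, candidates)
        # 3. the best and worst candidates become the chosen/rejected pair
        pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
    # 4. a DPO-style update of the policy against a frozen reference
    return update_fn(pairs)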
import sys
!{sys.executable} -m ensurepip --upgrade
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install llm-blender
from aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.control import SPPO
subset = raw_train.select(range(200)).map(lambda ex: {"prompt": ex["prompt"]}, remove_columns=raw_train.column_names)
sppo = SPPO(
# data
train_dataset=subset,
# SPPO params
start_iteration=1,
end_iteration=5,
max_input_length=1024,
num_prompts=5,
temp_dir="./tmp/sppo_temp",
gen_max_new_tokens=32, #100,
ranking_batch_size=8,
limit_num_examples=20, #50,
# TRL/DPO-compatible params
output_dir="./tmp/sppo_final",
per_device_train_batch_size=2,
num_train_epochs=3,
learning_rate=5e-6,
beta=0.001,
loss_type="sppo",
max_prompt_length=512,
max_length=1024,
logging_steps=50,
report_to="none",
seed=123,
)
We can now construct a steering pipeline, steer it (which runs the configured SPPO iterations, saving per-iteration checkpoints and the final model), and run inference on the steered pipeline.
pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[sppo]
)
pipeline.steer()
| Step | Training Loss |
|---|---|
| 1 | 132836.983100 |
| Step | Training Loss |
|---|---|
| 1 | 32124.715300 |
| Step | Training Loss |
|---|---|
| 1 | 117638.039400 |
| Step | Training Loss |
|---|---|
| 1 | 127814.499100 |
| Step | Training Loss |
|---|---|
| 1 | 29349.257000 |
prompt = "Write a short, constructive response to: 'My neighbor is noisy.'"
enc = tokenizer(prompt, return_tensors="pt")
print(pipeline.generate_text(
input_ids=enc["input_ids"],
attention_mask=enc["attention_mask"],
max_new_tokens=64,
))
[" What should I do? Responding to someone else's noise, especially in a quiet corner, can be very frustrating and draining. It's important to communicate your concerns to the appropriate authority, whether that be a doctor, lawyer, or counselor. Additionally, try to find a solution that works for you, even if it"]
Full-parameter SFT¶
Lastly, to run a full-weight fine-tune, set use_peft=False, drop the LoRA arguments, and (usually) shrink the batch size, since every parameter now receives gradients.
Note: Full fine-tuning can be 10-20 times more memory-intensive than LoRA.
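A rough back-of-the-envelope check (assuming fp32 Adam, i.e. roughly 16 bytes per trainable parameter for weights, gradients, and the two optimizer moments; activations and the frozen base weights are ignored, and the LoRA parameter count below is illustrative):
def optimizer_state_gb(num_trainable_params, bytes_per_param=16):
    # ~16 bytes/param: fp32 weights + gradients + two Adam moments
    return num_trainable_params * bytes_per_param / 1024**3

print(f"full fine-tune (~0.5B trainable params): ~{optimizer_state_gb(5e8):.1f} GB")
print(f"LoRA adapters (~1M trainable params, illustrative): ~{optimizer_state_gb(1e6):.3f} GB")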
full_sft = SFT(
train_dataset=sft_train,
use_peft=False, # full FT
output_dir="./tmp/sft_full",
per_device_train_batch_size=1,
num_train_epochs=1,
learning_rate=5e-6,
report_to="none",
seed=7,
)
full_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[full_sft]
)
full_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 10 | 1.668000 |
| 20 | 0.812600 |
| 30 | 0.339700 |
| 40 | 0.810500 |
| 50 | 0.347900 |
| 60 | 0.648100 |
| 70 | 0.666800 |
| 80 | 0.634900 |
| 90 | 0.636100 |
| 100 | 0.824400 |
| 110 | 0.595500 |
| 120 | 0.429500 |
| 130 | 0.518600 |
| 140 | 0.790800 |
| 150 | 0.726600 |
| 160 | 0.469200 |
| 170 | 0.773000 |
| 180 | 0.777700 |
| 190 | 1.006300 |
| 200 | 0.559700 |
| 210 | 0.429800 |
| 220 | 0.703600 |
| 230 | 0.680900 |
| 240 | 0.583600 |
| 250 | 0.418600 |
| 260 | 0.673700 |
| 270 | 0.603700 |
| 280 | 0.787700 |
| 290 | 0.713500 |
| 300 | 0.456800 |
| 310 | 0.805500 |
| 320 | 0.528500 |
| 330 | 0.515000 |
| 340 | 0.745700 |
| 350 | 0.649700 |
| 360 | 0.711400 |
| 370 | 0.620400 |
| 380 | 0.855500 |
| 390 | 0.549000 |
| 400 | 0.692200 |
| 410 | 0.395900 |
| 420 | 0.690000 |
| 430 | 0.464800 |
| 440 | 0.458200 |
| 450 | 0.925600 |
| 460 | 0.555400 |
| 470 | 0.673500 |
| 480 | 0.688000 |
| 490 | 0.711300 |
| 500 | 0.705200 |
The wrapper also provides functionality for resuming interrupted training (via TRL's resume_from_checkpoint) by providing either the full checkpoint directory path or the name of a checkpoint inside output_dir.
resume_sft = SFT(
train_dataset=sft_train,
output_dir="./tmp/sft_lora",
resume_from_checkpoint="./tmp/sft_lora/checkpoint-1000",
use_peft=True,
adapter_name="sft",
report_to="none",
)
resume_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[resume_sft]
)
resume_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 10 | 3.607000 |
| 20 | 1.869800 |
| 30 | 1.269600 |
| 40 | 1.049900 |
| 50 | 0.993700 |
| 60 | 0.854200 |
| 70 | 0.904400 |
| 80 | 0.804900 |
| 90 | 0.837500 |
| 100 | 0.963000 |
| 110 | 0.653100 |
| 120 | 0.640800 |
| 130 | 0.707400 |
| 140 | 0.730500 |
| 150 | 0.763000 |
| 160 | 0.685400 |
| 170 | 0.814300 |
| 180 | 0.710300 |