Running TRL methods¶
The toolkit implements a number of TRL methods via a StructuralControl wrapper. This guide shows how to run the following methods:
- SFT (supervised fine-tuning)
- DPO (direct preference optimization)
- APO (anchored preference optimization)
- SPPO (self-play preference optimization)
Note that while SPPO is not part of TRL, it follows many of the same abstractions, so we include it as part of our TRL wrapper.
Setup¶
If you are running this from a Google Colab notebook, uncomment the following cell to install the toolkit. This step is not necessary if you are running the notebook from a virtual environment where the package has already been installed.
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360
The following authentication steps may be necessary to access gated models (after access has been granted on Hugging Face). Uncomment the cell below if you need to log in to the Hugging Face Hub:
# !pip install python-dotenv
# !pip install ipywidgets
# from dotenv import load_dotenv
# import os
# load_dotenv()
# token = os.getenv("HUGGINGFACE_TOKEN")
# from huggingface_hub import login
# login(token=token)
Next, we import the SteeringPipeline class (used throughout) and specify the base model, in this case a small Qwen model.
import torch
from datasets import load_dataset
from peft import PeftType
from transformers import AutoTokenizer
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
Using device: cuda
Data preparation¶
The controls throughout this notebook are trained using a common dataset, ultrafeedback_binarized, since it contains preference data for each prompt (which is necessary for DPO-based controls). We load each of the splits below.
raw_train = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
raw_test = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="test_prefs")
len(raw_train), raw_train[0].keys()
(61135, dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected']))
Different trainers expect different data formats (tokenized tensors for SFT, raw text triples for DPO), so we define two helper functions, one for SFT and one for DPO, to process the data into the form each trainer expects.
def sft_preprocess(example, tokenizer, max_length=1024):
    text = f"Question: {example['prompt']}\n\nAnswer: {example['chosen']}"
    encoding = tokenizer(text, truncation=True, padding="max_length", max_length=max_length)
    labels = [
        token_id if mask == 1 else -100  # label pads as -100 so they don't contribute to the loss
        for token_id, mask in zip(encoding["input_ids"], encoding["attention_mask"])
    ]
    encoding["labels"] = labels
    return encoding
def dpo_filter(example, max_prompt_chars=4000):
    prompt = example["prompt"]
    if len(prompt) > max_prompt_chars:
        prompt = prompt[:max_prompt_chars]
    return {"prompt": prompt, "chosen": example["chosen"], "rejected": example["rejected"]}
subset_size = 500
sft_train = raw_train.select(range(subset_size)).map(
lambda example: sft_preprocess(example, tokenizer, max_length=1024),
remove_columns=raw_train.column_names
)
dpo_train = raw_train.select(range(subset_size)).map(dpo_filter, remove_columns=[])
dpo_train[0].keys()
dict_keys(['prompt', 'prompt_id', 'chosen', 'rejected', 'messages', 'score_chosen', 'score_rejected'])
SFT control¶
We now show how to fine-tune with SFT using LoRA. We also merge the trained adapter back into the model (via the merge_lora_after_train argument). Note the use_peft=True argument, which indicates that we are not running a full fine-tune (the example near the end of this notebook illustrates a full fine-tuning run).
from aisteer360.algorithms.structural_control.wrappers.trl.sfttrainer.control import SFT
sft = SFT(
# data
train_dataset=sft_train,
eval_dataset=None,
# data_collator=None,  # optional; can be omitted since the dataset already provides labels
# TRL / Trainer config (forwarded into SFTConfig)
output_dir="./tmp/sft_lora",
max_seq_length=1024,
per_device_train_batch_size=4,
num_train_epochs=1,
learning_rate=2e-5,
logging_steps=50,
report_to="none",
seed=42,
# PEFT (LoRA)
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj"],
adapter_name="sft",
# optionally merge LoRA into base weights after training
merge_lora_after_train=True,
merged_output_dir="./tmp/sft_lora_merged",
)
We create a steering pipeline using the above control. (When a structural control such as sft builds and returns its own model, the pipeline can be constructed with lazy_init=True to avoid loading the base weights twice.) Steering the pipeline invokes the training procedure.
sft_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
device_map=None,
hf_model_kwargs={"trust_remote_code": True},
controls=[sft],
)
sft_pipeline.steer()
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 50 | 2.533000 |
| 100 | 1.106900 |
The above SFT-trained pipeline is now ready for inference.
prompt_text = "Question: What makes the sky look blue?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
text = sft_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=64
)
print(text)
[' The sky looks blue because of the scattering of light by tiny dust particles in the atmosphere. These particles are small and light, so they scatter the light that hits them, causing it to bend around them and spread out into a colorless, milky cloud-like appearance known as the "blue" part of the sky.']
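Because merge_lora_after_train=True was set, the merged weights can also be loaded as an ordinary Hugging Face checkpoint, independent of the pipeline. A minimal sketch, assuming the merged model was saved to the merged_output_dir configured above:
from transformers import AutoModelForCausalLM

merged_model = AutoModelForCausalLM.from_pretrained(
    "./tmp/sft_lora_merged",  # merged_output_dir from the SFT control above
    trust_remote_code=True,
)
enc = tokenizer("Question: What makes the sky look blue?\n\nAnswer:", return_tensors="pt")
out = merged_model.generate(**enc, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))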
DPO control¶
DPO is instantiated in a similar fashion. The primary differences are that the training data now consists of triples (prompt, chosen, rejected), the trainer must keep a frozen reference policy alongside the trainable policy, and the loss is a pairwise, KL-regularized contrastive objective rather than SFT's token-level cross-entropy loss.
Note: By default, the trainer clones the base weights and freezes them. When LoRA is enabled, the wrapper automatically passes ref_model=None, letting TRL re-create a frozen reference that shares the same LoRA adapters. If you are running a full fine-tune, you can still supply your own ref_model via pipeline.steer(ref_model=my_frozen_model).
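For intuition, the baseline sigmoid DPO objective can be sketched on per-sequence log-probabilities as follows (a standalone illustration of the loss, not the wrapper's internal code):
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # log-ratios of policy vs. reference for the chosen and rejected responses
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # reward preferring chosen over rejected more strongly than the reference does
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()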
from aisteer360.algorithms.structural_control.wrappers.trl.dpotrainer.control import DPO
dpo = DPO(
train_dataset=dpo_train,
# DPO / TRL config (forwarded into DPOConfig)
output_dir="./tmp/dpo_lora",
per_device_train_batch_size=2, # often smaller than SFT
num_train_epochs=1,
learning_rate=1e-6,
beta=0.1,
loss_type="sigmoid", # baseline DPO loss
max_prompt_length=512,
max_length=1024,
precompute_ref_log_probs=True, # forwarded if supported by your TRL version
disable_dropout=True,
logging_steps=50,
report_to="none",
seed=123,
# LoRA
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
adapter_name="dpo",
merge_lora_after_train=False,
)
As before, we create the pipeline using the control, steer the pipeline, and run inference on the steered pipeline.
dpo_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[dpo]
)
dpo_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [01:55<00:00, 2.16it/s]
| Step | Training Loss |
|---|---|
| 50 | 1.690400 |
| 100 | 1.618300 |
| 150 | 1.799400 |
| 200 | 1.499900 |
| 250 | 1.037700 |
prompt_text = "Question: Is it ever helpful to be blunt with feedback?\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(dpo_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=150,
))
[' Yes, it is always helpful to be blunt with feedback. Blunt feedback can help you identify areas of improvement and provide a clear path for change. It also helps to build trust between the person being evaluated and the person giving the feedback.\n\nFor example, if someone gives you feedback that says "You need to improve your writing skills," you could respond by saying "I agree, but I think we should focus on improving our research methods instead." This response provides constructive criticism without sounding accusatory or dismissive.\n\nBlunt feedback can also help to motivate people to take action towards their goals. If someone gives you feedback that says "You need to work harder on this project," you could say "Thank you for your input, but I think we can']
APO control¶
APO lives in the same trainer family as DPO and uses the same DPOTrainer class; it is activated simply by choosing a different loss_type. In contrast to DPO, which pushes the policy away from the reference by a relative, KL-scaled margin, APO pushes the policy toward a fixed "anchor" score. For the same beta, APO generally keeps the policy closer to the reference, reducing the risk of over-optimization.
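For comparison with the DPO loss sketched earlier, the apo_zero variant can be sketched roughly as follows (our reading of the anchored objective; TRL's implementation may differ in details):
import torch

def apo_zero_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # anchor at the reference: push the chosen log-ratio above zero ...
    losses_chosen = 1 - torch.sigmoid(beta * chosen_logratios)
    # ... and push the rejected log-ratio below zero, independently of the chosen one
    losses_rejected = torch.sigmoid(beta * rejected_logratios)
    return (losses_chosen + losses_rejected).mean()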
from aisteer360.algorithms.structural_control.wrappers.trl.apotrainer.control import APO
apo = APO(
# data
train_dataset=dpo_train,
# APO / TRL config
output_dir="./tmp/apo_lora",
per_device_train_batch_size=2,
num_train_epochs=1,
learning_rate=1e-6,
beta=0.1,
loss_type="apo_zero", # APO-specific loss
max_prompt_length=512,
max_length=1024,
logging_steps=50,
report_to="none",
seed=99,
# LoRA
use_peft=True,
peft_type=PeftType.LORA,
r=16,
lora_alpha=16,
target_modules=["q_proj", "v_proj"],
adapter_name="apo",
merge_lora_after_train=False,
)
Steering and inference proceed as before.
apo_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[apo]
)
apo_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
Train dataset reference log probs: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 250/250 [01:54<00:00, 2.19it/s]
| Step | Training Loss |
|---|---|
| 50 | 1.002600 |
| 100 | 0.997600 |
| 150 | 0.997200 |
| 200 | 0.986300 |
| 250 | 1.004500 |
prompt_text = "Question: Explain why kindness can be strategic.\n\nAnswer:"
encoded = tokenizer(prompt_text, return_tensors="pt")
print(apo_pipeline.generate_text(
input_ids=encoded["input_ids"],
attention_mask=encoded["attention_mask"],
max_new_tokens=64,
))
[' Kindness is a powerful tool that can be used strategically in various situations. It allows us to connect with others, build trust and relationships, and promote positive change. By being kind, we can create a positive impact on the world and help others in need. Additionally, kindness can be used as a way to set an']
SPPO control¶
SPPO, or self-play preference optimization, can be thought of as extending the offline DPO setting into an on-policy, self-improving loop. Training starts from a prompt corpus only (no human-written answers are required). During training, the policy generates two candidate answers itself; a preference model (or a heuristic judge) then ranks the two self-generated candidates, and the resulting chosen/rejected pair is fed through a DPO-style loss.
Because the answers are sampled from the current policy, the optimization is on-policy: the model produces new pairs every few steps, so it continuously trains on its own mistakes. A reference model is still necessary to stabilize learning.
SPPO is implemented via SPPOTrainer and uses the same DPOTrainerMixin.
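Schematically, one SPPO iteration looks roughly like the following (a pseudocode-style sketch; generate_fn, rank_fn, and update_fn are placeholder callables, not the wrapper's actual internals):
def sppo_iteration(prompts, generate_fn, rank_fn, update_fn, num_candidates=2):
    pairs = []
    for prompt in prompts:
        # 1. the current policy samples its own candidate answers
        candidates = [generate_fn(prompt) for _ in range(num_candidates)]
        # 2. a preference model (e.g., PairRM) ranks the candidates
        ranked = rank_fn(prompt, candidates)
        # 3. the best and worst candidates become the chosen/rejected pair
        pairs.append({"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]})
    # 4. a DPO-style update of the policy against a frozen reference
    return update_fn(pairs)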
import sys
!{sys.executable} -m ensurepip --upgrade
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install llm-blender
from aisteer360.algorithms.structural_control.wrappers.trl.sppotrainer.control import SPPO
subset = raw_train.select(range(200)).map(lambda ex: {"prompt": ex["prompt"]}, remove_columns=raw_train.column_names)
sppo = SPPO(
# data
train_dataset=subset,
# SPPO params
start_iteration=1,
end_iteration=5,
max_input_length=1024,
num_prompts=5,
temp_dir="./tmp/sppo_temp",
gen_max_new_tokens=32, #100,
ranking_batch_size=8,
limit_num_examples=20, #50,
# TRL/DPO-compatible params
output_dir="./tmp/sppo_final",
per_device_train_batch_size=2,
num_train_epochs=3,
learning_rate=5e-6,
beta=0.001,
loss_type="sppo",
max_prompt_length=512,
max_length=1024,
logging_steps=50,
report_to="none",
seed=123,
)
We can now construct a steering pipeline, steer it (which runs the configured SPPO iterations, saving per-iteration checkpoints and the final model), and run inference on the steered pipeline.
pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[sppo]
)
pipeline.steer()
| Step | Training Loss |
|---|---|
| 1 | 132836.983100 |
| Step | Training Loss |
|---|---|
| 1 | 32124.715300 |
| Step | Training Loss |
|---|---|
| 1 | 117638.039400 |
| Step | Training Loss |
|---|---|
| 1 | 127814.499100 |
| Step | Training Loss |
|---|---|
| 1 | 29349.257000 |
prompt = "Write a short, constructive response to: 'My neighbor is noisy.'"
enc = tokenizer(prompt, return_tensors="pt")
print(pipeline.generate_text(
input_ids=enc["input_ids"],
attention_mask=enc["attention_mask"],
max_new_tokens=64,
))
[" What should I do? Responding to someone else's noise, especially in a quiet corner, can be very frustrating and draining. It's important to communicate your concerns to the appropriate authority, whether that be a doctor, lawyer, or counselor. Additionally, try to find a solution that works for you, even if it"]
Full-parameter SFT¶
Lastly, to run a full-weight fine-tune, set use_peft=False, drop the LoRA arguments, and (usually) shrink the batch size, since every parameter now receives gradients.
Note: Full fine-tuning can be 10-20 times more memory-intensive than LoRA.
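A rough back-of-the-envelope check (assuming fp32 Adam, i.e. roughly 16 bytes per trainable parameter for weights, gradients, and the two optimizer moments; activations and the frozen base weights are ignored, and the LoRA parameter count below is illustrative):
def optimizer_state_gb(num_trainable_params, bytes_per_param=16):
    # ~16 bytes/param: fp32 weights + gradients + two Adam moments
    return num_trainable_params * bytes_per_param / 1024**3

print(f"full fine-tune (~0.5B trainable params): ~{optimizer_state_gb(5e8):.1f} GB")
print(f"LoRA adapters (~1M trainable params, illustrative): ~{optimizer_state_gb(1e6):.3f} GB")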
full_sft = SFT(
train_dataset=sft_train,
use_peft=False, # full FT
output_dir="./tmp/sft_full",
per_device_train_batch_size=1,
num_train_epochs=1,
learning_rate=5e-6,
report_to="none",
seed=7,
)
full_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[full_sft]
)
full_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 10 | 1.668000 |
| 20 | 0.812600 |
| 30 | 0.339700 |
| 40 | 0.810500 |
| 50 | 0.347900 |
| 60 | 0.648100 |
| 70 | 0.666800 |
| 80 | 0.634900 |
| 90 | 0.636100 |
| 100 | 0.824400 |
| 110 | 0.595500 |
| 120 | 0.429500 |
| 130 | 0.518600 |
| 140 | 0.790800 |
| 150 | 0.726600 |
| 160 | 0.469200 |
| 170 | 0.773000 |
| 180 | 0.777700 |
| 190 | 1.006300 |
| 200 | 0.559700 |
| 210 | 0.429800 |
| 220 | 0.703600 |
| 230 | 0.680900 |
| 240 | 0.583600 |
| 250 | 0.418600 |
| 260 | 0.673700 |
| 270 | 0.603700 |
| 280 | 0.787700 |
| 290 | 0.713500 |
| 300 | 0.456800 |
| 310 | 0.805500 |
| 320 | 0.528500 |
| 330 | 0.515000 |
| 340 | 0.745700 |
| 350 | 0.649700 |
| 360 | 0.711400 |
| 370 | 0.620400 |
| 380 | 0.855500 |
| 390 | 0.549000 |
| 400 | 0.692200 |
| 410 | 0.395900 |
| 420 | 0.690000 |
| 430 | 0.464800 |
| 440 | 0.458200 |
| 450 | 0.925600 |
| 460 | 0.555400 |
| 470 | 0.673500 |
| 480 | 0.688000 |
| 490 | 0.711300 |
| 500 | 0.705200 |
The wrapper also provides functionality for resuming interrupted training (via TRL's resume_from_checkpoint) by providing either the full checkpoint directory path or the name of a checkpoint inside output_dir.
resume_sft = SFT(
train_dataset=sft_train,
output_dir="./tmp/sft_lora",
resume_from_checkpoint="./tmp/sft_lora/checkpoint-1000",
use_peft=True,
adapter_name="sft",
report_to="none",
)
resume_pipeline = SteeringPipeline(
model_name_or_path=MODEL_NAME,
hf_model_kwargs={"trust_remote_code": True},
controls=[resume_sft]
)
resume_pipeline.steer()
The model is already on multiple devices. Skipping the move to device specified in `args`.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': None, 'pad_token_id': 151643}.
| Step | Training Loss |
|---|---|
| 10 | 3.607000 |
| 20 | 1.869800 |
| 30 | 1.269600 |
| 40 | 1.049900 |
| 50 | 0.993700 |
| 60 | 0.854200 |
| 70 | 0.904400 |
| 80 | 0.804900 |
| 90 | 0.837500 |
| 100 | 0.963000 |
| 110 | 0.653100 |
| 120 | 0.640800 |
| 130 | 0.707400 |
| 140 | 0.730500 |
| 150 | 0.763000 |
| 160 | 0.685400 |
| 170 | 0.814300 |
| 180 | 0.710300 |