Activation Addition (ActAdd)¶
Paper: Steering Language Models With Activation Engineering
Authors: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid
Activation Addition (ActAdd) is a state control method that steers model behavior by computing a positional steering vector from a single pair of short prompts and injecting it during the initial forward pass.
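To make the mechanism concrete, here is a minimal NumPy sketch of the idea under simplifying assumptions. It is not the toolkit's implementation: the function names `steering_vector` and `inject` are hypothetical, the arrays stand in for layer activations of shape `(tokens, d_model)`, and `start` is a zero-based index used only for this illustration (how the toolkit indexes `alignment` is not shown here).

```python
import numpy as np


def steering_vector(h_pos, h_neg):
    """Per-position difference of two activation arrays of equal shape."""
    return h_pos - h_neg


def inject(resid, vec, multiplier=1.0, start=0):
    """Return a copy of `resid` with multiplier * vec added from index `start`.

    Positions before `start`, and positions past the end of `vec`, are unchanged.
    """
    out = resid.copy()
    # Number of positions that actually overlap with the sequence.
    n = max(0, min(vec.shape[0], resid.shape[0] - start))
    out[start:start + n] += multiplier * vec[:n]
    return out


# Toy activations: a 2-token prompt pair and a 4-token input, d_model = 2.
h_pos = np.array([[2.0, 0.0], [1.0, 1.0]])
h_neg = np.array([[1.0, 0.0], [1.0, 0.0]])
vec = steering_vector(h_pos, h_neg)
steered = inject(np.zeros((4, 2)), vec, multiplier=5.0, start=1)
```

In this toy run only positions 1 and 2 of the input are modified, which is what "positional" means here: each row of the vector is added to one specific token position rather than to every position.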
Method Parameters¶
| parameter | type | description |
|---|---|---|
| `positive_prompt` | `str` | Prompt representing the desired direction (e.g., "Love") |
| `negative_prompt` | `str` | Prompt representing the opposite direction (e.g., "Hate") |
| `steering_vector` | `SteeringVector` | Pre-computed steering vector (alternative to prompts) |
| `layer_id` | `int` | Layer to inject at. If None, defaults to ~20% depth |
| `multiplier` | `float` | Scaling coefficient (called c in the paper). Typical values range from 1 to 15 |
| `alignment` | `int` | Token position at which to begin injecting (called a in the paper). Default: 1 |
| `normalize_vector` | `bool` | If True, L2-normalize each position's direction vector before applying |
| `use_norm_preservation` | `bool` | If True, wrap the transform in `NormPreservingTransform` to prevent distribution shift |
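The last two options can be read as follows. This is a hedged NumPy sketch of one plausible interpretation, not the toolkit's code: `l2_normalize_rows` and `add_preserving_norm` are hypothetical names, and the actual behavior of `NormPreservingTransform` is not shown in this notebook.

```python
import numpy as np


def l2_normalize_rows(vec, eps=1e-8):
    """Scale each position's vector (each row) to unit L2 norm."""
    norms = np.linalg.norm(vec, axis=-1, keepdims=True)
    return vec / np.maximum(norms, eps)


def add_preserving_norm(hidden, delta, eps=1e-8):
    """Add `delta`, then rescale each row back to its pre-steering L2 norm.

    Assumption: "norm preservation" means the steered activation keeps the
    magnitude of the original activation and only its direction changes.
    """
    steered = hidden + delta
    old_norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    new_norm = np.linalg.norm(steered, axis=-1, keepdims=True)
    return steered * (old_norm / np.maximum(new_norm, eps))


unit = l2_normalize_rows(np.array([[3.0, 4.0], [0.0, 2.0]]))
kept = add_preserving_norm(np.array([[3.0, 4.0]]), np.array([[5.0, 0.0]]))
```

Under this reading, normalization makes `multiplier` directly control the size of the added vector, while norm preservation keeps activation magnitudes in the range that later layers saw during training.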
Setup¶
If running this from a Google Colab notebook, please uncomment the following cell to install the toolkit. The following block is not necessary if running this notebook from a virtual environment where the package has already been installed.
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360
from aisteer360.algorithms.state_control.act_add.control import ActAdd
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline
import torch
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
For this demonstration, we use GPT-2-XL, the same model used in the original paper.
MODEL_NAME = "gpt2-xl"
We compare baseline and steered behavior on the following test prompts.
test_prompts = [
    "I hate you because",
    "I think you're",
    "My favorite thing about life is",
    "I went up to my friend and said",
]
Baseline Model Behavior¶
We first generate responses from the (unsteered) baseline model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
device = model.device

gen_params = {
    "max_new_tokens": 30,
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.3,
    "repetition_penalty": 1.1,
    "pad_token_id": tokenizer.eos_token_id,
}

baseline_responses = []
for prompt in test_prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, **gen_params)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    baseline_responses.append(response)

print("Baseline completions:\n")
for prompt, response in zip(test_prompts, baseline_responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")
Baseline completions:

Prompt: I hate you because
Response: I hate you because I'm a good person." "I don't know what to say," she said. "I feel like I've been betrayed by my

Prompt: I think you're
Response: I think you're going to see a lot of people who are not in the industry, who are not in the media, who are going to be inspired by this.

Prompt: My favorite thing about life is
Response: My favorite thing about life is that you can do anything you want to do. You don't have to be stuck in a job, stuck in a relationship, stuck in a school

Prompt: I went up to my friend and said
Response: I went up to my friend and said, 'Hey, I'm going to the bathroom.' And he said, 'What are you doing?' And I said, 'I'm going to
Sentiment Steering¶
As in the original paper, we demonstrate sentiment steering using a "Love" vs "Hate" prompt pair, applied at layer 6 with a multiplier of 5.
act_add_sentiment = ActAdd(
    positive_prompt="Love",
    negative_prompt="Hate",
    layer_id=6,
    multiplier=5.0,
    alignment=1,
)

sentiment_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_sentiment],
)
sentiment_pipeline.steer()
from tabulate import tabulate
import textwrap

sentiment_responses = []
for prompt in test_prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = sentiment_pipeline.generate(input_ids=input_ids, **gen_params)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    sentiment_responses.append(response)

def wrap(text, width=50):
    return '\n'.join(textwrap.wrap(text, width=width))

table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(sentiment_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered (Love - Hate)"],
    tablefmt="grid",
))
+--------------------+------------------------------------------+-----------------------------------------+
| prompt             | baseline                                 | steered (Love - Hate)                   |
+====================+==========================================+=========================================+
| I hate you because | I hate you because I'm a good person."   | and love                                |
|                    | "I don't know what to say," she said. "I |                                         |
|                    | feel like I've been betrayed by my       |                                         |
+--------------------+------------------------------------------+-----------------------------------------+
| I think you're     | I think you're going to see a lot of     | is a great album, but it's not a        |
|                    | people who are not in the industry, who  | perfect album. It's a great album that  |
|                    | are not in the media, who are going to   | is still flawed and imperfect. I'm sure |
|                    | be inspired by this.                     | you                                     |
+--------------------+------------------------------------------+-----------------------------------------+
| My favorite thing  | My favorite thing about life is that you | the most love. . . I love my life. I    |
| about life is      | can do anything you want to do. You      | love my life. I love my life. I love my |
|                    | don't have to be stuck in a job, stuck   | life. I                                 |
|                    | in a relationship, stuck in a school     |                                         |
+--------------------+------------------------------------------+-----------------------------------------+
| I went up to my    | I went up to my friend and said, 'Hey,   | "I love you too." We were in the        |
| friend and said    | I'm going to the bathroom.' And he said, | middle of a long day of travel, and we  |
|                    | 'What are you doing?' And I said, 'I'm   | had just gotten back from a trip        |
|                    | going to                                 |                                         |
+--------------------+------------------------------------------+-----------------------------------------+
Topic Steering¶
ActAdd can also steer the model toward specific topics. Here we use the wedding example from the paper, intervening at layer 20 with a multiplier of 4.
act_add_topic = ActAdd(
    positive_prompt="I talk about weddings constantly",
    negative_prompt="I do not talk about weddings constantly",
    layer_id=20,
    multiplier=4.0,
    alignment=1,
)

topic_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_topic],
)
topic_pipeline.steer()
topic_responses = []
for prompt in test_prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = topic_pipeline.generate(input_ids=input_ids, **gen_params)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    topic_responses.append(response)

table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(topic_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered"],
    tablefmt="grid",
))
+--------------------+------------------------------------------+-----------------------------------------+
| prompt             | baseline                                 | steered                                 |
+====================+==========================================+=========================================+
| I hate you because | I hate you because I'm a good person."   | about the same time, I was on a plane   |
|                    | "I don't know what to say," she said. "I | to San Francisco. And it was like, "Oh  |
|                    | feel like I've been betrayed by my       | my God, I'm in San Francisco."          |
+--------------------+------------------------------------------+-----------------------------------------+
| I think you're     | I think you're going to see a lot of     | about to see a very, very big change in |
|                    | people who are not in the industry, who  | the way we do things." The new system   |
|                    | are not in the media, who are going to   | will be tested in the first quarter of  |
|                    | be inspired by this.                     | 2017                                    |
+--------------------+------------------------------------------+-----------------------------------------+
| My favorite thing  | My favorite thing about life is that you | and always will be the wedding. I love  |
| about life is      | can do anything you want to do. You      | to talk about weddings, but I also love |
|                    | don't have to be stuck in a job, stuck   | talking about how we get married. I'm a |
|                    | in a relationship, stuck in a school     | big believer                            |
+--------------------+------------------------------------------+-----------------------------------------+
| I went up to my    | I went up to my friend and said, 'Hey,   | about my wife and how we were going to  |
| friend and said    | I'm going to the bathroom.' And he said, | do a wedding in two                     |
|                    | 'What are you doing?' And I said, 'I'm   |                                         |
|                    | going to                                 |                                         |
+--------------------+------------------------------------------+-----------------------------------------+
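The two wedding prompts differ in length, so their per-position activations cannot be subtracted directly. The paper's algorithm right-pads the shorter prompt to the same token length before taking the difference. The sketch below illustrates that step on word-level stand-ins for tokens; `pad_to_same_length` is a hypothetical helper, the single-space pad token is an assumption, and whether the toolkit pads this way is not shown in this notebook.

```python
def pad_to_same_length(tokens_a, tokens_b, pad_token=" "):
    """Right-pad the shorter token list so both lists have equal length."""
    target = max(len(tokens_a), len(tokens_b))
    padded_a = list(tokens_a) + [pad_token] * (target - len(tokens_a))
    padded_b = list(tokens_b) + [pad_token] * (target - len(tokens_b))
    return padded_a, padded_b


# Word-level stand-ins; a real run would use the model tokenizer's token ids.
positive = "I talk about weddings constantly".split()
negative = "I do not talk about weddings constantly".split()
padded_pos, padded_neg = pad_to_same_length(positive, negative)
```

After padding, both prompts yield activation arrays of the same shape, so the positional difference is well defined at every token index.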
Summary¶
This notebook demonstrated Activation Addition (ActAdd) for lightweight behavior steering:
- ActAdd computes a positional steering vector from just two short prompts, enabling rapid experimentation.
- The sentiment example showed how a simple "Love" vs "Hate" contrast shifts emotional tone using layer 6 and coefficient 5.
- The topic example demonstrated steering toward wedding-related content using layer 20 and coefficient 4.
ActAdd trades statistical robustness for speed and simplicity: it derives the steering vector from a single prompt pair, whereas contrastive activation addition (CAA) aggregates over many pairs. Because the steering vector is positional (injected at specific token positions rather than broadcast across the sequence), it allows fine-grained control over where in the sequence the steering takes effect.
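The contrast with CAA can be sketched in a few lines of NumPy. This is a simplified illustration, not either method's reference implementation: `mean_difference_vector` and `broadcast_add` are hypothetical names, and the arrays are toy stand-ins for one activation per prompt of shape `(pairs, d_model)`.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """CAA-style aggregate: average the difference over many prompt pairs.

    Inputs have shape (pairs, d_model); the result has shape (d_model,).
    """
    return np.mean(pos_acts - neg_acts, axis=0)


def broadcast_add(resid, vec, multiplier=1.0):
    """Add the same vector to every token position of `resid`."""
    return resid + multiplier * vec


# Two toy pairs whose differences are [1, 0] and [3, 0].
pos_acts = np.array([[1.0, 0.0], [3.0, 0.0]])
neg_acts = np.zeros((2, 2))
mean_vec = mean_difference_vector(pos_acts, neg_acts)
shifted = broadcast_add(np.ones((3, 2)), mean_vec, multiplier=0.5)
```

Averaging smooths out the quirks of any individual pair at the cost of collecting more prompts, while the single-pair positional vector used by ActAdd needs only one forward pass per prompt.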