
Steering Controls

Note

This document provides a conceptual overview of model steering. To add your own steering control/method, please refer to the tutorial. For a better understanding of how steering methods can be composed, please see the high-level outline of steering pipelines.

There are various ways to steer a model. We structure steering methods across four categories of control, loosely defined as:

  • input: edits the prompt
  • structural: edits the weights/architecture
  • state: edits the (hidden) states
  • output: edits the decoding/sampling process

The category of a given steering method is dictated by what aspect of the model the method influences. We define each category of control below.

Input control

Baseline model: \(y \sim p_\theta(x)\)

Steered model: \(y \sim p_\theta(\sigma(x))\)

Input control methods describe algorithms that manipulate the input/prompt to guide model behavior. They do not change the model itself. This is enabled in the toolkit through a prompt adapter \(\sigma(x)\) applied to the original prompt \(x\).

For a control method to be deemed an input control method, it must satisfy the following requirements:

  • Control: The method only influences the prompt supplied to the model; it does not change the model's internals (parameters/states/logits).

  • Persistence: All changes are temporary; removing the prompt adapter \(\sigma(\cdot)\) yields the base model.

  • Access: Implemented without requiring access to the model's internals, e.g., hidden states.

Some examples of input control methods include: few-shot prompting, reasoning guidance (e.g., chain-of-thought (CoT), tree-of-thoughts (ToT), graph-of-thoughts (GoT), and self-consistency), automatic prompting methods, and prompt routing. Few-shot prompting is implemented in our toolkit under the control name FewShot (source code: algorithms/input_control/few_shot/control.py). See the notebook here: FewShot.
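
A minimal sketch of a prompt adapter for few-shot prompting is shown below. The class and field names are illustrative placeholders and do not reflect the toolkit's FewShot implementation; the point is that only the prompt \(x\) is transformed into \(\sigma(x)\) while the model itself is untouched.

```python
# Illustrative prompt adapter sigma(x) for few-shot prompting.
# Names are hypothetical and not the toolkit's FewShot API.
from dataclasses import dataclass, field


@dataclass
class FewShotAdapter:
    """Prepends (input, output) demonstrations to the original prompt."""

    demonstrations: list = field(default_factory=list)

    def __call__(self, prompt: str) -> str:
        # Build the few-shot context from demonstration pairs.
        shots = "\n\n".join(
            f"Input: {x}\nOutput: {y}" for x, y in self.demonstrations
        )
        # The model is untouched; only the prompt it receives changes.
        return f"{shots}\n\nInput: {prompt}\nOutput:"


adapter = FewShotAdapter(
    demonstrations=[
        ("The movie was wonderful.", "positive"),
        ("The plot made no sense.", "negative"),
    ]
)
steered_prompt = adapter("The acting was superb.")  # pass to any generate() call
```

Removing the adapter (i.e., passing \(x\) directly) recovers the baseline model, in line with the persistence requirement above.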

Structural control

Baseline model: \(y \sim p_\theta(x)\)

Steered model: \(y \sim p_{\theta'}(x)\)

Structural control methods alter the model's parameters or architecture to steer its behavior. These methods typically allow more aggressive changes to the model than input control methods. Structural controls are implemented via fine-tuning, adapter layers, or architectural modifications (e.g., merging) to yield an updated set of weights \(\theta'\).

Structural control methods satisfy the following requirements:

  • Control: Produces a new or modified set of weights \(\theta'\) or extends the network with additional modules/layers.

  • Persistence: Changes are persistent and live inside the checkpoint; reverting requires reloading or undoing the weight edit.

  • Access: Implementation requires access to parameters and (typically) gradient flows.

Examples of structural control methods include: fine-tuning methods (full and parameter-efficient), soft prompting (prefix tuning, p-tuning), and model merging. Many of the structural control methods in the toolkit are implemented using wrappers around existing libraries, e.g., Hugging Face's PEFT library. Some implementations of structural control methods can be found in the notebooks: MergeKit1, TRL2.
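
As an illustration of how an updated weight set \(\theta'\) is produced, the sketch below attaches LoRA adapters to a base model with Hugging Face's PEFT library. The model name and LoRA hyperparameters are placeholders rather than a recommended configuration.

```python
# Structural-control sketch: inject LoRA adapter layers with PEFT.
# Model name and hyperparameters are illustrative placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # baseline weights theta

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank adapter matrices
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2 attention projections receive adapters
    task_type="CAUSAL_LM",
)

# Wrapping the model adds trainable adapter modules; once fine-tuned, the
# result is a new parameter set theta'.
steered_model = get_peft_model(base_model, lora_config)
steered_model.print_trainable_parameters()
```

After fine-tuning (e.g., with TRL), the change lives inside the saved checkpoint; reverting requires unloading the adapters or reloading the original weights, consistent with the persistence requirement above.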

State control

Baseline model: \(y \sim p_\theta(x)\)

Steered model: \(y \sim p_{\theta}^a(x)\)

State control methods modify the model's internal/hidden states (e.g., activations, attentions, etc.) at inference time. These methods are implemented by defining hooks that are inserted/registered into the model to manipulate internal variables during the forward pass.

State control methods satisfy the following requirements:

  • Control: Writes to (augments) the model's internal/hidden states; model weights remain fixed.

  • Persistence: Changes are temporary; behavior reverts to baseline once hooks are removed.

  • Access: Requires access to internal states (to define hooks).

Some examples of state control methods include: activation addition/steering, attention steering, and representation patching. Example implementations of state control methods can be found in the following notebooks: CAST3, PASTA4.
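
The sketch below illustrates the general hook-based pattern with plain PyTorch: a forward hook adds a steering vector to the hidden states of one GPT-2 decoder block. The layer index and steering vector are illustrative assumptions (here the vector is random; in practice it would be derived, e.g., from contrastive activations), and this is not one of the toolkit's implementations.

```python
# State-control sketch: activation addition via a PyTorch forward hook.
# The layer index and steering vector are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

steering_vector = 0.1 * torch.randn(model.config.hidden_size)  # placeholder direction


def add_steering_vector(module, inputs, output):
    # GPT-2 decoder blocks return a tuple; the hidden states are the first element.
    hidden_states = output[0] + steering_vector
    return (hidden_states,) + output[1:]


# Register the hook on one decoder block; the weights themselves are never modified.
handle = model.transformer.h[6].register_forward_hook(add_steering_vector)

inputs = tokenizer("The weather today is", return_tensors="pt")
steered_ids = model.generate(**inputs, max_new_tokens=20)

handle.remove()  # removing the hook restores baseline behavior
```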

Output control

Baseline model: \(y \sim p_\theta(x)\)

Steered model: \(y \sim^d p_{\theta}(x)\)

Output control methods modify model outputs or constrain/transform what leaves the decoder. The base distribution \(p_\theta\) is left intact; only the path through the distribution changes.

Output control methods satisfy the following requirements:

  • Control: Replaces or constrains the decoding operator; no prompts, hidden states, or weights are altered.

  • Persistence: Changes are temporary; behavior is restored once decoding control is removed.

  • Access: Requires access to logits and token probabilities, and possibly hidden states (depending on the method).

Examples of output control methods include: sampling/search strategies, weighted decoding, and reward-augmented decoding. Some example methods can be found in the following notebooks: DeAL5, RAD6, SASA7, ThinkingIntervention8.
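
As a simple illustration of weighted decoding, the sketch below adds a bias to the logits of a few hand-picked tokens via a custom logits processor from the transformers library. The boosted tokens and bias value are arbitrary and not tied to any of the referenced methods; note that the base distribution \(p_\theta\) is never modified, only the decoding step that samples from it.

```python
# Output-control sketch: weighted decoding with a custom logits processor.
# The boosted tokens and bias value are illustrative assumptions.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")


class BoostTokensProcessor(LogitsProcessor):
    """Adds a constant bias to the logits of selected tokens before sampling."""

    def __init__(self, token_ids, bias=2.0):
        self.token_ids = token_ids
        self.bias = bias

    def __call__(self, input_ids, scores):
        scores[:, self.token_ids] += self.bias  # reweight logits; weights/states untouched
        return scores


boosted_ids = tokenizer.convert_tokens_to_ids(["Ġhappy", "Ġgreat"])
processors = LogitsProcessorList([BoostTokensProcessor(boosted_ids)])

inputs = tokenizer("Today I feel", return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=10, do_sample=True, logits_processor=processors
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Dropping the logits processor from the generate call restores baseline decoding, which is what the persistence requirement above describes.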


  1. Charles Goddard, Shamane Siriwardhana, Malikeh Ehghaghi, Luke Meyers, Vladimir Karpukhin, Brian Benedict, Mark McQuade, and Jacob Solawetz. Arcee's MergeKit: A toolkit for merging large language models. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 477–485. Miami, Florida, US, November 2024. Association for Computational Linguistics. URL: https://aclanthology.org/2024.emnlp-industry.36, doi:10.18653/v1/2024.emnlp-industry.36

  2. Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement learning. 2020. URL: https://github.com/huggingface/trl

  3. Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. In Proceedings of the 13th International Conference on Learning Representations (ICLR). 2025. 

  4. Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, and Tuo Zhao. Tell your model where to attend: Post-hoc attention steering for LLMs. In The Twelfth International Conference on Learning Representations. 2024. 

  5. James Y Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. DeAL: Decoding-time alignment for large language models. arXiv preprint arXiv:2402.06147, 2024. 

  6. Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 11781–11791. Singapore, December 2023. Association for Computational Linguistics. URL: https://aclanthology.org/2023.emnlp-main.721/, doi:10.18653/v1/2023.emnlp-main.721

  7. Ching-Yun Ko, Pin-Yu Chen, Payel Das, Youssef Mroueh, Soham Dan, Georgios Kollias, Subhajit Chaudhury, Tejaswini Pedapati, and Luca Daniel. Large language models can become strong self-detoxifiers. In The Thirteenth International Conference on Learning Representations. 2025. 

  8. Tong Wu, Chong Xiang, Jiachen T Wang, G Edward Suh, and Prateek Mittal. Effectively controlling reasoning models through thinking intervention. arXiv preprint arXiv:2503.24370, 2025.