Welcome!
The AI Steerability 360 toolkit is an extensible library for general-purpose steering of LLMs.
The term steering describes any deliberate action to change a model's behavior. Building on this term, the concept of steerability has come to describe the ease (and extent) with which a model can be steered toward a given behavior [1, 2, 3]. Quantifying a model's steerability is desirable primarily because it reveals how much a model's generations can be controlled, which in turn contributes to a better understanding of the model's general usability, safety, and alignment [4].
The AI Steerability 360 toolkit (AISteer360) provides a structured framework for both steering models and evaluating their steerability. To help organize the wide range of steering methods (e.g., few-shot learning, activation steering, attention reweighting, parameter-efficient fine-tuning, reward-driven decoding, etc.), the toolkit structures methods (hereafter referred to as controls) across four categories: input, structural, state, and output. Assuming that outputs \( y \) are generated from a base (unsteered) model as \( y \sim p_\theta(x) \), where \( x \) is the input/prompt, \( \theta \) is the model's parameters, and \( p_\theta(x) \) is the model's (conditional) distribution over outputs given \( x \), control for each category is exerted as follows.
- Input control: \( y \sim p_\theta(\sigma(x)) \)
    - Methods that manipulate the input/prompt to guide model behavior without modifying the model.
    - Facilitated through a prompt adapter \( \sigma(x) \) applied to the original prompt \( x \).
- Structural control: \( y \sim p_{\theta'}(x) \)
    - Methods that modify the model's underlying parameters or augment the model's architecture.
    - Facilitated through fine-tuning, adapter layers, or architectural interventions to yield new weights \( \theta' \).
- State control: \( y \sim p_{\theta}^{a}(x) \)
    - Methods that modify the model's internal states (e.g., activations, attentions) at inference time.
    - Facilitated through hooks that are inserted into the model to manipulate internal variables during the forward pass.
- Output control: \( y \overset{d}{\sim} p_\theta(x) \)
    - Methods that modify model outputs or constrain/transform what leaves the decoder.
    - Facilitated through decoding-time algorithms/filters that override the `generate` method.
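To make the categories concrete, the sketch below contrasts an unsteered generation with a simple input control, using plain Hugging Face `transformers`. The adapter function (`persona_adapter`) and the choice of model are illustrative assumptions for this sketch, not part of the AISteer360 API.

```python
# Minimal illustration of input control, y ~ p_theta(sigma(x)), using plain
# Hugging Face transformers. The adapter below is a hypothetical example and
# is not an AISteer360 control.
from transformers import pipeline

def persona_adapter(prompt: str) -> str:
    """Prompt adapter sigma(x): prepend steering instructions to the prompt."""
    return "Respond as a concise, formal assistant.\n\n" + prompt

generator = pipeline("text-generation", model="gpt2")  # any causal LM works here

x = "Explain what it means to steer a language model."
baseline = generator(x, max_new_tokens=40)[0]["generated_text"]                  # y ~ p_theta(x)
steered = generator(persona_adapter(x), max_new_tokens=40)[0]["generated_text"]  # y ~ p_theta(sigma(x))

print(baseline)
print(steered)
```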
Given the above structure, AISteer360 enables the composition of various controls into a single operation on a given model (each exercising control over a different component), in what we term a steering pipeline. Steering pipelines can consist of a single control (e.g., activation steering) or a sequence of multiple controls (e.g., LoRA followed by reward-augmented decoding). This flexibility allows users to evaluate the impact of various steering methods (and combinations thereof) on a given model.
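As a rough sketch of the idea (not the toolkit's actual interface), a steering pipeline can be thought of as an ordered set of controls applied around a single generation call. The class below is a hypothetical stand-in that composes only input and output controls; structural and state controls, which act on the weights and internal states, are omitted for brevity.

```python
# Conceptual sketch of a steering pipeline: controls are applied in sequence
# around one generation call. All names are illustrative placeholders, not the
# AISteer360 API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToySteeringPipeline:
    input_controls: list[Callable[[str], str]] = field(default_factory=list)   # act on the prompt x
    output_controls: list[Callable[[str], str]] = field(default_factory=list)  # act on the generated text y

    def run(self, prompt: str, generate: Callable[[str], str]) -> str:
        for control in self.input_controls:   # sigma(x)
            prompt = control(prompt)
        text = generate(prompt)                # y ~ p_theta(sigma(x))
        for control in self.output_controls:   # post-hoc output transformation
            text = control(text)
        return text

# Example: a prompt adapter composed with a whitespace-trimming output filter.
toy_pipeline = ToySteeringPipeline(
    input_controls=[lambda x: "Answer briefly: " + x],
    output_controls=[lambda y: y.strip()],
)
print(toy_pipeline.run("What is steering?", generate=lambda p: f"[model output for: {p}]"))
```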
To facilitate a principled comparison, we have developed `UseCase` and `Benchmark` classes. Use cases define tasks for a (steered) model and specify how performance on that task is measured (via evaluation metrics on the model's generations). Benchmarks facilitate the comparison of steering pipelines on a given use case. This provides a unified structure for testing and comparing methods, addressing the current fragmentation in the field, where steering algorithms are typically developed and evaluated within isolated, task-specific environments [5].
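As a hedged illustration of how use cases and benchmarks fit together, the sketch below bundles prompts with a metric and scores several candidate pipelines against it. The names and signatures are hypothetical placeholders, not the toolkit's actual classes.

```python
# Toy sketch of the use-case / benchmark pattern described above. All names
# and signatures are hypothetical placeholders, not the AISteer360 API.
from typing import Callable

class ToyUseCase:
    """Bundles task prompts with a metric over the model's generations."""
    prompts = [
        "Summarize the benefits of unit testing in one sentence.",
        "Summarize the benefits of code review in one sentence.",
    ]

    @staticmethod
    def score(generations: list[str]) -> float:
        # Placeholder metric: fraction of generations at or under 30 words.
        return sum(len(g.split()) <= 30 for g in generations) / len(generations)

def run_benchmark(use_case, pipelines: dict[str, Callable[[str], str]]) -> dict[str, float]:
    """Run each (steered) generation function over the use case and compare scores."""
    return {
        name: use_case.score([generate(p) for p in use_case.prompts])
        for name, generate in pipelines.items()
    }

# Example with two stand-in "pipelines" (plain functions for illustration).
scores = run_benchmark(ToyUseCase(), {
    "baseline": lambda p: f"[unsteered output for: {p}]",
    "steered": lambda p: f"[steered output for: {p}]",
})
print(scores)
```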
We encourage the community to use AISteer360 in their steering workflows. We will continue to develop in the open, and encourage users to suggest any additional features or raise any issues on our GitHub page.
1. Erik Miehling, Michael Desmond, Karthikeyan Natesan Ramamurthy, Elizabeth M. Daly, Kush R. Varshney, Eitan Farchi, Pierre Dognin, Jesus Rios, Djallel Bouneffouf, Miao Liu, and Prasanna Sattigeri. Evaluating the prompt steerability of large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 7874–7900. Albuquerque, New Mexico, April 2025. Association for Computational Linguistics. URL: https://aclanthology.org/2025.naacl-long.400/, doi:10.18653/v1/2025.naacl-long.400.
2. Keyon Vafa, Sarah Bentley, Jon Kleinberg, and Sendhil Mullainathan. What's producible may not be reachable: Measuring the steerability of generative models. arXiv preprint arXiv:2503.17482, 2025.
3. Trenton Chang, Tobias Schnabel, Adith Swaminathan, and Jenna Wiens. A course correction in steerability evaluation: Revealing miscalibration and side effects in LLMs. arXiv preprint arXiv:2505.23816, 2025.
4. Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, and others. A roadmap to pluralistic alignment. arXiv preprint arXiv:2402.05070, 2024.
5. Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, and others. Controllable text generation for large language models: A survey. arXiv preprint arXiv:2408.12599, 2024.