Examples
We have prepared a collection of example notebooks for each of the implemented controls in our toolkit (including examples of how to implement methods via wrappers), as well as demonstrations of more extensive benchmarks.
Controls
- Input control: methods that adapt the input (prompt) before the model is called. Current notebooks cover:
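As an illustration of the idea, an input control can be sketched as a thin wrapper that rewrites the prompt and leaves the model call untouched. The `InputControl` class and `apply` method below are hypothetical names for this sketch, not the toolkit's actual API:

```python
# Minimal sketch of an input control: the prompt is adapted before the
# (here, stand-in) model is called. Names are illustrative only.
class InputControl:
    """Rewrites the prompt according to a template."""

    def __init__(self, template: str):
        self.template = template  # e.g. "{instruction}\n\n{prompt}"

    def apply(self, prompt: str, instruction: str) -> str:
        # The model call itself is unchanged; only its input is adapted.
        return self.template.format(instruction=instruction, prompt=prompt)


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"<response to: {prompt!r}>"


control = InputControl("{instruction}\n\n{prompt}")
adapted = control.apply("What is 2 + 2?", "Answer concisely.")
print(fake_model(adapted))
```

The same pattern generalizes to any prompt transformation (few-shot prefixes, system instructions, retrieval context) because the control never touches model internals.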
- Structural control: methods that adapt the model's weights or architecture. Current notebooks cover:
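The weight-adaptation idea can be shown in miniature with a LoRA-style low-rank update merged into a tiny weight matrix. This is a toy sketch in pure Python (real implementations operate on tensors inside the model), and the variable names are illustrative:

```python
# Toy sketch of a structural control: a low-rank (LoRA-style) update
# A @ B is merged into a frozen 2x2 weight matrix W.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights
A = [[0.5], [0.0]]             # low-rank factor, shape (2, 1)
B = [[0.0, 1.0]]               # low-rank factor, shape (1, 2)

delta = matmul(A, B)           # rank-1 update
W_adapted = [[W[i][j] + delta[i][j] for j in range(2)] for i in range(2)]
print(W_adapted)
```

Only the small factors `A` and `B` are trained; the base weights stay frozen, which is what makes such adapters cheap to store and swap.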
- State control: methods that influence the model's internal states (activations, attention weights, etc.) at inference time. Current notebooks cover:
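The mechanism behind state control can be sketched as a hook that shifts a layer's activations toward a steering vector at inference time. The `layer` function and `with_steering` wrapper below are stand-ins, not the toolkit's API:

```python
# Illustrative state control: a hook adds a scaled steering vector to a
# layer's output activations at inference time.
def layer(hidden):
    # Stand-in for a transformer layer's forward pass.
    return [h * 2.0 for h in hidden]


def with_steering(layer_fn, vector, alpha):
    def hooked(hidden):
        out = layer_fn(hidden)
        # Shift the internal state in the steering direction, scaled by alpha.
        return [o + alpha * v for o, v in zip(out, vector)]
    return hooked


steered = with_steering(layer, vector=[1.0, -1.0], alpha=0.5)
print(steered([1.0, 1.0]))
```

Because the hook wraps the layer rather than modifying weights, the steering strength `alpha` can be changed, or the hook removed, without retraining anything.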
- Output control: methods that influence the model's behavior via the generate() method. Current notebooks cover:
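A common shape for output control is a logits processor applied at each decoding step, which is the kind of hook generation loops expose. The sketch below shows the idea with a toy vocabulary and greedy decoding; the function names are illustrative:

```python
# Sketch of an output control: a logits processor masks a token at each
# generation step, so decoding falls back to the next-best token.
def ban_token(logits, banned_id):
    out = list(logits)
    out[banned_id] = float("-inf")  # the banned token can never be chosen
    return out


def greedy_step(logits):
    # Pick the highest-scoring token id.
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [0.1, 2.0, 1.5]
print(greedy_step(logits))                        # unconstrained choice
print(greedy_step(ban_token(logits, banned_id=1)))  # constrained choice
```

The same interface supports reranking, constrained decoding, and search-based methods, since all of them only need to see and edit the per-step scores.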
Benchmarks
- Instruction following: This notebook studies the effect of post-hoc attention steering (PASTA) on a model's ability to follow instructions. We sweep over the steering strength and investigate the trade-off between instruction-following ability and general response quality.
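The structure of such a sweep can be sketched as follows. The two scoring functions here are toy stand-ins for real instruction-following and response-quality metrics, chosen only to show the trade-off shape the notebook studies:

```python
# Skeleton of a steering-strength sweep. The metrics are made-up toys:
# stronger steering helps following but eventually hurts quality.
def follow_score(alpha):
    return min(1.0, 0.4 + 0.6 * alpha)


def quality_score(alpha):
    return max(0.0, 1.0 - 0.5 * alpha)


results = []
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    results.append((alpha, follow_score(alpha), quality_score(alpha)))

for alpha, f, q in results:
    print(f"alpha={alpha:.2f}  follow={f:.2f}  quality={q:.2f}")
```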
- Commonsense MCQA: This notebook benchmarks steering methods on the CommonsenseQA dataset, comparing few-shot prompting against a LoRA adapter trained with DPO. We sweep over the number of few-shot examples and study how accuracy scales relative to the fine-tuned baseline across two models.
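Sweeping over the number of few-shot examples amounts to rebuilding the prompt with a varying number of in-context demonstrations. A minimal sketch, with made-up example rows rather than actual CommonsenseQA data:

```python
# Illustrative few-shot MCQA prompt builder; `examples` is fabricated
# demonstration data, not rows from CommonsenseQA.
examples = [
    {"q": "Where do you keep milk cold?", "choices": ["oven", "fridge"], "a": "fridge"},
    {"q": "What do you use to read?", "choices": ["book", "spoon"], "a": "book"},
]


def build_prompt(question, choices, n_shots):
    parts = []
    for ex in examples[:n_shots]:
        parts.append(f"Q: {ex['q']}\nChoices: {', '.join(ex['choices'])}\nA: {ex['a']}")
    # The target question comes last, with the answer left blank.
    parts.append(f"Q: {question}\nChoices: {', '.join(choices)}\nA:")
    return "\n\n".join(parts)


prompt = build_prompt("What cuts paper?", ["scissors", "pillow"], n_shots=1)
print(prompt)
```

Accuracy at each `n_shots` can then be compared against the fixed fine-tuned baseline.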
- Composite steering for truthfulness: One of the primary features of the toolkit is the ability to compose multiple steering methods into a single model operation. This notebook composes a state control (PASTA) with an output control (DeAL) to improve the model's truthfulness (as measured on TruthfulQA) without significantly degrading informativeness. We sweep over the joint parameter space of the two controls and compare each control's individual trade-off between truthfulness and informativeness to that of the composition.
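The composition idea can be sketched as a simple pipeline in which a state control runs before the model and an output control runs after it. The controls below are toy stand-ins for PASTA and DeAL, and the function names are hypothetical:

```python
# Toy sketch of composing a state control and an output control into
# one model operation: state control -> model -> output control.
def state_control(hidden):
    # Stand-in for activation steering (e.g. PASTA-like).
    return [h + 0.1 for h in hidden]


def output_control(text):
    # Stand-in for decoding-time control (e.g. DeAL-like reranking).
    return text.upper()


def model(hidden):
    # Stand-in for the model's forward + decode step.
    return "sum=" + str(round(sum(hidden), 2))


def compose(hidden):
    # Both controls apply within a single call, as in the toolkit's
    # composition feature described above.
    return output_control(model(state_control(hidden)))


print(compose([0.4, 0.5]))
```

Because each control wraps a different stage, their parameters can be swept jointly without the controls interfering with each other's mechanism.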