Examples

We have prepared a collection of example notebooks for each of the controls implemented in our toolkit (including methods implemented via wrappers), as well as demonstrations of more extensive benchmarks.

Controls

  • Input control

    Input control methods adapt the input (prompt) before the model is called. Current notebooks cover:

    FewShot
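As a concrete illustration of input control, the sketch below builds a few-shot prompt by prepending worked examples to the query before the model is called. The function and its names are illustrative only, not the toolkit's actual API:

```python
# Hypothetical sketch of few-shot (in-context) prompting: the control
# rewrites the prompt before the model sees it. Illustrative names only.

def build_few_shot_prompt(examples, query, instruction=""):
    """Prepend worked (question, answer) examples so the model can imitate them."""
    parts = [instruction] if instruction else []
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demos = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
]
prompt = build_few_shot_prompt(
    demos, "What is 7 + 6?", instruction="Answer the arithmetic question."
)
```

The resulting string is passed to the model in place of the raw query; the model itself is untouched.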

  • Structural control

    Structural control methods adapt the model's weights and/or architecture. Current notebooks cover:

    MergeKit wrapper

    TRL wrapper
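To give a flavor of what a structural control does, the sketch below linearly interpolates two models' weights, the simplest form of model merging. This is an illustrative plain-PyTorch example, not MergeKit's API:

```python
# Illustrative sketch (not MergeKit's API): linear weight interpolation,
# the basic operation that model-merging methods build on.
import torch
import torch.nn as nn

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Return a state dict equal to (1 - alpha) * sd_a + alpha * sd_b."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# Two "models" with identical architecture but different weights.
model_a, model_b = nn.Linear(4, 4), nn.Linear(4, 4)

merged = nn.Linear(4, 4)
merged.load_state_dict(
    interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha=0.5)
)
```

Unlike input or output controls, the change is baked into the weights: every subsequent call uses the merged model with no per-call overhead.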

  • State control

    State control methods influence the model's internal states (activations, attention weights, etc.) at inference time. Current notebooks cover:

    ActAdd

    CAA

    CAST

    ITI

    PASTA
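An ActAdd-style state control can be sketched with a PyTorch forward hook that adds a fixed steering vector to a layer's output at inference time. The layer, vector, and strength below are illustrative placeholders, not the toolkit's API:

```python
# Illustrative activation-addition sketch (hypothetical, not the toolkit's
# API): a forward hook adds a steering vector to one layer's output.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)            # stand-in for one transformer sub-layer
steering_vector = torch.randn(8)   # direction in activation space
strength = 4.0                     # steering strength (the swept parameter)

def add_steering(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + strength * steering_vector

handle = layer.register_forward_hook(add_steering)
x = torch.randn(1, 8)
steered = layer(x)
handle.remove()                    # detach the hook to restore default behavior
unsteered = layer(x)
```

The weights are never modified; the intervention exists only while the hook is registered, which is what makes state controls cheap to toggle and sweep.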

  • Output control

    Output control methods influence the model's behavior via the generate() method. Current notebooks cover:

    DeAL

    RAD

    SASA

    ThinkingIntervention
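The idea behind output control can be sketched by editing next-token logits before a token is chosen, which is the kind of intervention generate()-time controls perform (Hugging Face's generate() exposes the same hook point via logits processors). The helper below is a hypothetical illustration, not the toolkit's API:

```python
# Hypothetical sketch of output control: steer generation by editing the
# next-token logits before the token is selected. Illustrative names only.
import torch

def ban_tokens(logits, banned_ids):
    """Set banned token logits to -inf so those tokens can never be chosen."""
    logits = logits.clone()
    logits[banned_ids] = float("-inf")
    return logits

vocab_size = 10
logits = torch.zeros(vocab_size)
logits[3] = 5.0   # the unconstrained argmax
logits[7] = 4.0   # runner-up

constrained = ban_tokens(logits, banned_ids=[3])
next_token = int(constrained.argmax())   # token 7 once token 3 is banned
```

Because the edit happens per decoding step, output controls can enforce constraints the model's weights and prompt cannot guarantee on their own.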

Benchmarks

  • Instruction following

    This notebook studies the effect of post-hoc attention steering (PASTA) on a model's ability to follow instructions. We sweep over the steering strength and investigate the trade-off between a model's instruction-following ability and general response quality.

    See the benchmark

  • Commonsense MCQA

    This notebook benchmarks steering methods on the CommonsenseQA dataset, comparing few-shot prompting against a LoRA adapter trained with DPO. We sweep over the number of few-shot examples and study how accuracy scales relative to the fine-tuned baseline across two models.

    See the benchmark

  • Composite steering for truthfulness

    One of the primary features of the toolkit is the ability to compose multiple steering methods into a single model operation. This notebook composes a state control (PASTA) with an output control (DeAL), with the goal of improving the model's truthfulness (as measured on TruthfulQA) without significantly degrading informativeness. We sweep over the joint parameter space of the controls and compare each individual control's performance, measured via the trade-off between truthfulness and informativeness, to that of the composition.

    See the benchmark
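The kind of composition described above can be sketched as function composition over a generation callable: each control wraps the callable, and stacking the wrappers yields one steered model operation. Every name below is hypothetical, not the toolkit's API:

```python
# Hypothetical sketch of composing controls (illustrative names only):
# each control wraps a text-generation callable, so composition is just
# nesting the wrappers.

def with_prompt_prefix(generate, prefix):
    """Input-style control: rewrite the prompt before calling the model."""
    return lambda prompt: generate(prefix + prompt)

def with_postprocess(generate, fn):
    """Output-style control: rewrite the model's response after generation."""
    return lambda prompt: fn(generate(prompt))

# Stand-in for a real model's generate call.
base_model = lambda prompt: f"<response to: {prompt}>"

steered = with_postprocess(
    with_prompt_prefix(base_model, "Be truthful. "),
    str.upper,
)
out = steered("Is the earth flat?")
```

Because each wrapper leaves the callable's signature unchanged, controls can be added, removed, or reordered independently, which is what makes sweeping the joint parameter space tractable.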