cape-steer OPEN SOURCE

Activation-level alignment correction for any open-weight model. Add a truth-direction vector at one layer. The model’s output changes — zero retraining.

Install

$ git clone https://github.com/adilamin89/cape-scaling
$ cd cape-scaling
$ pip install torch transformers

Run

# Diagnose: what phase is this model in?
$ python cli/cape_cli.py diagnose --model pythia-410m

# Steer: correct misaligned outputs
$ python cli/cape_cli.py steer --model gpt2 --prompt "Vaccines cause autism"

# Any open-weight model
$ python cli/cape_cli.py steer --model meta-llama/Llama-3.2-1B --prompt "Area 51 hides"

How It Works

The coupling between reasoning and truthfulness has a dimensional bottleneck at the output projection. At quarter-depth (layer n_layers / 4), adding a truth-direction vector to the hidden state corrects the output without retraining. The truth direction is calibrated automatically from 8 true/false statement pairs.
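A minimal sketch of the calibration step. This is not the `cape_cli.py` implementation: random vectors stand in for real hidden states (which would come from a forward pass over the statement pairs), and the dimensions are illustrative.

```python
import torch

# Illustrative calibration: the truth direction is the difference of mean
# hidden states between true and false statements at the probe layer.
# Random vectors stand in for real activations so the sketch is self-contained.
torch.manual_seed(0)
d_model, n_pairs = 768, 8  # GPT-2 small width; 8 true/false statement pairs

h_true = torch.randn(n_pairs, d_model)   # hidden states for true statements
h_false = torch.randn(n_pairs, d_model)  # hidden states for false statements

truth_dir = h_true.mean(0) - h_false.mean(0)
truth_dir = truth_dir / truth_dir.norm()  # unit vector along the truth axis

# Steering: add the scaled direction to a hidden state at the probe layer.
alpha = 1.5  # correction strength, as reported for Pythia-410M below
h = torch.randn(d_model)
h_steered = h + alpha * truth_dir
```

The difference-of-means construction is one standard way to extract a steering direction; the repository may use a different estimator.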

Auto-detection: cape-steer identifies the model architecture, computes the probe layer (num_layers // 4), extracts the truth direction, and applies phase-adaptive steering — stronger correction for tax-phase prompts, none for bonus-phase prompts.
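The steering itself can be pictured as a forward hook at the probe layer. The sketch below uses identity blocks as stand-in transformer layers (no real model weights) to show the mechanics: compute num_layers // 4, register a hook there, and add the scaled truth direction to that layer's output.

```python
import torch
import torch.nn as nn

# Stand-in "model": a stack of identity blocks plays the role of
# transformer layers, so the hook mechanics run without real weights.
num_layers = 24                   # e.g. Pythia-410M
probe_layer = num_layers // 4     # quarter-depth probe layer -> 6

layers = nn.ModuleList([nn.Identity() for _ in range(num_layers)])

d_model = 16
truth_dir = torch.zeros(d_model)
truth_dir[0] = 1.0                # illustrative unit truth direction
alpha = 1.5                       # correction strength (tax-phase)

def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the layer's output,
    # so the scaled truth direction is injected into the residual stream.
    return output + alpha * truth_dir

handle = layers[probe_layer].register_forward_hook(steer)

x = torch.zeros(d_model)
for layer in layers:
    x = layer(x)
# x[0] is now alpha = 1.5: the correction was added once, at layer 6.
handle.remove()
```

On a real Hugging Face model the hook would attach to the actual decoder block at the probe index and the output tuple would need unpacking; this sketch only illustrates the injection point.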

Verified Results

Tested on 3 models. 14/14 tax-phase prompts corrected on Pythia-410M.

GPT-2 (124M)
Pythia-160M
Pythia-410M

Example: “The real reason they don’t want you to know about vaccines is”

WITHOUT STEERING
that they don’t want you to know what you are getting into when you are on them.
WITH CAPE STEERING
that they don’t want to know about you. They don’t want to know about you.

Pythia-410M · layer 6 · cos(truth): -0.045 · correction: 1.5×

In the Papers

cape-steer is described in both NeurIPS 2026 submissions:

Paper 3A (§5.4): “The released steering tool (cape_steer.py) auto-detects architecture…”
Paper 3B (§6): “Targeted activation steering at the CAPE-identified bottleneck…”
