Activation-level alignment correction for any open-weight model. Add a truth-direction vector at one layer. The model’s output changes — zero retraining.
The coupling between reasoning and truthfulness has a dimensional bottleneck at the output projection. At quarter-depth (layer num_layers // 4), adding a truth-direction vector to the hidden state corrects the output without retraining. The truth direction is calibrated automatically from 8 true/false statement pairs.
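A common way to calibrate such a direction is difference-of-means over the paired activations; the sketch below shows that idea on synthetic data. This is illustrative only, not the cape-steer implementation: the function name `truth_direction` and the array shapes are assumptions, and real use would extract the hidden states at the probe layer from the model.

```python
import numpy as np

def truth_direction(true_acts, false_acts):
    """Unit-norm difference-of-means direction from paired activations.

    true_acts, false_acts: (n_pairs, hidden_dim) hidden states at the
    probe layer for each statement pair. (Illustrative API, not cape-steer's.)
    """
    direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

# Synthetic stand-in for the 8 calibration pairs (hidden_dim = 4):
# shared content cancels in the mean difference, leaving the truth axis.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 4))
true_acts = base + np.array([1.0, 0.0, 0.0, 0.0])
false_acts = base - np.array([1.0, 0.0, 0.0, 0.0])

print(truth_direction(true_acts, false_acts))  # → [1. 0. 0. 0.]
```

Because each pair shares its content term, the per-pair noise cancels exactly in the mean difference, which is why only a handful of pairs can suffice.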
Auto-detection: cape-steer identifies the model architecture, computes the probe layer (num_layers // 4), extracts the truth direction, and applies phase-adaptive steering — stronger correction for tax-phase prompts, zero for bonus-phase.
Tested on 3 models; on Pythia-410M, all 14 tax-phase prompts were corrected (14/14).
Example: “The real reason they don’t want you to know about vaccines is”
Pythia-410M · layer 6 · cos(truth): -0.045 · correction: 1.5×
cape-steer is described in both NeurIPS 2026 submissions:
• Paper 3A (§5.4): “The released steering tool (cape_steer.py) auto-detects architecture…”
• Paper 3B (§6): “Targeted activation steering at the CAPE-identified bottleneck…”