The Alignment Tax Is Not a Law of Nature

Adil Amin · May 2026 · ZEHEN Labs

There’s a moment in every model family’s life when something flips. Before that moment, getting smarter makes the model less honest. After it, getting smarter makes it more honest. We found that moment, measured it across 63 models from 16 families, and showed you can move it.

The Trade-off Everyone Assumes

Ask anyone in AI safety: “Does scaling make alignment harder?” Most will say yes. It’s become gospel — bigger models, bigger problems. The kind of thing everyone knows but nobody has actually checked.

So we checked. Across 63 base models in 16 families, we tracked how reasoning (HellaSwag) and truthfulness (TruthfulQA) relate as models get bigger. Not whether each improves on its own — whether they help or hurt each other. The answer surprised us.
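To make the check concrete, here is a minimal sketch of the core measurement: the correlation between two benchmarks across one family's size ladder. The scores below are invented for illustration; the real analysis fits a running coupling across all 63 models rather than a single Pearson r.

```python
# Illustrative only: invented scores standing in for one family's size ladder.
import numpy as np

hellaswag = np.array([28.9, 30.3, 31.7, 35.2, 40.5])   # reasoning, by size
truthfulqa = np.array([47.1, 45.8, 44.9, 42.6, 39.3])  # truthfulness, by size

r = np.corrcoef(hellaswag, truthfulqa)[0, 1]
print(f"r = {r:.3f}")  # strongly negative: reasoning gains cost truthfulness
```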

The Phase Transition

Think of water freezing. Above 0°C, molecules move freely. Below, they lock into crystal. The physics doesn’t gradually change — it flips at a sharp boundary.

AI capabilities do the same thing. Below a critical scale, reasoning and truthfulness are anti-correlated (r = −0.989 in Pythia). Train the model to reason better, and it gets less truthful. This is the alignment tax. It’s real. Every web-trained family shows it.

But above that critical scale, the sign flips. Capabilities cooperate. Better reasoning = better truthfulness. No trade-off. The alignment tax was a phase, not a law.

Figure: Running coupling γ₁₂ vs model scale. Each dot is a model; the coupling flips from negative (tax) to positive (bonus) as models cross Nc. Six families shown in different colors, same pattern. The zero line is the phase boundary.

Tax phase (below Nc): capabilities fight, γ₁₂ < 0.
Transition (at Nc): maximum leverage, γ₁₂ = 0.
Bonus phase (above Nc): capabilities cooperate, γ₁₂ > 0.
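The phase label follows directly from the sign of the coupling. A trivial sketch; the tolerance band eps is an illustrative choice, not a value from the paper:

```python
def phase(gamma12: float, eps: float = 0.05) -> str:
    """Label the alignment phase from the running coupling gamma_12."""
    if gamma12 < -eps:
        return "tax"         # capabilities fight
    if gamma12 > eps:
        return "bonus"       # capabilities cooperate
    return "transition"      # maximum leverage

print(phase(-0.989), phase(0.0), phase(0.83))
```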

It’s Not One Number

The critical scale Nc isn’t a universal constant. It’s a design parameter. OPT hits it at 0.12B. Pythia at 3.5B. Falcon at 7B. That’s a 60× range.

Even more interesting: curated models like Phi and Qwen3 bypass the tax entirely. Their Nc is effectively below the smallest model tested. Data curation doesn’t just improve quality — it moves the phase boundary. Phi at 1B achieves coupling characteristic of standard-trained 10B models.

Three levers shift Nc independently: data curation, model width, and architecture. Each is measurable. Each is actionable.

What This Looks Like at Frontier Scale

At frontier scale — 39 models from 10 labs — the early benchmarks have saturated. But SWE-bench and GPQA Diamond are the new axes, and they cooperate too: r = +0.72.

The h-field is the key diagnostic. It’s simply how far each model deviates from the cooperation trend. Positive h = reasoning-rich. Negative h = coding-rich. One number tells you a model’s training philosophy.

Average h-field by lab: Google +5.5, xAI +5.1, OpenAI +3.1, Meta +2.4, DeepSeek +1.9, MiniMax −2.3, Anthropic −6.9.

Why “h-field”? In physics, the external magnetic field h breaks symmetry. Here it’s the training recipe — the external force that pushes a model off the natural cooperation trend. Coding-heavy → h negative. Reasoning-heavy → h positive. It’s what the lab chose, not what the architecture wants.

Google consistently invests in reasoning (h stays positive across releases). Anthropic is coding-rich (h = −6.9 on average) — but this isn’t permanent. When Sonnet 4.6 went deep into a coding excursion (h = −13.1), Opus 4.6 recovered to h = +3.5 at the next release. Tax excursions are temporary. The same pattern shows up at OpenAI (GPT-5.4 dips, GPT-5.2 Pro recovers) and Google (Flash→Pro excursion then recovery).

Three labs, same physics

Coding-specialist releases create local tax excursions that recover at the next generation. The universality of this pattern across Anthropic, OpenAI, and Google — each with different architectures, data, and training recipes — is the strongest evidence that the coupling dynamics are fundamental, not lab-specific.

Worked example: predicting Opus 4.7

Opus 4.7 gets SWE = 87.6. The regression predicts GPQA = 0.513 × 87.6 + 46.4 = 91.3. Actual: 94.2. So h = +2.9 — reasoning-rich, recovering from the Sonnet 4.6 coding excursion.

Had GPQA been 82.0 instead, h = −9.3 — the excursion would have persisted. Two numbers, 30 seconds, and you know whether the lab’s latest release is continuing a recovery or starting a new excursion. That’s the diagnostic.
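In code, the diagnostic is one function. The slope and intercept are the regression values quoted in the worked example above; refit them as new frontier releases land.

```python
def h_field(swe: float, gpqa: float,
            slope: float = 0.513, intercept: float = 46.4) -> float:
    """Deviation from the SWE -> GPQA cooperation trend, in points."""
    return gpqa - (slope * swe + intercept)

print(round(h_field(87.6, 94.2), 1))  # Opus 4.7: +2.9, reasoning-rich
print(round(h_field(87.6, 82.0), 1))  # counterfactual: -9.3, excursion persists
```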

There’s another surprise buried in the data: compute shifts h without retraining. GPT-5.4 evaluated at two tiers (standard vs xhigh) shows Δh = +7.8pp — same weights, different reasoning emphasis. The diagnostic is sensitive even to inference-time compute.

The Cascade: It Keeps Repeating

Here’s what surprised us most. The transition doesn’t happen once. It repeats at every scale, with different benchmarks each time:

Nc1 (~0.1–7B): HS ↔ TQA coupling flips.
Nc2 (~30–72B): internal coupling crashes 59%; SWE ↔ GPQA activate.
Nc3 (~114B, predicted): SWE saturates; IFEval ↔ HLE activate.
Nc4 (~200B+, predicted): IFEval saturates; next axis TBD.

Each transition follows the same pattern: old axes lock, coupling restructures, new axes emerge.

At each level, the old benchmarks lock together (they stop discriminating), new ones emerge, and the whole tax-transition-bonus cycle starts fresh. Think of a child learning to walk — at first, balance and speed fight each other (the tax). Then they click, and speed helps balance (the bonus). Then the child starts running, and a new trade-off appears between speed and agility. Each level of mastery creates a new coupling that has to be resolved at the next level.

We measured this directly in OPT’s internal coupling: it rises from 0.514 (125M) to 0.876 (13B), then crashes to 0.356 at 30B — the same pattern as Nc1, repeating at Nc2. Same math. Different scale. Like harmonics of a vibrating string.

The Equation That Predicts Benchmarks

Here’s where it gets strange. The coupling isn’t just a pattern; it’s governed by a differential equation. We didn’t assume any physics. We ran sparse regression on the data and the equation fell out. Give it one starting point (a single model’s benchmark scores) and it predicts all 5 benchmarks across 8 model sizes. It then cross-predicts a held-out family at 5.6% MAE. The equation has the same form as the one that governs phase transitions in superconductors: the same math that describes how materials become superconducting also describes how AI models become cooperative.

From one initial condition (Pythia-70M’s scores), the ODE predicts all 5 benchmarks across 8 model sizes simultaneously — HellaSwag, TruthfulQA, ARC, WinoGrande, MMLU. You can try this yourself in the ODE Explorer tab on the dashboard with sliders for source terms (curation, width, architecture).
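For intuition about the discovery step, here is a minimal SINDy-style sketch: build a library of candidate terms and let sparse regression zero out whatever the data doesn't support. Everything below (the toy trajectory, the quadratic library, Lasso as the solver) is an illustrative assumption, not the paper's pipeline.

```python
# SINDy-style sketch on a toy trajectory, not the paper's data or solver.
import numpy as np
from sklearn.linear_model import Lasso

logN = np.linspace(17, 24, 8)           # log parameter count, 8 model sizes
B = 1 / (1 + np.exp(-(logN - 20)))      # toy sigmoid benchmark curve

dB = np.gradient(B, logN)               # finite-difference dB/dlogN

# Candidate terms: constant, B, B^2 (a Ginzburg-Landau-like basis).
library = np.column_stack([np.ones_like(B), B, B**2])

# Sparse regression keeps only the terms the data supports;
# for this toy curve it should recover roughly dB = B - B^2.
fit = Lasso(alpha=1e-3, fit_intercept=False).fit(library, dB)
print("coefficients [1, B, B^2]:", fit.coef_.round(2))
```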

Engineering: You Can Skip the Tax

The most practical finding: the alignment tax can be eliminated. Phi at 1B achieves coupling characteristic of standard-trained 10B models. Qwen3 at 1.7B has 100% cooperative heads where Qwen2.5 at 1.5B had 97% competing. One generation of curation erased the tax entirely.

(a) Architecture lever: Gemma-3 (4B) couples at 0.965; Gemma-4 (4B) with PLE drops to 0.871. PLE opens more capability axes but reduces cooperation, and RLHF restores it: trade coupling for dimensionality.
(b) Curation lever: Qwen2.5 (1.5B) couples at 0.025; Qwen3 (1.7B, curated) reaches 0.830. Same scale, different data; curation eliminated the tax entirely (Δ = +0.805).

What To Do With This

This isn’t just measurement. It’s actionable:

If you’re training below Nc: Don’t just scale. Curate. One unit of data quality ≈ 10× model scale in coupling improvement.

If you’re at Nc: You’re at the critical point. Small interventions have maximum leverage. This is where alignment ROI is highest.

If you’re deploying a frontier model: Compute the h-field from two public benchmark scores. It takes 30 seconds and tells you your model’s training bias. If |h| > 5, your model is a specialist — plan accordingly.

If you’re evaluating models: Watch the saturation ratio. When the top-5 models compress to <2pp spread on a benchmark, that benchmark is done. The next axis is already activating.
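The saturation check is a few lines; the 2pp threshold is the one quoted above, and the leaderboard scores here are placeholders.

```python
def is_saturated(scores: list[float], top_k: int = 5,
                 threshold_pp: float = 2.0) -> bool:
    """True when the top-k models compress below threshold_pp of spread."""
    top = sorted(scores, reverse=True)[:top_k]
    return (top[0] - top[-1]) < threshold_pp

# Placeholder scores, not real benchmark data.
print(is_saturated([89.1, 88.7, 88.4, 87.9, 87.6, 81.2]))  # True: done
```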

The Bottleneck You Can Fix

Where does the tax actually live inside a model? We looked at 40 models from 9 families. In 38 of them (95%), there are zero competing attention heads — internally, reasoning and truthfulness cooperate at every layer. The tax comes from the output projection — a narrow bottleneck that can’t express both capabilities simultaneously at the transition scale. The bottleneck is dimensional, not learned.

We built cape-steer to exploit this. It's an open-source CLI that auto-detects any model's architecture, finds the probe layer at quarter-depth (nl/4), and adds a truth-direction vector during the forward pass. The result: misaligned outputs become aligned, with zero retraining. We verified it on GPT-2, Pythia-160M, and Pythia-410M; on Pythia-410M, 14/14 tax-phase prompts were corrected.

Figure: Activation steering at the probe layer (quarter-depth, layer 6 = nl/4 for Pythia-410M). Without steering: "vaccines are dangerous because..." With steering: "vaccines are supported by evidence..." Pythia-410M: 14/14 tax-phase prompts corrected. Works on any open model via cape-steer.

The probe layer — always at quarter-depth (num_layers / 4) — is where the coupling bottleneck lives. At Pythia-1B (right at Nc), coupling drops 12% from hidden states to output. A wider projection recovers it. The bottleneck is dimensional, not learned — and steering at that exact layer corrects the output without retraining.
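To show the shape of the mechanism, here is a minimal steering sketch using a forward hook at quarter-depth. The truth direction here is a random placeholder and the strength is a guess; cape-steer derives the real vector from probe data, which this sketch does not reproduce.

```python
# Minimal activation-steering sketch. PLACEHOLDER direction and strength;
# not cape-steer's probe-derived vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/pythia-160m"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

probe_layer = model.config.num_hidden_layers // 4    # quarter-depth, nl/4
direction = torch.randn(model.config.hidden_size)    # placeholder, not a
direction = direction / direction.norm()             # real truth direction

def steer(module, inputs, output):
    # GPT-NeoX layers return a tuple; hidden states are element 0.
    return (output[0] + 4.0 * direction,) + output[1:]

handle = model.gpt_neox.layers[probe_layer].register_forward_hook(steer)
ids = tok("Vaccines are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```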

Seven Bets on the Table

We made 7 falsifiable predictions with timestamped deadlines. If we’re wrong, the framework breaks publicly. Three are already confirmed:

1. OLMo at γ₁₂ = 0.000 exactly (confirmed independently by AI2)
2. ODE cross-predicts Llama-2 at 5.6% MAE (2.6× better than polynomial)
3. Qwen3 cooperative at all scales (curation eliminated the tax)

The four remaining predictions test frontier dynamics: SWE saturation by Dec 2026, IFEval activation, lab trajectory persistence, and the Nc4 cascade. The dashboard tracks these live.

Try It

The CAPE Dashboard diagnoses any model’s alignment phase from benchmark scores. cape-steer corrects misaligned outputs at the activation level. Both are open. Enter two numbers, see where your model sits.

Papers: “Lying Is Just a Phase” (Paper 3A) and “The Growing Pains of Frontier Models” (Paper 3B) — NeurIPS 2026. Code and data at github.com/adilamin89/cape-scaling.

Contact: [email protected]