AI Safety & Alignment
A pretrained foundation model has no particular reason to be helpful, honest, or harmless. Alignment is the set of techniques that turn it into an assistant you can ship — RLHF, DPO, Constitutional AI, red-teaming, interpretability. The fastest-moving and most philosophically fraught subfield on this site.
1. The alignment problem
A freshly pretrained LLM predicts the next token given the previous ones. It has no goals, no values, no concept of helpfulness. If you prompt GPT-3 with "How do I make a Molotov cocktail?" it will happily continue, because the training corpus includes texts that continue that way. It's not malicious; it's just predicting.
This is the alignment problem, in miniature: the training objective (predict next token) is not the same as the deployment objective (be a helpful, honest, harmless assistant). Everything we call "alignment" is a family of techniques for closing this gap — specifying what you want, measuring whether you got it, and pushing the model toward it.
A working definition: an aligned model reliably does what its users actually want, refuses what it shouldn't do, and accurately communicates its own uncertainty. A safe model doesn't cause serious harm even when adversarially prompted. These are not the same; a model can be aligned to its developers and still unsafe, or safe in the common case but misaligned on edge cases. Both sets of problems are active research.
2. Reinforcement learning from human feedback
RLHF (Christiano et al. 2017, Ouyang et al. 2022) was the breakthrough that made ChatGPT possible. Three stages:
- SFT. Fine-tune the base model on a few thousand human-written (prompt, good-response) pairs. This gives you a model that at least understands the instruction-following format.
- Reward model. Collect pairs of model responses to the same prompt, have humans rank them, and train a reward model $r_\phi$ to predict which is preferred.
- RL. Use the reward model as the reward signal in an RL loop. Optimize the policy (the LLM) to maximize $r_\phi$ while staying close to the SFT model (so it doesn't unlearn everything).
The reward model is trained on a preference loss (Bradley-Terry, 1952). Given a pair $(y_w, y_l)$ where $y_w$ is preferred to $y_l$:
Preference loss
$$\mathcal{L}_\text{RM}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
- $x$: the prompt.
- $y_w, y_l$: two responses to the same prompt. $w$ = "winner" (preferred), $l$ = "loser." Humans labeled which was better.
- $r_\phi(x, y)$: the reward model's scalar output, a learned function (typically a fine-tuned LLM with a scalar head) that scores how good $y$ is as a response to $x$.
- $\sigma$: the sigmoid, $\sigma(z) = 1/(1 + e^{-z})$.
What this does. The Bradley-Terry model says the probability that human labelers prefer $y_w$ to $y_l$ is $\sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$. Training maximizes the likelihood of the observed preferences, i.e. minimizes the negative log of this probability. The effect: the reward model learns to give higher scores to winners than to losers, with a margin that reflects how consistently humans preferred one over the other.
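As a quick numeric sanity check of the loss (the scores here are made up for illustration): if the reward model gives the winner 1.4 and the loser 0.2, Bradley-Terry predicts the winner is preferred with probability $\sigma(1.2) \approx 0.77$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical reward-model scores for one (winner, loser) pair.
r_w, r_l = 1.4, 0.2

# Bradley-Terry: P(winner preferred) = sigmoid(r_w - r_l).
p_prefer = sigmoid(r_w - r_l)

# The per-pair training loss is -log of that probability.
loss = -math.log(p_prefer)

print(round(p_prefer, 3), round(loss, 3))  # 0.769 0.263
```

Note the loss only ever sees the *difference* of the two scores: shifting both rewards by a constant changes nothing, which is why reward-model scores are only meaningful relative to each other.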
3. PPO (the RL step)
With a reward model in hand, the standard RL algorithm is PPO (Schulman et al. 2017). The objective is:
PPO with KL penalty
$$\max_\theta \;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\text{KL}\big(\pi_\theta \,\|\, \pi_\text{SFT}\big)$$
- $\pi_\theta$: the current policy, i.e. the LLM being trained. $\theta$ are its parameters.
- $\pi_\text{SFT}$: the SFT-stage model, frozen. A reference point.
- $r_\phi(x, y)$: the reward model's score. The thing we want to maximize.
- $\text{KL}(\pi_\theta \| \pi_\text{SFT})$: Kullback-Leibler divergence, measuring how much $\pi_\theta$ has drifted from $\pi_\text{SFT}$. Penalized so the model doesn't forget its pretraining or over-exploit reward-model quirks.
- $\beta$: coefficient balancing "maximize reward" against "stay close to SFT." Typically 0.01–0.1. Smaller $\beta$ → more aggressive optimization and more reward hacking risk.
Why the KL term matters. Without it, the model will find ways to game the reward model: output gibberish that happens to score high, repeat certain phrases the reward model liked, and so on. This is called reward hacking, and it's the central failure mode of RLHF. The KL penalty tethers the model to its pretrained behavior, limiting how far it can wander to chase reward.
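In implementations the KL penalty is usually folded into the per-token reward before the PPO update. A minimal sketch of that reward shaping, with hand-picked log-probs standing in for real model outputs (the function name and all values are illustrative):

```python
def shaped_rewards(rm_score, logp_policy, logp_sft, beta=0.05):
    """Fold the KL penalty into per-token rewards, PPO-style.

    rm_score: scalar reward-model score for the whole response.
    logp_policy / logp_sft: per-token log-probs of the sampled tokens
    under the policy and the frozen SFT model (illustrative numbers).
    """
    # Per-token KL estimate: log pi_theta(t) - log pi_sft(t).
    kl = [lp - ls for lp, ls in zip(logp_policy, logp_sft)]
    # Every token pays for drift from the SFT model...
    rewards = [-beta * k for k in kl]
    # ...and the reward-model score lands on the final token.
    rewards[-1] += rm_score
    return rewards

r = shaped_rewards(rm_score=0.8,
                   logp_policy=[-1.0, -0.5, -2.0],
                   logp_sft=[-1.2, -0.9, -1.5])
print([round(x, 3) for x in r])  # [-0.01, -0.02, 0.825]
```

The third token is *more* likely under the SFT model than the policy, so its KL term is negative and becomes a small bonus rather than a penalty.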
PPO works, but it's a nightmare to implement — you need a reference model, a policy, a value network, and a reward model all in memory at the same time, plus careful clipping to keep updates small. Which is why 2023 brought...
4. Direct Preference Optimization
DPO (Rafailov et al. 2023) asked a clever question: if you know the analytic solution to the PPO objective, can you skip PPO entirely and train the policy directly from preference pairs? The answer turns out to be yes. Given the same preference pair $(y_w, y_l)$ and reference model $\pi_\text{ref}$:
DPO loss
$$\mathcal{L}_\text{DPO}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)}\right)\right]$$
- $\pi_\theta$: the policy being trained.
- $\pi_\text{ref}$: a frozen reference policy, typically the SFT model. Same role as $\pi_\text{SFT}$ in PPO.
- $\log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$: the log-ratio of how much more probability $\pi_\theta$ assigns to $y$ than $\pi_\text{ref}$ does. A per-sequence quantity: the sum of per-token log-ratios.
- $\beta$: the same $\beta$ as in PPO; controls how far the policy can drift from the reference. Usually 0.1.
- $\sigma$: sigmoid. The loss is structurally the reward-model loss, with the implicit reward $\beta \log \frac{\pi_\theta}{\pi_\text{ref}}$ substituted for the explicit $r_\phi$.
The insight. Rafailov et al. showed that the optimal policy of the PPO objective has a closed form $\pi^*(y|x) \propto \pi_\text{ref}(y|x) \exp(r_\phi(x, y)/\beta)$. Rearranging lets you express the reward in terms of the policy: $r_\phi(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \text{const}$. Plug that back into the Bradley-Terry preference loss and the constant cancels, leaving DPO: no reward model, no PPO, just a supervised learning objective that trains directly on preference pairs.
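Spelled out, with the normalizing partition function $Z(x)$ made explicit (it is the "const" in the rearranged reward):

```latex
% Closed-form optimum of the KL-regularized objective:
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_\text{ref}(y \mid x)\, \exp\!\big(r_\phi(x, y)/\beta\big)

% Solve for the reward:
r_\phi(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_\text{ref}(y \mid x)} \;+\; \beta \log Z(x)

% Substituted into the Bradley-Terry loss, the \beta \log Z(x) terms cancel
% because y_w and y_l share the same prompt x, leaving a loss that depends
% only on the policy log-ratios.
```

The cancellation is the whole trick: $Z(x)$ is intractable to compute, but since it depends only on the prompt, a pairwise loss never needs it.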
DPO is dramatically easier to implement than PPO — it looks like a regular supervised cross-entropy loss — and on most benchmarks matches or exceeds PPO results. By 2024 it's the default for open-source alignment (Llama-3, Mistral, Gemma all use DPO or variants like IPO, KTO, ORPO).
5. Interactive: how preference optimization shifts a distribution
A toy 1-D policy $\pi_\theta(y|x)$. We have one "good" response location $y_w$ and one "bad" one $y_l$. Drag the $\beta$ slider to see how DPO pushes probability toward $y_w$ and away from $y_l$ — and how a very small $\beta$ lets it wander too far from the reference (over-optimization).
Blue: reference policy π_ref. Orange: DPO-optimized policy. Large β keeps it close to the reference; small β can push it arbitrarily far in the preferred direction.
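A non-interactive sketch of the same toy (grid size, positions, learning rate, and step count are all invented for illustration): a categorical policy over discrete positions, a uniform reference, and a hand-rolled DPO gradient loop on one preference pair.

```python
import math

# Toy: a categorical "policy" over 11 positions; reference = uniform.
N = 11
y_w, y_l = 7, 3                  # preferred / dispreferred locations
logits = [0.0] * N               # start exactly at the reference
ref_logp = math.log(1.0 / N)     # uniform reference log-prob

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

beta, lr = 0.1, 0.5
for _ in range(500):
    p = softmax(logits)
    # DPO implicit-reward margin for the single preference pair.
    margin = (beta * (math.log(p[y_w]) - ref_logp)
              - beta * (math.log(p[y_l]) - ref_logp))
    # d(-logsigmoid(margin))/dmargin = sigmoid(margin) - 1, always in (-1, 0).
    g = 1.0 / (1.0 + math.exp(-margin)) - 1.0
    # The softmax terms cancel in the pairwise margin, so only the two
    # chosen logits receive gradient: d margin / d z_k = beta*(1[k=y_w] - 1[k=y_l]).
    logits[y_w] -= lr * g * beta
    logits[y_l] += lr * g * beta

p = softmax(logits)
print(p[y_w] > 1 / N > p[y_l])   # True: mass moved toward y_w, away from y_l
```

The gradient shrinks as the margin grows ($\sigma(\text{margin}) \to 1$), and $\beta$ scales the margin, so a small $\beta$ keeps the loss unsaturated and lets the logits keep drifting, the over-optimization regime the slider illustrates.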
6. Constitutional AI
Constitutional AI (Bai et al., Anthropic 2022) replaces the human labelers in RLHF with an AI labeler guided by a written set of principles — a "constitution." The flow:
- Write a constitution — a list of principles like "be harmless," "prefer fewer false claims," "refuse to help with crime." A few pages, human-written.
- Have an LLM generate a response to a prompt, then have another LLM critique the response against the constitution and rewrite it. Repeat.
- Use the resulting (prompt, revised response) pairs as training data (CAI-SFT stage).
- Then for the RL stage, have the AI label preference pairs in place of humans, using the constitution as the evaluation rubric.
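The critique-revise loop above can be sketched as follows. `generate` is a stub standing in for a real LLM call, and where Bai et al. sample a random principle each round, this sketch just walks the first few for brevity:

```python
CONSTITUTION = [
    "Be harmless.",
    "Prefer fewer false claims.",
    "Refuse to help with crime.",
]

def generate(prompt):
    # Stub for an LLM call; replace with your model's API.
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt, response, n_rounds=2):
    """Constitutional-AI-style self-revision loop (structure only)."""
    for principle in CONSTITUTION[:n_rounds]:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response

# The (prompt, revised response) pairs become the CAI-SFT training data.
prompt = "How do I pick a lock?"
pair = (prompt, critique_and_revise(prompt, generate(prompt)))
```

The RL stage reuses the same machinery: instead of revising, the labeler model is asked which of two responses better satisfies a principle, and those AI-generated preferences feed the usual reward-model or DPO pipeline.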
Key benefit: you can scale to millions of preference labels at near-zero marginal cost, and you can update the policy by updating the constitution rather than retraining from scratch. Claude models have been trained this way since Claude 1. By 2024 the approach had split into two camps — RLAIF (any AI-rater-based RL) and Constitutional-style methods that explicitly center the principles.
Important caveat: CAI doesn't make human labelers redundant; it relocates them. They now write the constitution and review AI-labeled failures rather than labeling every pair directly. The quality of the constitution becomes the bottleneck.
7. Interpretability
RLHF shapes behavior. Interpretability tries to understand why a model behaves the way it does — what circuits, features, or concepts are encoded in its weights. This matters for alignment for two reasons: you can't reliably align what you can't inspect, and you can't reliably detect deception without some way to look inside.
The 2024 breakthrough was sparse autoencoders (SAEs). Train a single-hidden-layer autoencoder on a model's internal activations, with a strong sparsity penalty so each input fires only a small number of hidden units:
Sparse autoencoder
$$\mathcal{L} = \|\mathbf{h} - \hat{\mathbf{h}}\|_2^2 + \lambda \|\mathbf{f}\|_1, \qquad \mathbf{f} = \text{ReLU}(W_\text{enc} \mathbf{h} + b_\text{enc}), \quad \hat{\mathbf{h}} = W_\text{dec} \mathbf{f}$$
- $\mathbf{h}$: the original activation vector at some layer of a transformer, say the residual stream at layer 12.
- $\mathbf{f} = \text{ReLU}(W_\text{enc} \mathbf{h} + b_\text{enc})$: the sparse feature activations. The encoder weights $W_\text{enc}$ project $\mathbf{h}$ into a much higher-dimensional space (e.g., 16× bigger), and the L1 penalty forces most entries to zero.
- $\hat{\mathbf{h}} = W_\text{dec} \mathbf{f}$: the reconstruction, mapping the sparse code back to the original space.
- $\|\mathbf{h} - \hat{\mathbf{h}}\|_2^2$: reconstruction loss; the SAE must faithfully represent the original activation.
- $\lambda \|\mathbf{f}\|_1$: sparsity penalty; forces most feature dimensions to stay at zero.
Why this matters. Activations in a dense transformer are superposed: many concepts share the same neurons. The SAE separates them into roughly one concept per feature. Anthropic's 2024 work found features for things like "Golden Gate Bridge," "inner conflict," "code with security bugs," and "Arabic script." You can read off what the model is representing, and, more importantly for safety, you can edit features directly and see how behavior changes.
Other 2024 interpretability tools: activation patching (replace one forward-pass activation with another and see what changes), logit lens (what token distribution does each residual stream predict), and probing classifiers (train a linear classifier on activations to see what information they encode). None of these is a full solution, but together they let you poke at specific hypotheses about how the model works.
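A probing classifier is the simplest of these to show in code. Here is a sketch on synthetic "activations" (a real probe would collect them from a model's forward pass; the 64-dim vectors and the planted dimension are invented): train a linear probe on a binary property and check it beats chance.

```python
import math, random

random.seed(0)

# Synthetic "activations": 64-dim vectors where dimension 5 secretly
# encodes a binary property (stand-in for e.g. "token is inside a quote").
def make_example():
    h = [random.gauss(0.0, 1.0) for _ in range(64)]
    return h, 1 if h[5] > 0 else 0

train = [make_example() for _ in range(500)]
test = [make_example() for _ in range(200)]

# Linear probe: logistic regression on the activations, plain SGD.
w, b, lr = [0.0] * 64, 0.0, 0.1
for _ in range(50):
    for h, y in train:
        z = sum(wi * hi for wi, hi in zip(w, h)) + b
        z = max(-30.0, min(30.0, z))      # clamp to avoid exp overflow
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y                          # gradient of log loss w.r.t. z
        for i in range(64):
            w[i] -= lr * g * h[i]
        b -= lr * g

acc = sum(
    (sum(wi * hi for wi, hi in zip(w, h)) + b > 0) == (y == 1)
    for h, y in test
) / len(test)
print(round(acc, 2))   # well above the 0.5 chance level
```

High probe accuracy shows the information is *linearly decodable* at that layer, a weaker claim than "the model uses it," which is exactly the gap activation patching is meant to close.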
8. Source code
DPO loss, reward-model loss, and a minimal SAE.
import torch, torch.nn.functional as F

def dpo_loss(policy, ref, prompts, y_w, y_l, beta=0.1):
    # Compute log-probabilities of each response under policy and reference.
    logp_policy_w = sequence_logprob(policy, prompts, y_w)
    logp_policy_l = sequence_logprob(policy, prompts, y_l)
    with torch.no_grad():
        logp_ref_w = sequence_logprob(ref, prompts, y_w)
        logp_ref_l = sequence_logprob(ref, prompts, y_l)
    # Implicit rewards: log-ratios of policy to reference.
    r_w = beta * (logp_policy_w - logp_ref_w)
    r_l = beta * (logp_policy_l - logp_ref_l)
    # Bradley-Terry preference loss.
    return -F.logsigmoid(r_w - r_l).mean()

def sequence_logprob(model, prompt_ids, response_ids):
    # Assumes unpadded (B, T) id tensors; mask padding in real use.
    ids = torch.cat([prompt_ids, response_ids], dim=-1)
    logits = model(ids).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    # Only sum over response tokens.
    start = prompt_ids.size(-1) - 1
    per_tok = logp[:, start:].gather(2, response_ids[:, :, None]).squeeze(-1)
    return per_tok.sum(-1)
import torch, torch.nn as nn, torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, base_llm, d_model):
        super().__init__()
        self.base = base_llm               # pretrained LM backbone
        self.head = nn.Linear(d_model, 1)  # scalar reward head

    def forward(self, ids):
        h = self.base(ids).last_hidden_state  # (B, T, d)
        # Reward = scalar head on the final token's representation
        # (assumes no right-padding; use the last non-pad token in real use).
        return self.head(h[:, -1]).squeeze(-1)  # (B,)

def preference_loss(rm, ids_w, ids_l):
    # Bradley-Terry loss: -log σ(r_w - r_l)
    r_w = rm(ids_w)
    r_l = rm(ids_l)
    return -F.logsigmoid(r_w - r_l).mean()

# Training loop runs on batches of (prompt, chosen, rejected) triples.
import torch, torch.nn as nn, torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=4096, n_features=65536):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)

    def encode(self, h):
        return F.relu(h @ self.W_enc + self.b_enc)  # (B, n_features)

    def decode(self, f):
        return f @ self.W_dec  # (B, d_model)

    def forward(self, h):
        f = self.encode(h)
        h_hat = self.decode(f)
        return h_hat, f

def sae_loss(h, h_hat, f, lam=5.0):
    recon = F.mse_loss(h_hat, h)
    sparsity = f.abs().mean()  # L1 penalty on feature activations
    return recon + lam * sparsity

# To interpret a feature:
# 1. Run many inputs through the base model and the SAE.
# 2. For feature k, find the inputs that activate f[k] most strongly.
# 3. Read them and guess what concept they share. (This is how Anthropic
#    found "Golden Gate Bridge" and thousands of others in Claude 3 Sonnet.)
9. Summary
- A pretrained model has no built-in values. Alignment is the family of techniques that turn it into a helpful, honest, harmless assistant.
- RLHF (2022) was the breakthrough: train a reward model from human preferences, then use PPO to optimize the policy against it with a KL-to-SFT penalty.
- DPO (2023) rederived RLHF as a supervised preference-pair loss, removing the reward model and PPO entirely. Now the default for open-source.
- Constitutional AI (2022) uses an AI labeler guided by a written constitution to scale past human-label bandwidth. The constitution becomes the interface.
- Sparse autoencoders (2024) gave interpretability a working tool — untangle superposed activations into monosemantic features you can read off and edit.
- None of these is a complete solution. Red-teaming, evals, adversarial robustness, deception detection, and model organism studies fill out the rest of the stack. Expect rapid change.
Further reading
- Christiano et al. (2017) — Deep Reinforcement Learning from Human Preferences.
- Ouyang et al. (2022) — Training Language Models to Follow Instructions with Human Feedback (InstructGPT).
- Bai et al. (2022) — Constitutional AI: Harmlessness from AI Feedback.
- Rafailov et al. (2023) — Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- Templeton et al. (2024) — Scaling Monosemanticity (Anthropic's SAE interpretability paper).