Reasoning & Test-Time Compute

In late 2024 and early 2025, OpenAI's o1 and DeepSeek's R1 cracked open a door that everyone assumed was sealed: you can make a model smarter at inference time by letting it think longer. This is the story of how chain-of-thought went from a prompting trick to a full training regime — and why "thinking time" became a new axis of the scaling law.

Prereq: LLMs, RL basics · Time to read: ~20 min · Interactive figures: 1 · Code: PyTorch, NumPy

1. A new axis of scaling

Since 2020, the scaling story of LLMs has been about two axes: parameters and training tokens. Double either one and loss drops predictably, per the Kaplan and Chinchilla laws. A third axis — test-time compute — was always there, but hardly anyone spent it, because the return on letting a model "think longer" at inference seemed flat. You got the same answer no matter how many tokens of padding you gave it.

Then in late 2024, OpenAI released o1, and a few months later DeepSeek published R1. Both showed something genuinely new: a model trained to produce long internal chains of thought before answering can, by spending 10× or 100× the inference compute, match a frontier model that has 10× the parameters. The scaling curve of "pass@1 accuracy vs. log(inference tokens per problem)" is roughly linear, and it doesn't saturate in any regime anyone's tested.

The shift is fundamental. The traditional log-linear law was:

$$L(N, D) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where $L$ is loss, $N$ is parameters, $D$ is training tokens. The new regime adds a third term:

$$L(N, D, C_\text{test}) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta} + \frac{C}{C_\text{test}^\gamma}$$

Scaling axes

$L$
The model's loss (or a proxy for it, such as error rate on a reasoning benchmark like AIME or GPQA).
$N$
Parameter count. The classical axis: bigger model, lower loss.
$D$
Training tokens processed. The other classical axis.
$C_\text{test}$
Test-time compute budget — literally, how many tokens the model generates while thinking before giving its final answer. The new axis.
$\alpha, \beta, \gamma$
Empirical exponents, all positive and all small (~0.05–0.15). Each axis gives log-linear return.
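The three-term law can be played with numerically. All constants below are made up for illustration; in practice they are fit to empirical loss curves:

```python
L_INF = 1.5                       # irreducible loss floor
A, B, C = 400.0, 4000.0, 8.0      # hypothetical axis coefficients
ALPHA, BETA, GAMMA = 0.08, 0.10, 0.10

def loss(n_params, train_tokens, test_tokens):
    # Each axis contributes a power-law term with diminishing returns.
    return (L_INF
            + A / n_params ** ALPHA
            + B / train_tokens ** BETA
            + C / test_tokens ** GAMMA)

# A 7B model thinking for 10,000 tokens vs. 100 tokens: loss drops
# without touching parameters or training data.
base = loss(7e9, 2e12, 1e2)
long_think = loss(7e9, 2e12, 1e4)
```

The qualitative behavior to notice: multiplying any one axis by a constant factor subtracts a roughly constant amount of loss, which is the log-linear return described above.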

Analogy: A chess engine at 1 second per move is weaker than the same engine at 1 minute per move. That's a test-time compute scaling law — and everyone has taken it for granted in chess for 60 years. The 2024 reasoning models are the first LLMs where this curve is similarly steep.

Parameters are expensive to add (you have to retrain from scratch). Tokens are expensive (you have to mine the internet). Test-time compute is elastic: you pay for it only when you need to. That economic asymmetry is what makes the reasoning paradigm so interesting.

2. Chain of thought

The foundation of all reasoning techniques is chain-of-thought (CoT) — rather than asking the model for an answer directly, you ask it to show its work. Wei et al. (2022) showed that simply prefixing a prompt with "Let's think step by step" produced large gains on math word problems. The mechanism is roughly:

  1. To produce the next token, a transformer performs exactly one forward pass, whose sequential compute is fixed by its depth, no matter how hard the token is.
  2. By producing many tokens of intermediate reasoning before the final answer, the model effectively runs many forward passes, each with growing context — a form of unrolled depth.
  3. Each intermediate token can be conditioned on all previous ones, which lets the model break a hard problem into small steps it can handle.
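The unrolled-depth intuition in step 2 can be made concrete with a toy calculation (this counts sequential layer applications, not FLOPs):

```python
def effective_depth(n_layers, cot_tokens):
    # Each generated token triggers one forward pass through n_layers
    # sequential layers, so a chain of thought "unrolls" the network
    # in time before the final answer token is produced.
    return n_layers * (cot_tokens + 1)  # +1 for the answer token itself

# A 32-layer model emitting a 500-token chain performs 16,032 sequential
# layer applications before committing to an answer; the same model
# answering directly performs only 32.
```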

CoT as a prompting trick was 2022. CoT as a training objective is the 2024 development. Modern reasoning models aren't just prompted to think; their weights are shaped — via supervised fine-tuning on long reasoning traces, then RL — to produce reasoning that actually helps, not just verbose filler.

3. Best-of-N and self-consistency

The simplest way to spend more compute at inference is to generate $N$ candidate solutions and pick the best. Two flavors: best-of-$N$, which relies on an external judge (a verifier or reward model) to select among candidates, and self-consistency, which needs no judge at all. With an oracle verifier, best-of-$N$ succeeds if any candidate is correct, which is exactly the pass@$N$ metric:

$$\text{pass@}N = \max_{i=1}^{N} \; \text{correct}(y_i)$$
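A minimal best-of-$N$ sketch, assuming a hypothetical `model.sample(prompt, temperature)` API and a `verifier` callable that scores a candidate (a test runner for code, an answer checker for math):

```python
def best_of_n(model, verifier, prompt, n=16, temperature=0.8):
    # Sample n independent candidates at nonzero temperature so they
    # differ, then keep the one the verifier scores highest.
    candidates = [model.sample(prompt, temperature=temperature) for _ in range(n)]
    scores = [verifier(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```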

In self-consistency (Wang et al. 2022), you don't need an external judge — you just take the majority-vote final answer across $N$ independently sampled chains of thought:

$$\hat{y} = \underset{y}{\arg\max} \sum_{i=1}^{N} \mathbb{1}[\text{final}(y_i) = y]$$

Self-consistency variables

$N$
Number of independent rollouts. Each one starts from the same prompt but uses different random samples at each step.
$y_i$
The $i$-th complete reasoning trace plus its final answer. These are sampled at a nonzero temperature so they diverge.
$\text{final}(y_i)$
The extracted final answer from trace $i$ — typically the number or multiple-choice letter at the end.
$\mathbb{1}[\cdot]$
Indicator function: 1 if the condition is true, 0 otherwise. The sum counts how many traces agree.
$\hat{y}$
The final consensus answer — the one that the largest number of independent traces ended up at.

Analogy: Ask 20 people to solve the same problem independently and take the answer most of them agree on. It works because wrong answers tend to be idiosyncratic (there are many ways to be wrong, only one way to be right), so correct answers cluster while wrong answers scatter. On GSM8K, self-consistency with $N=40$ gives a ~15 point accuracy bump over $N=1$: no extra training, just more inference compute.

4. Interactive: compute budget vs. accuracy

Below is a simulator of a reasoning model solving a benchmark with $N$ independent CoT rollouts and majority voting. Each problem has a "difficulty" — a per-rollout probability of getting it right. Drag the slider to change $N$ and watch the accuracy curve respond. Real reasoning models show exactly this shape: log-linear gains with budget, flattening once you're well past the per-rollout accuracy threshold.

[Interactive figure: slider setting N, the number of rollouts per problem]

Log-linear return: each doubling of N gives a roughly constant accuracy bump until the per-rollout accuracy saturates.
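The figure's shape can be reproduced offline with a quick NumPy simulation, under the assumption that each problem has a fixed per-rollout accuracy `p` and that wrong rollouts scatter uniformly over a few distractor answers:

```python
import numpy as np

def majority_vote_accuracy(p, n, n_problems=20000, n_wrong=3, seed=0):
    # Each rollout is correct with probability p; wrong rollouts pick one
    # of n_wrong distractors uniformly. Answer 0 denotes "correct".
    rng = np.random.default_rng(seed)
    correct = rng.random((n_problems, n)) < p
    wrong_choice = rng.integers(1, n_wrong + 1, size=(n_problems, n))
    answers = np.where(correct, 0, wrong_choice)
    wins = 0
    for row in answers:
        vals, counts = np.unique(row, return_counts=True)
        wins += vals[np.argmax(counts)] == 0  # ties break toward answer 0
    return wins / n_problems
```

With `p = 0.4` and three distractors, accuracy climbs from ~0.4 at N = 1 toward 1.0 as N grows: the log-linear-then-saturating curve the slider shows.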

5. Process reward models

Self-consistency treats a reasoning chain as a black box and only judges the final answer. But every step in the chain can be right or wrong, and a single wrong step contaminates everything after it. A process reward model (PRM) is trained to score partial reasoning traces one step at a time:

$$r_\phi(s_{1:t}) \in [0, 1]$$

Process reward model

$s_{1:t}$
The first $t$ reasoning steps so far. For each step, the PRM returns a probability that this step is correct given what came before.
$r_\phi$
The PRM itself — typically a fine-tuned language model with a scalar-output head. $\phi$ are its parameters. Trained on human- or model-labeled step-correctness data (OpenAI's PRM800K dataset is the reference).
$[0, 1]$
The output range. You can interpret it as the probability the step is valid math or valid logic.

Why it matters: A PRM lets you do tree search over the space of reasoning. At each step, expand multiple candidates, score each with the PRM, keep the best, expand again. This is "step-level beam search" — closer to how a chess engine searches forward than how a language model samples. It's widely speculated to be part of how o1 spends its inference compute, though OpenAI hasn't said; notably, the DeepSeek-R1 report lists PRM-guided search among its unsuccessful attempts.

Combined with beam search or MCTS, a PRM gives you a test-time compute curve that scales much better than plain best-of-N: instead of independently generating $N$ full traces and picking one, you're guided at every step, so wrong branches are pruned early and compute isn't wasted on dead-end reasoning.

6. RL on reasoning

The real leap of o1/R1 is that CoT generation is no longer just prompted — it's RL-trained. Given a problem and a verifier (a Python executor for code, a symbolic math checker for math, or another model for open-ended tasks), you:

  1. Sample a reasoning trace from the current model.
  2. Check if the final answer is correct.
  3. Reward the whole trace if correct, zero if not (or use a PRM for per-step rewards).
  4. Update the model to make successful traces more likely.

Mathematically, you're optimizing a standard policy gradient objective:

$$\mathcal{L}_\text{RL}(\theta) = -\mathbb{E}_{y \sim \pi_\theta(\cdot | x)} \big[ R(x, y) \big]$$

Reasoning RL loss

$\theta$
The reasoning model's parameters.
$\pi_\theta(y | x)$
The policy: the probability the model assigns to generating trace $y$ given prompt $x$. For a language model, this is just the product of per-token probabilities.
$R(x, y)$
The reward — typically 1 if the final answer in $y$ is correct, 0 otherwise. For math and code, this is automatable (run the code, check the number). For open-ended tasks you need a reward model.
$\mathbb{E}$
Expectation over model samples. In practice, you sample a batch of traces and average.
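In its simplest (REINFORCE) form, with a mean-reward baseline to reduce variance, the gradient step looks like this; shapes and numbers are illustrative:

```python
import torch

def reinforce_loss(seq_logprobs, rewards):
    # seq_logprobs: (G,) summed per-token log-probs of each sampled trace
    # rewards:      (G,) verifier output, e.g. 1.0 correct / 0.0 wrong
    baseline = rewards.mean()          # variance-reduction baseline
    advantages = rewards - baseline
    return -(advantages * seq_logprobs).mean()

# Three sampled traces: two verified correct, one wrong.
logp = torch.tensor([-40.0, -55.0, -38.0], requires_grad=True)
loss = reinforce_loss(logp, torch.tensor([1.0, 0.0, 1.0]))
loss.backward()
# logp.grad is negative for rewarded traces: a descent step increases
# their log-probability and decreases it for the unrewarded one.
```

GRPO refines exactly this idea with group-normalized advantages and PPO-style clipping.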

The DeepSeek-R1 recipe (Jan 2025): They showed (with the R1-Zero variant) that you can skip the supervised warmup entirely. Start from a base model, apply GRPO (a PPO variant) with a verifiable reward on math and code, and the model learns to produce long reasoning chains on its own — including behaviors like self-doubt and backtracking — purely from reward pressure. This was the "AlphaZero moment" of language-model reasoning: an RL objective finding a qualitative capability no one programmed in.

Key variants in 2024–25 differ mainly in the policy-gradient machinery: GRPO (DeepSeek) replaces PPO's learned value baseline with a group-normalized reward, while REINFORCE-style methods such as RLOO use a leave-one-out baseline over the same verifiable rewards.

7. Source code

A tiny self-consistency evaluation loop, plus a one-screen GRPO advantage computation.

reasoning at test time
import collections, re

def self_consistency(model, prompt, N=32, temperature=0.7):
    # Sample N independent chains of thought.
    traces = [model.sample(prompt, temperature=temperature) for _ in range(N)]

    # Extract final answers (here: last number in each trace).
    answers = []
    for t in traces:
        nums = re.findall(r"-?\d+(?:\.\d+)?", t)
        answers.append(nums[-1] if nums else "")  # guard traces with no number

    # Majority vote.
    counts = collections.Counter(answers)
    top, votes = counts.most_common(1)[0]
    return top, votes / N                 # answer, agreement
import torch

def grpo_advantages(rewards):
    # rewards: (G, ) — G rollouts from the same prompt
    # GRPO: normalize rewards within the group, no value net needed.
    mean = rewards.mean()
    std  = rewards.std().clamp(min=1e-6)
    return (rewards - mean) / std

def grpo_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    # PPO-style clipped objective on sequence-level log-probs, all shape (G,):
    # one advantage per rollout (token-level variants broadcast it per token).
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped   = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Usage sketch, per training step:
#   1. Sample G trajectories per prompt at temperature > 0
#   2. Score each with a verifier (+1 correct, 0 wrong)
#   3. advantages = grpo_advantages(rewards)
#   4. loss = grpo_loss(logp_new, logp_old.detach(), advantages)
#   5. loss.backward(); optim.step()
import torch

def prm_score_trace(prm, tokenizer, steps):
    # steps: list of strings, one per reasoning step
    # Returns the PRM's per-step correctness scores.
    scores = []
    prefix = ""
    for step in steps:
        prefix += step + "\n"
        ids = tokenizer(prefix, return_tensors="pt").input_ids
        with torch.no_grad():
            # Scalar head, read at the last token of the current prefix
            logit = prm(ids).logits[0, -1]
            scores.append(torch.sigmoid(logit).item())
    return scores

import math

def to_steps(text):
    # Split a trace into reasoning steps (here: one step per line).
    return [s for s in text.split("\n") if s.strip()]

def beam_search_with_prm(model, prm, prompt, beam=4, max_steps=12):
    # Step-level beam search guided by a process reward model.
    # A beam's score is the cumulative log of its per-step PRM probabilities.
    beams = [(prompt, 0.0)]
    for _ in range(max_steps):
        candidates = []
        for trace, cum in beams:
            for step in model.sample_steps(trace, k=beam):
                new_trace = trace + "\n" + step
                s = prm_score_trace(prm, model.tokenizer, to_steps(new_trace))[-1]
                candidates.append((new_trace, cum + math.log(max(s, 1e-9))))
        candidates.sort(key=lambda x: -x[1])
        beams = candidates[:beam]
    return max(beams, key=lambda x: x[1])[0]

8. Summary

Test-time compute is now a third scaling axis alongside parameters and training data. Chain-of-thought unrolls extra sequential computation per problem; self-consistency and best-of-$N$ trade parallel samples for accuracy; process reward models turn generation into guided search; and RL with verifiable rewards bakes the whole behavior into the weights. The common thread is economic: unlike parameters and data, inference compute is elastic, so you pay for reasoning only on the problems that need it.

Further reading

  • Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
  • Wang et al. (2022) — Self-Consistency Improves Chain of Thought Reasoning.
  • Lightman et al. (2023) — Let's Verify Step by Step (PRM800K, the canonical PRM paper).
  • OpenAI (2024) — Learning to Reason with LLMs (o1 technical report).
  • DeepSeek-AI (2025) — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
NEXT UP
→ AI Safety & Alignment

The RL techniques that make reasoning work are the same ones that shape model alignment. Read on for how RLHF, DPO, and Constitutional AI relate to the reasoning RL stack.