Foundation Models

Pretrain once, adapt everywhere. The term was coined at Stanford in 2021 to name the biggest shift in ML practice since ImageNet: a single big pretrained model as the base of an entire ecosystem. This page covers the pretraining recipe, the scaling laws that guide it, and how you turn a base model into a useful application.

Prereq: transformers, cross-entropy · Time to read: ~20 min · Interactive figures: 1 · Code: PyTorch, NumPy

1. What's a foundation model?

The term "foundation model" was introduced in a 2021 Stanford report by Bommasani, Liang et al. The claim was modest but load-bearing: training a single model on a huge, diverse dataset and then adapting it to many downstream tasks had become the dominant paradigm in NLP, vision, speech, and robotics — not just another technique. Everything now ran on top of a base.

Concretely, a foundation model has three properties: it is trained on broad data at massive scale; its objective is self-supervised, so the training signal comes from the data itself rather than from labels; and it is adaptable, serving as the starting point for many downstream tasks.

Essentially every headline model of 2024–25 is a foundation model. GPT, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek — language. DINOv2, SAM, SigLIP — vision. Whisper — speech. AlphaFold 3 — structural biology. The details differ; the recipe is the same.

2. The pretraining loss

For a decoder-only language model — the GPT lineage — the pretraining objective is next-token prediction over a corpus:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Next-token loss

$\theta$
The model's parameters (all of its weights).
$x_1, \dots, x_T$
A sequence of tokens (words, subwords, bytes). $T$ is the sequence length — typically thousands.
$x_{<t}$
Shorthand for $(x_1, x_2, \dots, x_{t-1})$ — all tokens before position $t$. The model's context window.
$p_\theta(x_t \mid x_{<t})$
The probability the model assigns to the actual next token, given the past. This comes from a softmax over the vocabulary at position $t$.
$-\log$
The negative log converts probability into "surprise" — it rewards the model for assigning high probability to what actually comes next.

Analogy: You're reading a book with one word covered at a time and trying to guess it based on everything before. If you guess "the" right 90% of the time, you're only mildly surprised. If you say "xyzzy" and the real word is "and", you're very surprised. The loss is the average surprise across all the blanks in the book. Minimizing it is what teaches the model grammar, facts, style, and — eventually — the ability to follow instructions.

This objective looks trivial. It is trivial. Yet it is sufficient, at scale, to produce models that do translation, coding, math, analogical reasoning, and instruction following. Nothing in the loss function asks for any of that. It emerges from compression pressure: to predict the next token well on a sufficiently diverse corpus, you have to model the world that generated it.
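The surprise framing is literal. A minimal numeric sketch (toy probabilities, no model involved):

```python
import math

def surprise(p):
    # Negative log-probability: low for confident correct guesses, high for misses.
    return -math.log(p)

# A model that puts 90% on the true next token is mildly surprised;
# one that puts 0.01% on it is very surprised.
confident = surprise(0.90)     # ≈ 0.105 nats
blindsided = surprise(0.0001)  # ≈ 9.21 nats

# The pretraining loss is just the average surprise over every position.
avg_loss = sum(surprise(p) for p in (0.9, 0.5, 0.0001)) / 3
```

A single badly missed token dominates the average, which is exactly the pressure that forces the model to account for rare but predictable events in the data.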

3. Kaplan scaling laws (2020)

In 2020, Kaplan, McCandlish et al. at OpenAI did the first systematic study of how LLM loss depends on three knobs: model size $N$, dataset size $D$, and compute budget $C$. Their finding — a clean power law:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Kaplan scaling

$L$
Test loss (cross-entropy per token) on a held-out set.
$N$
Non-embedding parameter count.
$D$
Number of training tokens processed.
$N_c, D_c$
Empirical constants that set the scale. They absorb the specifics of the architecture and data mix.
$\alpha_N, \alpha_D$
Scaling exponents — Kaplan found $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$. Small positive numbers, so loss drops slowly but predictably with each doubling.

What's shocking: The power law holds over seven orders of magnitude of compute. No bumps, no plateaus, no discontinuities. Double the compute; subtract a constant from the log-loss. This is why the 2020–2024 era was defined by "just make it bigger" — the curve kept going, and nobody could find the wall.

Kaplan's analysis also gave a formula for how to split a fixed compute budget between model size and dataset size. Their conclusion: if compute doubles, scale up parameters more aggressively than tokens. GPT-3 (175B params, 300B tokens) was a textbook application.
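To get a feel for an exponent like $\alpha_N \approx 0.076$, compute the effect of a doubling directly. A sketch using the exponent quoted above; the constant $N_c$ is illustrative and cancels out of any ratio:

```python
def kaplan_loss(N, N_c=8.8e13, alpha_N=0.076):
    # L(N) = (N_c / N)^alpha_N — the parameter-limited loss curve.
    # N_c is a scale constant (illustrative value); it cancels in ratios.
    return (N_c / N) ** alpha_N

# Doubling parameters multiplies loss by 2^-0.076 ≈ 0.949 — about a 5%
# reduction per doubling, independent of where you start on the curve.
ratio = kaplan_loss(2e9) / kaplan_loss(1e9)
```

Five percent per doubling sounds small until you remember the curve held for seven orders of magnitude: the reductions compound, and each one was worth buying.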

4. Chinchilla (2022)

Two years later, Hoffmann et al. at DeepMind redid the experiment more carefully and got a different answer. They fit scaling curves at many $(N, D)$ pairs, held compute fixed, and found Kaplan had under-trained his biggest models:

$$L(N, D) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

With $\alpha \approx 0.34$, $\beta \approx 0.28$. The optimal split is roughly "tokens and parameters scale in proportion" — for a fixed compute budget $C = 6 N D$, you should have:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Chinchilla

$C = 6ND$
The approximate compute (FLOPs) of training a decoder-only model with $N$ parameters on $D$ tokens. The factor 6 comes from roughly 2 FLOPs (one multiply and one add) per parameter per token in the forward pass, plus about twice that in the backward pass.
$L_\infty$
Irreducible loss — the entropy of the data itself. No matter how big your model, you can't beat this floor.
$A/N^\alpha$
Parameter-scarcity term. Shrinks as $N$ grows.
$B/D^\beta$
Data-scarcity term. Shrinks as $D$ grows.
$\alpha, \beta \approx 0.3$
Much larger than Kaplan's $\sim$0.08. Chinchilla's curve is steeper, meaning each doubling helps more — and the optimal tradeoff is more balanced.

The Chinchilla rule of thumb: train a model on ~20 tokens per parameter. A 70B model should see ~1.4T tokens. GPT-3 at 300B tokens was under-trained for its size; Chinchilla at 70B params trained on 1.4T tokens beat it despite being 2.5× smaller. The result reshaped the industry — Llama, Mistral, and Gemma are all Chinchilla-style: smaller than GPT-3 but trained on far more tokens.

Since 2023 the pendulum has swung again: for inference-heavy deployments, even "over-training" well past the Chinchilla optimum is worth it, because a smaller overtrained model is cheaper to serve forever. Llama-3-8B trained on 15T tokens is the canonical 2024 example.
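The tradeoff can be made quantitative with the Chinchilla fit (a back-of-envelope sketch; the fitted constants are from the Hoffmann et al. paper, and the fixed 8B model size is illustrative):

```python
import math

def chinchilla_loss(N, D, Linf=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return Linf + A / N**alpha + B / D**beta

C = 1e24  # a large pretraining budget

# Chinchilla-optimal split: ~91B params on ~1.8T tokens.
N_opt = math.sqrt(C / 120)
D_opt = 20 * N_opt

# Over-trained alternative: fix N at 8B and pour the whole budget into tokens.
N_small = 8e9
D_small = C / (6 * N_small)  # ~21T tokens

loss_opt = chinchilla_loss(N_opt, D_opt)        # ≈ 1.92
loss_small = chinchilla_loss(N_small, D_small)  # ≈ 1.94
```

The small model gives up about 0.02 nats of loss but is ~11× cheaper per token to serve, and serving costs accumulate for the life of the deployment.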

5. Interactive scaling calculator

Pick a compute budget (in FLOPs). The calculator applies Chinchilla's law to find the optimal $(N, D)$ split and estimates the resulting loss. Drag the slider and watch the frontier move.

[Interactive figure. Slider: compute in log₁₀ FLOPs, default 1e22. Caption: Chinchilla-optimal N and D for the selected compute budget. More compute → both axes grow together.]

6. Emergent abilities

One of the most interesting and contested observations in the scaling literature: some capabilities appear suddenly as you scale, rather than improving smoothly. Arithmetic, multi-step reasoning, instruction following — on plots of accuracy vs. scale they look flat near random, then sharply rise past some threshold.

The caveat, from Schaeffer et al. (2023): many apparent emergent abilities disappear when you use smoother metrics. A graded metric like "partial credit on digit-level accuracy" shows continuous improvement where the brittle "exact match accuracy" shows a cliff. So the jumps are partly an artifact of the evaluation, not the model.

What's genuinely not an artifact: in-context learning, chain-of-thought reasoning, and tool use all require some minimum scale before they're even discoverable. The 125M-parameter GPT-2 cannot be prompted into solving arithmetic no matter how you phrase the prompt. Something qualitative happens around 7B–70B parameters that makes those capabilities accessible.
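Schaeffer et al.'s point can be reproduced with arithmetic alone. A toy model, not fitted to any real benchmark: suppose per-digit accuracy improves smoothly with scale, and "solving" a problem means getting all 10 digits of the answer right.

```python
# Smoothly improving per-digit accuracy across five model scales (toy numbers).
digit_acc = [0.20, 0.40, 0.60, 0.80, 0.95]

# Exact-match needs all 10 digits correct: p^10, assuming independent digits.
exact_match = [p ** 10 for p in digit_acc]
# → [~1e-7, ~1e-4, 0.006, 0.107, 0.599]
# The graded metric rises linearly; the all-or-nothing metric sits near zero,
# then "emerges" at the largest scales. Same underlying model, different metric.
```

The cliff lives in the metric, not the model: exponentiating a smooth curve by the number of things that must simultaneously go right manufactures a threshold.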

7. Adaptation — turning a base into an app

A pretrained base model is not, by itself, a useful product. It will cheerfully continue any text, including things you didn't want it to say. Adaptation turns it into an assistant. Three standard layers:

  1. Supervised fine-tuning (SFT). Train on a few tens of thousands of high-quality instruction–response pairs. Cheap, effective, and mostly shapes tone and format.
  2. Preference optimization. Collect pairs of outputs $(y_w, y_l)$ where $y_w$ is preferred to $y_l$, and train the model to prefer $y_w$ over $y_l$. Flavors: RLHF (PPO on a learned reward), DPO (closed-form supervised objective), IPO, KTO. Shapes values, helpfulness, safety.
  3. Lightweight adapters. LoRA, QLoRA, IA³ — instead of updating all billions of parameters, train a tiny add-on module (~1% of params) for a specific task. The base stays frozen. You can serve thousands of LoRA adapters for different customers on a single shared base.

The division of labor is important: pretraining teaches the model the world. Fine-tuning teaches it the interface — how to be a chatbot, how to follow a format, what domain to privilege. It's much cheaper to do the second step; a 7B model can be fine-tuned on a single GPU in an afternoon.
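As one concrete flavor of preference optimization, the DPO objective reduces to a few lines. A sketch: in a real pipeline the log-probabilities would come from summing token log-probs of each full response under the policy and under the frozen reference model; the tensors below are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # How much more the policy prefers the winner over the loser,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Maximizing sigmoid(margin) ⇔ minimizing -logsigmoid(margin).
    return -F.logsigmoid(margin).mean()

# If the policy already prefers the winner more than the reference does,
# the margin is positive and the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```

Note what is absent: no reward model, no sampling loop. That closed form is exactly why DPO displaced PPO-based RLHF for many teams.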

8. Source code

A minimal next-token pretraining loop (no mixed precision, no distributed, no tricks), and a Chinchilla calculator.

foundation model · core pieces
import torch, torch.nn.functional as F

def pretrain_step(model, batch, optimizer):
    # batch: dict with input_ids (B, T+1)
    ids = batch["input_ids"]
    x = ids[:, :-1]                             # (B, T) — inputs
    y = ids[:, 1:]                              # (B, T) — targets, shifted

    logits = model(x)                              # (B, T, V)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        y.reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()

# That's it. Loop this a few trillion times on the internet and a foundation
# model falls out. Every other detail is scale, data, and optimization tricks.

chinchilla calculator
import math

def chinchilla_optimal(C):
    # Given compute budget C (FLOPs), return (N_opt, D_opt) per Chinchilla.
    # N_opt and D_opt both scale as C^0.5. With the D ≈ 20 N rule of thumb
    # and C = 6 N D, this gives N = sqrt(C / 120), D = 20 N.
    N_opt = math.sqrt(C / 120)
    D_opt = 20 * N_opt
    return N_opt, D_opt

def chinchilla_loss(N, D, Linf=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    # Predict loss from N and D using the Chinchilla paper's fit.
    return Linf + A / (N ** alpha) + B / (D ** beta)

# Example: 1e22 FLOPs (a mid-size pretraining run ca. 2023)
N, D = chinchilla_optimal(1e22)
print(f"N = {N/1e9:.1f}B params, D = {D/1e9:.0f}B tokens")
print(f"loss ≈ {chinchilla_loss(N, D):.3f}")

LoRA adapter
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    # A drop-in replacement for nn.Linear that adds a low-rank update:
    #    y = (W + (alpha/r) * B @ A) x
    # The base W is frozen. Only A and B are trained, ~0.1% of params.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base  = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))   # zero-init: update starts at 0, so layer == base at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
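The parameter-count claim is easy to sanity-check for a single wrapped layer (illustrative dimensions; real models vary):

```python
d_in = d_out = 4096   # a typical attention projection in a ~7B model
r = 8                 # LoRA rank

base_params = d_in * d_out            # 16,777,216 frozen weights
lora_params = r * (d_in + d_out)      # 65,536 trainable weights (A and B)
fraction = lora_params / base_params  # ≈ 0.0039, i.e. ~0.4% for this layer
```

The fraction over a whole model depends on which layers you wrap and the rank you choose, which is why quoted figures range from roughly 0.1% to 1%.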

9. Summary

  • A foundation model is a single self-supervised base, trained at massive scale, that acts as the starting point for many downstream applications.
  • The pretraining loss is almost trivially simple — next-token prediction for language, masked reconstruction for images, contrastive alignment for multimodal. Everything else is scale and data.
  • Kaplan (2020) found clean power-law scaling across 7 orders of magnitude of compute. Scale param-heavy.
  • Chinchilla (2022) fixed the recipe: scale parameters and tokens in roughly equal proportion. Rule of thumb — 20 tokens per parameter.
  • Post-2023 pendulum: over-train small models past Chinchilla-optimal if you're going to pay serving costs forever.
  • Adaptation layers — SFT, DPO/RLHF, LoRA — turn a base model into a useful application. The base is the expensive part, but you only have to do it once.

Further reading

  • Bommasani et al. (2021) — On the Opportunities and Risks of Foundation Models.
  • Kaplan et al. (2020) — Scaling Laws for Neural Language Models.
  • Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla).
  • Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models.
  • Schaeffer et al. (2023) — Are Emergent Abilities of Large Language Models a Mirage?
NEXT UP
→ Retrieval-Augmented Generation

Scaling a foundation model is one way to give it more knowledge. Letting it look things up is the other. Read on for the retrieval-augmented approach.