Foundation Models

Pretrain once, adapt everywhere. The term was coined at Stanford in 2021 to name the biggest shift in ML practice since ImageNet: a single big pretrained model as the base of an entire ecosystem. This page covers the pretraining recipe, the scaling laws that guide it, and how you turn a base model into a useful application.

Prereq: transformers, cross-entropy · Time to read: ~20 min · Interactive figures: 1 · Code: PyTorch, NumPy

1. What's a foundation model?

The term "foundation model" was introduced in a 2021 Stanford report by Bommasani, Liang et al. The claim was modest but load-bearing: training a single model on a huge, diverse dataset and then adapting it to many downstream tasks had become the dominant paradigm in NLP, vision, speech, and robotics — not just another technique. Everything now ran on top of a base.

Concretely, a foundation model has three properties: it is trained on broad data at massive scale; its objective is self-supervised, so the training signal comes from the data itself rather than from labels; and it is adaptable, serving as the starting point for many downstream tasks.

Essentially every headline model of 2024–25 is a foundation model. GPT, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek — language. DINOv2, SAM, SigLIP — vision. Whisper — speech. AlphaFold 3 — structural biology. The details differ; the recipe is the same.

2. The pretraining loss

For a decoder-only language model — the GPT lineage — the pretraining objective is next-token prediction over a corpus:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Next-token loss

$\theta$
The model's parameters (all of its weights).
$x_1, \dots, x_T$
A sequence of tokens (words, subwords, bytes). $T$ is the sequence length — typically thousands.
$x_{<t}$
Shorthand for $(x_1, x_2, \dots, x_{t-1})$ — all tokens before position $t$. The model's context window.
$p_\theta(x_t \mid x_{<t})$
The probability the model assigns to the actual next token, given the past. This comes from a softmax over the vocabulary at position $t$.
$-\log$
The negative log converts probability into "surprise" — it rewards the model for assigning high probability to what actually comes next.

Analogy: You're reading a book with one word covered at a time and trying to guess it based on everything before. If you guess "the" right 90% of the time, you're only mildly surprised. If you say "xyzzy" and the real word is "and", you're very surprised. The loss is the average surprise across all the blanks in the book. Minimizing it is what teaches the model grammar, facts, style, and — eventually — the ability to follow instructions.

This objective looks trivial. It is trivial. Yet it is sufficient, at scale, to produce models that do translation, coding, math, analogical reasoning, and instruction following. Nothing in the loss function asks for any of that. It emerges from compression pressure: to predict the next token well on a sufficiently diverse corpus, you have to model the world that generated it.
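The surprise framing is literal. A minimal numeric sketch (toy probabilities, no model involved):

```python
import math

def surprise(p):
    # Negative log-probability: low for confident correct guesses, high for misses.
    return -math.log(p)

# A model that puts 90% on the true next token is mildly surprised;
# one that puts 0.01% on it is very surprised.
confident = surprise(0.90)     # ≈ 0.105 nats
blindsided = surprise(0.0001)  # ≈ 9.21 nats

# The pretraining loss is just the average surprise over every position.
avg_loss = sum(surprise(p) for p in (0.9, 0.5, 0.0001)) / 3
```

A single badly missed token dominates the average, which is exactly the pressure that forces the model to account for rare but predictable events in the data.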

3. Kaplan scaling laws (2020)

In 2020, Kaplan, McCandlish et al. at OpenAI did the first systematic study of how LLM loss depends on three knobs: model size $N$, dataset size $D$, and compute budget $C$. Their finding — a clean power law:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Kaplan scaling

$L$
Test loss (cross-entropy per token) on a held-out set.
$N$
Non-embedding parameter count.
$D$
Number of training tokens processed.
$N_c, D_c$
Empirical constants that set the scale. They absorb the specifics of the architecture and data mix.
$\alpha_N, \alpha_D$
Scaling exponents — Kaplan found $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$. Small positive numbers, so loss drops slowly but predictably with each doubling.

What's shocking: The power law holds over seven orders of magnitude of compute. No bumps, no plateaus, no discontinuities. Double the compute; subtract a constant from the log-loss. This is why the 2020–2024 era was defined by "just make it bigger" — the curve kept going, and nobody could find the wall.

Kaplan's analysis also gave a formula for how to split a fixed compute budget between model size and dataset size. Their conclusion: if compute doubles, scale up parameters more aggressively than tokens. GPT-3 (175B params, 300B tokens) was a textbook application.
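To get a feel for an exponent like $\alpha_N \approx 0.076$, compute the effect of a doubling directly. A sketch using the exponent quoted above; the constant $N_c$ is illustrative and cancels out of any ratio:

```python
def kaplan_loss(N, N_c=8.8e13, alpha_N=0.076):
    # L(N) = (N_c / N)^alpha_N — the parameter-limited loss curve.
    # N_c is a scale constant (illustrative value); it cancels in ratios.
    return (N_c / N) ** alpha_N

# Doubling parameters multiplies loss by 2^-0.076 ≈ 0.949 — about a 5%
# reduction per doubling, independent of where you start on the curve.
ratio = kaplan_loss(2e9) / kaplan_loss(1e9)
```

Five percent per doubling sounds small until you remember the curve held for seven orders of magnitude: the reductions compound, and each one was worth buying.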

4. Chinchilla (2022)

Two years later, Hoffmann et al. at DeepMind redid the experiment more carefully and got a different answer. They fit scaling curves at many $(N, D)$ pairs, held compute fixed, and found Kaplan had under-trained his biggest models:

$$L(N, D) \approx L_\infty + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

With $\alpha \approx 0.34$, $\beta \approx 0.28$. The optimal split is roughly "tokens and parameters scale in proportion" — for a fixed compute budget $C = 6 N D$, you should have:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Chinchilla

$C = 6ND$
The approximate compute (FLOPs) of training a decoder-only model with $N$ parameters on $D$ tokens. The factor 6 comes from roughly 2 FLOPs (one multiply and one add) per parameter per token in the forward pass, plus about twice that in the backward pass.
$L_\infty$
Irreducible loss — the entropy of the data itself. No matter how big your model, you can't beat this floor.
$A/N^\alpha$
Parameter-scarcity term. Shrinks as $N$ grows.
$B/D^\beta$
Data-scarcity term. Shrinks as $D$ grows.
$\alpha, \beta \approx 0.3$
Much larger than Kaplan's $\sim$0.08. Chinchilla's curve is steeper, meaning each doubling helps more — and the optimal tradeoff is more balanced.

The Chinchilla rule of thumb: train a model on ~20 tokens per parameter. A 70B model should see ~1.4T tokens. GPT-3 at 300B tokens was under-trained for its size; Chinchilla at 70B params trained on 1.4T tokens beat it despite being 2.5× smaller. The result reshaped the industry — Llama, Mistral, and Gemma are all Chinchilla-style: smaller than GPT-3 but trained on far more tokens.

Since 2023 the pendulum has swung again: for inference-heavy deployments, even "over-training" well past the Chinchilla optimum is worth it, because a smaller overtrained model is cheaper to serve forever. Llama-3-8B trained on 15T tokens is the canonical 2024 example.
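The tradeoff can be made quantitative with the Chinchilla fit (a back-of-envelope sketch; the fitted constants are from the Hoffmann et al. paper, and the fixed 8B model size is illustrative):

```python
import math

def chinchilla_loss(N, D, Linf=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return Linf + A / N**alpha + B / D**beta

C = 1e24  # a large pretraining budget

# Chinchilla-optimal split: ~91B params on ~1.8T tokens.
N_opt = math.sqrt(C / 120)
D_opt = 20 * N_opt

# Over-trained alternative: fix N at 8B and pour the whole budget into tokens.
N_small = 8e9
D_small = C / (6 * N_small)  # ~21T tokens

loss_opt = chinchilla_loss(N_opt, D_opt)        # ≈ 1.92
loss_small = chinchilla_loss(N_small, D_small)  # ≈ 1.94
```

The small model gives up about 0.02 nats of loss but is ~11× cheaper per token to serve, and serving costs accumulate for the life of the deployment.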

5. Interactive scaling calculator

Pick a compute budget (in FLOPs). The calculator applies Chinchilla's law to find the optimal $(N, D)$ split and estimates the resulting loss. Drag the slider and watch the frontier move.

[Interactive figure. Slider: compute in log₁₀ FLOPs, default 1e22. Caption: Chinchilla-optimal N and D for the selected compute budget. More compute → both axes grow together.]

6. Emergent abilities

One of the most interesting and contested observations in the scaling literature: some capabilities appear suddenly as you scale, rather than improving smoothly. Arithmetic, multi-step reasoning, instruction following — on plots of accuracy vs. scale they look flat near random, then sharply rise past some threshold.

The caveat, from Schaeffer et al. (2023): many apparent emergent abilities disappear when you use smoother metrics. A graded metric like "partial credit on digit-level accuracy" shows continuous improvement where the brittle "exact match accuracy" shows a cliff. So the jumps are partly an artifact of the evaluation, not the model.

What's genuinely not an artifact: in-context learning, chain-of-thought reasoning, and tool use all require some minimum scale before they're even discoverable. The 125M-parameter GPT-2 cannot be prompted into solving arithmetic no matter how you phrase the prompt. Something qualitative happens around 7B–70B parameters that makes those capabilities accessible.
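Schaeffer et al.'s point can be reproduced with arithmetic alone. A toy model, not fitted to any real benchmark: suppose per-digit accuracy improves smoothly with scale, and "solving" a problem means getting all 10 digits of the answer right.

```python
# Smoothly improving per-digit accuracy across five model scales (toy numbers).
digit_acc = [0.20, 0.40, 0.60, 0.80, 0.95]

# Exact-match needs all 10 digits correct: p^10, assuming independent digits.
exact_match = [p ** 10 for p in digit_acc]
# → [~1e-7, ~1e-4, 0.006, 0.107, 0.599]
# The graded metric rises linearly; the all-or-nothing metric sits near zero,
# then "emerges" at the largest scales. Same underlying model, different metric.
```

The cliff lives in the metric, not the model: exponentiating a smooth curve by the number of things that must simultaneously go right manufactures a threshold.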

7. Adaptation — turning a base into an app

A pretrained base model is not, by itself, a useful product. It will cheerfully continue any text, including things you didn't want it to say. Adaptation turns it into an assistant. Three standard layers:

  1. Supervised fine-tuning (SFT). Train on a few tens of thousands of high-quality instruction–response pairs. Cheap, effective, and mostly shapes tone and format.
  2. Preference optimization. Collect pairs of outputs $(y_w, y_l)$ where $y_w$ is preferred to $y_l$, and train the model to prefer $y_w$ over $y_l$. Flavors: RLHF (PPO on a learned reward), DPO (closed-form supervised objective), IPO, KTO. Shapes values, helpfulness, safety.
  3. Lightweight adapters. LoRA, QLoRA, IA³ — instead of updating all billions of parameters, train a tiny add-on module (~1% of params) for a specific task. The base stays frozen. You can serve thousands of LoRA adapters for different customers on a single shared base.

The division of labor is important: pretraining teaches the model the world. Fine-tuning teaches it the interface — how to be a chatbot, how to follow a format, what domain to privilege. It's much cheaper to do the second step; a 7B model can be fine-tuned on a single GPU in an afternoon.
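As one concrete flavor of preference optimization, the DPO objective reduces to a few lines. A sketch: in a real pipeline the log-probabilities would come from summing token log-probs of each full response under the policy and under the frozen reference model; the tensors below are illustrative stand-ins.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # How much more the policy prefers the winner over the loser,
    # measured relative to the frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Maximizing sigmoid(margin) ⇔ minimizing -logsigmoid(margin).
    return -F.logsigmoid(margin).mean()

# If the policy already prefers the winner more than the reference does,
# the margin is positive and the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```

Note what is absent: no reward model, no sampling loop. That closed form is exactly why DPO displaced PPO-based RLHF for many teams.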

8. Source code

A minimal next-token pretraining loop (no mixed precision, no distributed, no tricks), and a Chinchilla calculator.

foundation model · core pieces
import torch, torch.nn.functional as F

def pretrain_step(model, batch, optimizer):
    # batch: dict with input_ids (B, T+1)
    ids = batch["input_ids"]
    x = ids[:, :-1]                             # (B, T) — inputs
    y = ids[:, 1:]                              # (B, T) — targets, shifted

    logits = model(x)                              # (B, T, V)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        y.reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()

# That's it. Loop this a few trillion times on the internet and a foundation
# model falls out. Every other detail is scale, data, and optimization tricks.

chinchilla calculator
import math

def chinchilla_optimal(C):
    # Given compute budget C (FLOPs), return (N_opt, D_opt) per Chinchilla.
    # N_opt and D_opt both scale as C^0.5. With the D ≈ 20 N rule of thumb
    # and C = 6 N D, this gives N = sqrt(C / 120), D = 20 N.
    N_opt = math.sqrt(C / 120)
    D_opt = 20 * N_opt
    return N_opt, D_opt

def chinchilla_loss(N, D, Linf=1.69, A=406.4, B=410.7,
                    alpha=0.34, beta=0.28):
    # Predict loss from N and D using the Chinchilla paper's fit.
    return Linf + A / (N ** alpha) + B / (D ** beta)

# Example: 1e22 FLOPs (a mid-size pretraining run ca. 2023)
N, D = chinchilla_optimal(1e22)
print(f"N = {N/1e9:.1f}B params, D = {D/1e9:.0f}B tokens")
print(f"loss ≈ {chinchilla_loss(N, D):.3f}")

LoRA adapter
import torch, torch.nn as nn

class LoRALinear(nn.Module):
    # A drop-in replacement for nn.Linear that adds a low-rank update:
    #    y = (W + (alpha/r) * B @ A) x
    # The base W is frozen. Only A and B are trained, ~0.1% of params.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base  = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))   # zero-init: update starts at 0, so layer == base at init
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
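The parameter-count claim is easy to sanity-check for a single wrapped layer (illustrative dimensions; real models vary):

```python
d_in = d_out = 4096   # a typical attention projection in a ~7B model
r = 8                 # LoRA rank

base_params = d_in * d_out            # 16,777,216 frozen weights
lora_params = r * (d_in + d_out)      # 65,536 trainable weights (A and B)
fraction = lora_params / base_params  # ≈ 0.0039, i.e. ~0.4% for this layer
```

The fraction over a whole model depends on which layers you wrap and the rank you choose, which is why quoted figures range from roughly 0.1% to 1%.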

9. Summary

  • A foundation model is a single self-supervised base, trained at massive scale, that acts as the starting point for many downstream applications.
  • The pretraining loss is almost trivially simple — next-token prediction for language, masked reconstruction for images, contrastive alignment for multimodal. Everything else is scale and data.
  • Kaplan (2020) found clean power-law scaling across 7 orders of magnitude of compute. Scale param-heavy.
  • Chinchilla (2022) fixed the recipe: scale parameters and tokens in roughly equal proportion. Rule of thumb — 20 tokens per parameter.
  • Post-2023 pendulum: over-train small models past Chinchilla-optimal if you're going to pay serving costs forever.
  • Adaptation layers — SFT, DPO/RLHF, LoRA — turn a base model into a useful application. The base is the expensive part, but you only have to do it once.

Further reading

  • Bommasani et al. (2021) — On the Opportunities and Risks of Foundation Models.
  • Kaplan et al. (2020) — Scaling Laws for Neural Language Models.
  • Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla).
  • Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models.
  • Schaeffer et al. (2023) — Are Emergent Abilities of Large Language Models a Mirage?
NEXT UP
→ Retrieval-Augmented Generation

Scaling a foundation model is one way to give it more knowledge. Letting it look things up is the other. Read on for the retrieval-augmented approach.