Diffusion Models

Start with pure noise. Repeatedly ask a neural network "what did this look like one step ago, before I added a little noise?" Do that a few hundred times and you have Stable Diffusion, DALL-E, Sora, Imagen.

Prereq: Gaussian distributions, basic probability · Time to read: ~30 min · Interactive figures: 2

1. Noise ↔ data

Imagine a clean image $x_0$ — say, a photograph of a cat. Now corrupt it with a tiny bit of Gaussian noise: $x_1 = x_0 + \epsilon_1$, where $\epsilon_1 \sim \mathcal{N}(0, \sigma^2 I)$. The result is still recognizable but slightly grainy. Do it again to get $x_2$. And again. After a few hundred steps of adding noise, all structure has been destroyed and $x_T$ is indistinguishable from pure Gaussian noise.

Now imagine a magical function $f_\theta$ that can reverse this one step: given $x_t$, it predicts $x_{t-1}$. Starting from pure noise $x_T \sim \mathcal{N}(0, I)$, we could apply $f_\theta$ $T$ times and arrive at a clean image $x_0$ — a brand new sample from the data distribution. That's the entire idea of diffusion models.

The catch is that $f_\theta$ cannot be a deterministic map from $x_t$ to $x_{t-1}$ — the forward noise was stochastic, so many different $x_{t-1}$'s are consistent with the same $x_t$. Instead, you train the network to predict the noise that was added. Given that prediction, an estimate of $x_{t-1}$ can be recovered analytically.

WHY IT'S MIRACULOUS

You only have to learn a single, fixed task: "given a noisy image and how noisy it is, estimate the noise." The same network handles all noise levels. At sampling time you apply it hundreds of times in a carefully designed sequence, and out comes a photorealistic image. There's no GAN-style adversarial training, no mode collapse, no complex balancing of losses — just denoising.

2. The forward process

The forward process is a Markov chain that adds a small amount of Gaussian noise at each step. It is fixed — no learned parameters. It's just a schedule.

Forward process

Fix a noise schedule $\beta_1, \beta_2, \dots, \beta_T \in (0, 1)$ with $\beta_t$ small. Define the transition:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right)$$

In sampling form:

$$x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The factor $\sqrt{1 - \beta_t}$ in front of $x_{t-1}$ is a careful choice. It ensures that if $x_{t-1}$ has unit variance then $x_t$ also has unit variance — the scale is preserved as noise accumulates. Without it, the variance would grow without bound and $x_T$ would be a Gaussian with ever-growing variance instead of a standard one.
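The variance-preservation claim is easy to check numerically — a sketch assuming NumPy, with an arbitrary $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.02                                  # one step of the noise schedule
x_prev = rng.standard_normal(1_000_000)      # unit-variance "x_{t-1}"
eps = rng.standard_normal(1_000_000)

x_t = np.sqrt(1 - beta) * x_prev + np.sqrt(beta) * eps
print(x_t.var())                             # ≈ 1.0: variance is preserved

# Without the sqrt(1 - beta) factor, variance grows by beta each step:
print((x_prev + np.sqrt(beta) * eps).var())  # ≈ 1.02
```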

Interactive: watch noise accumulate

A tiny 10×10 "smiley" image gets corrupted one step at a time. Adjust the schedule and starting timestep to see how quickly structure is destroyed. Press Step → to add one step of noise.

[Interactive figure: forward noising of a 10×10 image, $t$ from 0 to 20. Readout: current $t$, $\beta_t$, $\bar{\alpha}_t$ (signal retention), noise std, signal std.]

Step 0 is the original image. Each step multiplies the current state by $\sqrt{1-\beta_t}$ and adds Gaussian noise of scale $\sqrt{\beta_t}$.

Press Step → to begin.

3. Closed form: jumping to any step

Applying the forward chain step-by-step works, but for training we need something faster. Here's the miracle: the noise accumulated over $t$ steps is itself Gaussian, and we can write $x_t$ directly in terms of $x_0$ without iterating.

Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Then:

Closed-form forward
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right)$$

Or in sampling form:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Quick derivation (telescoping)

$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, \epsilon_{t}$

Substitute $x_{t-1} = \sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_{t-1}}\, \epsilon_{t-1}$:

$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t(1 - \alpha_{t-1})}\, \epsilon_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t$

The sum of independent Gaussians is Gaussian, and the variance is the sum of squared coefficients: $\alpha_t(1 - \alpha_{t-1}) + (1 - \alpha_t) = 1 - \alpha_t \alpha_{t-1}$.

Continuing telescopically, after $t$ substitutions we arrive at $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{\epsilon}$ with $\bar{\epsilon} \sim \mathcal{N}(0, I)$. $\blacksquare$

This means at training time we can sample a random $t \in \{1, \dots, T\}$ and compute $x_t$ directly in one line. No iteration. This single fact is what makes diffusion models trainable at scale.
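The closed form can be checked against the step-by-step chain: both should produce the same marginal distribution. A sketch assuming NumPy and a short toy schedule:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1 - betas
alpha_bar = np.cumprod(alphas)

x0 = rng.standard_normal(200_000)    # toy unit-variance "data"

# Iterative: apply all T forward steps one at a time.
x = x0.copy()
for t in range(T):
    x = np.sqrt(alphas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

# Direct: jump to step T in one line.
x_direct = (np.sqrt(alpha_bar[-1]) * x0
            + np.sqrt(1 - alpha_bar[-1]) * rng.standard_normal(x0.shape))

# Both marginals are N(sqrt(abar_T) x0, (1 - abar_T) I), so the stats match:
print(x.std(), x_direct.std())       # both ≈ 1.0
```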

Signal vs. noise at different $t$

With a linear schedule $\beta_t$ from 0.0001 to 0.02 over $T = 1000$ steps: $\bar{\alpha}_{100} \approx 0.90$, $\bar{\alpha}_{500} \approx 0.08$, $\bar{\alpha}_{999} \approx 4 \times 10^{-5}$.

At $t = 500$, the image is roughly $\sqrt{0.08} \approx 28\%$ signal and $\sqrt{0.92} \approx 96\%$ noise (by standard deviation). By $t = 999$ the signal has essentially vanished.
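These numbers are easy to reproduce, assuming NumPy:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)     # linear DDPM schedule
alpha_bar = np.cumprod(1.0 - betas)

for t in (100, 500, 999):
    ab = alpha_bar[t - 1]
    print(f"t={t}: abar={ab:.2e}  signal={np.sqrt(ab):.0%}  noise={np.sqrt(1 - ab):.0%}")
```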

4. The reverse process

We want to learn $p_\theta(x_{t-1} \mid x_t)$ — a model that, given a noisy image, tells us how to denoise it by one step. If $\beta_t$ is small, the true reverse process is also approximately Gaussian, and we parameterize it as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t))$$

In the simplest formulation (DDPM, Ho et al., 2020), we fix the variance $\Sigma_\theta$ to a known schedule and only learn the mean $\mu_\theta$. But instead of directly predicting the mean, we predict the noise.

Mean from predicted noise

Let $\epsilon_\theta(x_t, t)$ be a neural network that predicts the noise component of $x_t$. Then the reverse mean is:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)$$

This formula falls out of Bayes' theorem applied to the forward process, combined with the closed-form forward. If you want to work it out: start from $q(x_{t-1} \mid x_t, x_0)$, which is Gaussian and computable exactly via Bayes; express $x_0$ in terms of $x_t$ and $\epsilon$ using the closed-form forward; and simplify. You arrive at the formula above.
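The identity can also be verified numerically: substitute the true noise for the network's prediction, and the formula reproduces the posterior mean of $q(x_{t-1} \mid x_t, x_0)$ obtained from Bayes. A sketch assuming NumPy, with an arbitrary timestep:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1 - betas
abar = np.cumprod(alphas)

t = 400                                    # arbitrary timestep (1-indexed)
x0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(abar[t-1]) * x0 + np.sqrt(1 - abar[t-1]) * eps

# Formula from the text, with the true eps in place of the network output:
mu = (x_t - (1 - alphas[t-1]) / np.sqrt(1 - abar[t-1]) * eps) / np.sqrt(alphas[t-1])

# Posterior mean of q(x_{t-1} | x_t, x_0), computed directly via Bayes:
mu_post = (np.sqrt(abar[t-2]) * betas[t-1] * x0
           + np.sqrt(alphas[t-1]) * (1 - abar[t-2]) * x_t) / (1 - abar[t-1])

print(np.max(np.abs(mu - mu_post)))        # ≈ 0: the two expressions agree
```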

5. The training objective

Diffusion models were originally motivated as variational autoencoders with a very long chain of latents, and the original derivation minimizes an ELBO (evidence lower bound). The algebra is gnarly but the final result is shockingly clean:

Simplified DDPM loss
$$\mathcal{L} = \mathbb{E}_{t,\, x_0,\, \epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right) \right\|^2 \right]$$

Decoded:

  1. Sample a training image $x_0$.
  2. Sample a random timestep $t \in \{1, \dots, T\}$.
  3. Sample Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$.
  4. Form the noisy image $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$.
  5. Feed $(x_t, t)$ into the network $\epsilon_\theta$, which outputs a noise prediction.
  6. Loss: squared error between predicted and actual noise.

That's it. It is a supervised regression task — the targets are the noise we just added ourselves. The network learns to estimate the noise content of an image for every noise level in the schedule.
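Those six steps fit in a few lines. A sketch assuming NumPy, with a dummy zero predictor standing in for the real network:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1 - betas)

def eps_theta(x_t, t):
    # Stand-in for the real U-Net: an untrained "network" that always
    # guesses zero noise. A trained model would do much better.
    return np.zeros_like(x_t)

# One training step, following 1-6 above:
x0 = rng.standard_normal((32, 32))        # 1. a "training image"
t = int(rng.integers(1, T + 1))           # 2. random timestep
eps = rng.standard_normal(x0.shape)       # 3. Gaussian noise
x_t = np.sqrt(abar[t-1]) * x0 + np.sqrt(1 - abar[t-1]) * eps   # 4. noisy image
pred = eps_theta(x_t, t)                  # 5. noise prediction
loss = np.mean((eps - pred) ** 2)         # 6. MSE between actual and predicted
print(loss)   # ≈ 1.0 for the zero predictor, since eps has unit variance
```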

THE SIMPLICITY TRICK

The full ELBO objective includes weighting terms that depend on $t$. Ho et al. noticed that dropping these weights (uniform weighting) yields better samples empirically. So the real loss is just MSE of noise prediction. No fancy divergences, no adversarial games.

The network architecture

In practice $\epsilon_\theta$ is a U-Net: an encoder-decoder CNN with skip connections between corresponding encoder and decoder blocks. The encoder downsamples the image through several resolutions, building up feature maps; the decoder upsamples back to the original size, using the skip connections to recover fine detail. The timestep $t$ is embedded (typically with sinusoidal encoding + MLP) and added to every layer's activations so the network knows how much noise to look for.
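The sinusoidal timestep encoding can be sketched as follows (dimension 128 and base 10,000 are conventional choices, not fixed by the method; the MLP on top is omitted):

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10_000):
    """Map an integer timestep to a `dim`-vector of cosines and sines at
    geometrically spaced frequencies, as in Transformer positional
    encodings. An MLP is usually applied on top of this vector."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500)
print(emb.shape)   # (128,)
```

Nearby timesteps get similar embeddings, so the network can smoothly interpolate its behavior across noise levels.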

Modern large-scale diffusion models often use transformer-based backbones instead of U-Nets (DiT, MMDiT, etc.) — but the training procedure is the same.

6. Sampling (DDPM)

Training over. Now we want to generate new images. Start from pure noise $x_T \sim \mathcal{N}(0, I)$ and iterate the reverse step:

DDPM SAMPLING (pseudocode)
x = sample from N(0, I)             # start from noise
for t in T, T-1, ..., 1:
    z = sample from N(0, I) if t > 1 else 0
    pred_eps = eps_theta(x, t)       # network's noise prediction
    x = (1 / sqrt(alpha_t)) *
        (x - ((1 - alpha_t) / sqrt(1 - alpha_bar_t)) * pred_eps) +
        sigma_t * z
return x                             # x_0: the generated image

The extra $\sigma_t z$ term is the added stochasticity: the reverse process is itself a noisy Markov chain, so each step samples from the Gaussian $p_\theta(x_{t-1} \mid x_t)$ rather than just taking its mean (a common choice is $\sigma_t = \sqrt{\beta_t}$). Each reverse step denoises a little and reintroduces a small amount of fresh noise. Diversity across sampling runs comes mainly from the random start $x_T$; simply dropping the $\sigma_t z$ term gives a subtly incorrect sampler, and making deterministic steps correct requires changing the update rule — that's DDIM, below.
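The pseudocode runs as-is once you supply a schedule and a network. Here is a pure-NumPy sketch on a toy problem — the "dataset" is a single 2-D point, for which the optimal noise predictor is known in closed form, so no training is needed (`x0_true` and the choice $\sigma_t = \sqrt{\beta_t}$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

x0_true = np.array([1.5, -0.7])   # toy "dataset": a single 2-D point

def eps_theta(x, t):
    # Stand-in for the trained network: for a point-mass data distribution
    # the optimal noise prediction has a closed form (no training needed).
    return (x - np.sqrt(abar[t]) * x0_true) / np.sqrt(1.0 - abar[t])

x = rng.standard_normal(2)                       # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):                   # reverse steps, 0-indexed
    z = rng.standard_normal(2) if t > 0 else np.zeros(2)
    pred = eps_theta(x, t)
    x = (x - (1 - alphas[t]) / np.sqrt(1 - abar[t]) * pred) / np.sqrt(alphas[t]) \
        + np.sqrt(betas[t]) * z                  # sigma_t = sqrt(beta_t)

print(x)   # ≈ [1.5, -0.7]: the sampler walks back to the data point
```

With a real dataset, $\epsilon_\theta$ is a trained network and different draws of $x_T$ land on different samples; here the point-mass data forces every run to the same answer.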

Interactive: watch an image emerge from noise

For educational purposes, this demo does a simulated denoising: it starts from pure noise and reveals a hidden image in proportion to $\sqrt{\bar{\alpha}_t}$. This is what the DDPM sampling equations reduce to when the model's noise predictions are perfect — the trajectory shown is the one a well-trained diffusion model would produce.

[Interactive figure: reverse denoising, noise → image. Readout: current $t$, signal $\sqrt{\bar{\alpha}_t}$, noise $\sqrt{1-\bar{\alpha}_t}$, progress.]

Starting state is pure noise. Each step reduces the noise contribution and reveals more of the signal — exactly what a trained denoiser would do.

Press Step → to take one denoising step.

7. Noise schedules

The schedule $\{\beta_t\}$ is a critical design choice. Different schedules make the forward process destroy information at different rates, which changes which noise levels the model spends effort on.

Linear

$\beta_t$ linearly interpolates between $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$. The original DDPM schedule. Simple but wastes capacity on near-pure-noise timesteps that are already easy to denoise.

Cosine

Introduced by Nichol & Dhariwal (2021). $\bar{\alpha}_t$ follows a cosine curve so the noise level changes slowly near $t = 0$ and $t = T$. Better sample quality, especially at low step counts.

Variance-preserving (VP)

The family DDPM belongs to: the signal is shrunk by $\sqrt{\alpha_t}$ and noise with standard deviation $\sqrt{\beta_t}$ is added, preserving unit variance overall.

Variance-exploding (VE)

Alternative formulation (Song & Ermon). $x_t = x_0 + \sigma_t \epsilon$ with increasing $\sigma_t$. Variance grows unboundedly. Equivalent expressive power, different implementation details.
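The schedules are easiest to compare through their $\bar{\alpha}_t$ curves. A sketch assuming NumPy, using the linear schedule above and the cosine schedule's defining formula (with the standard small offset $s = 0.008$):

```python
import numpy as np

T = 1000
t = np.arange(T + 1)

# Linear: beta from 1e-4 to 0.02; abar is the cumulative product.
betas = np.linspace(1e-4, 0.02, T)
abar_linear = np.cumprod(1 - betas)

# Cosine (Nichol & Dhariwal, 2021): define abar directly via a cosine curve.
s = 0.008
f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cosine = f / f[0]

for frac in (0.25, 0.5, 0.75):
    i = int(frac * T)
    print(f"t/T={frac}: linear abar={abar_linear[i-1]:.3f}  "
          f"cosine abar={abar_cosine[i]:.3f}")
```

The printout shows the linear schedule destroying nearly all signal by $t/T = 0.75$, while the cosine schedule still retains some — which is exactly why it wastes fewer timesteps on near-pure noise.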

8. DDIM and faster samplers

Vanilla DDPM sampling needs $T \approx 1000$ forward passes through the network to produce one image. For a large model that's slow — seconds to minutes per sample. An enormous amount of research has gone into cutting the step count without losing quality. The canonical example is DDIM (Song et al., 2020): a deterministic, non-Markovian sampler that reuses the same trained network but takes larger jumps, producing comparable samples in 20–50 steps. Later ODE-solver methods and distillation techniques push the count down further still.

9. Conditional generation and classifier-free guidance

So far we have described unconditional generation — sampling from $p(x)$ of the training distribution. For text-to-image models we want $p(x \mid \text{text})$. There are two main ways to condition a diffusion model on text.

Classifier guidance

Train a classifier $p_\phi(\text{text} \mid x_t)$ that predicts text captions from noisy images. At sampling time, add the gradient of the classifier's log-probability to the noise prediction, pushing the sample toward images the classifier thinks match the prompt. Works, but requires training a separate classifier on noisy images — annoying.

Classifier-free guidance (CFG)

The method that actually made text-to-image work. Train a single model that is jointly conditional and unconditional: with some probability (say 10%) drop the conditioning during training, so the model learns both $\epsilon_\theta(x_t, t, c)$ and $\epsilon_\theta(x_t, t, \varnothing)$. At sampling time, compute both and extrapolate:

$$\tilde{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w \left[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\right]$$

where $w$ is the guidance scale. $w = 1$ is plain conditional sampling; $w > 1$ exaggerates the influence of the prompt. Increasing $w$ produces images that more faithfully match the prompt at the cost of some realism. Typical values: $w \approx 7.5$ for Stable Diffusion.
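The guidance formula is one line of code. A sketch with dummy noise predictions standing in for real network outputs:

```python
import numpy as np

w = 7.0                                   # guidance scale
eps_uncond = np.array([0.1, -0.3])        # eps_theta(x_t, t, ∅)  (dummy values)
eps_cond   = np.array([0.4,  0.2])        # eps_theta(x_t, t, c)  (dummy values)

eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
print(eps_guided)   # [2.2  3.2]: the conditional direction, exaggerated 7x
```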

10. Latent diffusion

Running diffusion directly on $512 \times 512 \times 3$ images is expensive — the U-Net has to process tensors with 786,432 values at every step, $\times$ hundreds of steps. Latent diffusion (Rombach et al., 2022) fixes this by first compressing images into a much smaller latent space using a pretrained autoencoder, then running diffusion in that latent space.

  1. Train a VAE: encoder $E$ maps $x \mapsto z$ (e.g. $512^2 \times 3 \to 64^2 \times 4$ — about 48× smaller), decoder $D$ maps $z \mapsto \hat{x}$.
  2. Train the diffusion model in $z$-space: add noise to $z_0 = E(x_0)$, predict the noise, etc. The U-Net is now much smaller.
  3. At sampling: generate $z_0$ via DDPM/DDIM, then decode $x_0 = D(z_0)$.

This is how Stable Diffusion works. The diffusion U-Net operates on $64 \times 64 \times 4$ latents (~16k values) regardless of the final image size. Text conditioning is injected through cross-attention with embeddings from a CLIP or T5 text encoder. The result: state-of-the-art image generation on consumer GPUs.
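The compute saving is just shape arithmetic, assuming Stable Diffusion-sized tensors:

```python
import numpy as np

# Shape bookkeeping for latent diffusion (Stable Diffusion-style sizes):
pixel_shape  = (512, 512, 3)     # what the user sees
latent_shape = (64, 64, 4)       # what the diffusion U-Net actually processes

pixels  = int(np.prod(pixel_shape))    # values per image in pixel space
latents = int(np.prod(latent_shape))   # values per image in latent space
print(pixels, latents, pixels // latents)   # 786432 16384 48: the 48x saving
```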


11. Summary

ONE-PARAGRAPH SUMMARY

Diffusion models define a fixed forward process that gradually corrupts data into Gaussian noise, and learn a neural network that reverses one step of that corruption at a time. The training objective reduces to a simple MSE regression: predict the noise in a partially-noised image. Sampling starts from pure noise and iterates the learned denoiser hundreds of times, producing a fresh sample from the data distribution. Latent diffusion shrinks the compute by diffusing a compressed latent representation, and classifier-free guidance turns the whole thing into a text-to-image system. All modern image, video, and audio generators — Stable Diffusion, DALL-E, Imagen, Sora — are built on this recipe.

Where to go next