Generative Adversarial Networks
Two networks — a forger and a detective — locked in a zero-sum game, each training the other until the forgeries become indistinguishable from reality. Ian Goodfellow invented this in one night in 2014 and the idea ate computer vision for five years.
1. The forger and the inspector
Imagine a counterfeiter trying to print fake banknotes, and a bank inspector trying to catch fakes. Neither has any special training at first — the counterfeiter produces blurry smudges, the inspector rejects anything that isn't the exact right shade of green. But they interact. Every time the inspector catches a fake, the counterfeiter updates their technique: that shade of ink was wrong, that paper was too thin, that portrait was off by a millimeter. Every time a fake slips through, the inspector adjusts: watch the watermark, check the hologram, trust the serial number pattern less. After thousands of rounds, both have gotten good. The counterfeiter is producing bills that even a trained inspector can't reliably distinguish from real ones — and in the process, the forger has implicitly learned the full structure of what a real banknote looks like.
That scenario is a Generative Adversarial Network, almost word for word. Goodfellow's 2014 paper proposed replacing the counterfeiter with a neural network $G$ (the generator) and the inspector with another neural network $D$ (the discriminator), and letting them train each other. At the end, if everything works, $G$ can sample from a distribution that matches your training data — faces, paintings, shoes, bedrooms, anything — even though you never told it what "looking like a face" mathematically means.
Before GANs, generating images meant either explicitly modelling a pixel-level probability distribution (slow, with blurry results) or training an autoencoder (also blurry, because an L2 reconstruction loss averages over all plausible outputs). GANs sidestepped the likelihood entirely and asked "can another network tell the difference?" That's a vastly more sensitive loss signal than "is the pixel value within 0.01 of the target?"
Source — GAN Generator
// MLP Generator: z in R^latent_dim -> fake image in R^output_dim
// Architecture: Linear -> ReLU -> Linear -> Tanh
function Generator(z, W1, b1, W2, b2):
h = relu(W1 @ z + b1) // hidden layer
x = tanh(W2 @ h + b2) // output in [-1, 1]
return x
// W1: shape (hidden_dim, latent_dim)
// b1: shape (hidden_dim,)
// W2: shape (output_dim, hidden_dim)
// b2: shape (output_dim,)
import numpy as np
def relu(x):
return np.maximum(0, x)
def tanh(x):
return np.tanh(x)
class Generator:
def __init__(self, latent_dim, hidden_dim, output_dim):
        # He initialisation: scale = sqrt(2 / fan_in), suited to ReLU layers
scale1 = np.sqrt(2.0 / latent_dim)
scale2 = np.sqrt(2.0 / hidden_dim)
self.W1 = np.random.randn(hidden_dim, latent_dim) * scale1
self.b1 = np.zeros(hidden_dim)
self.W2 = np.random.randn(output_dim, hidden_dim) * scale2
self.b2 = np.zeros(output_dim)
def forward(self, z):
# z: (batch, latent_dim)
h = relu(z @ self.W1.T + self.b1) # (batch, hidden_dim)
x = tanh(h @ self.W2.T + self.b2) # (batch, output_dim)
return x
# Example
G = Generator(latent_dim=100, hidden_dim=256, output_dim=784)
z = np.random.randn(16, 100) # batch of 16 noise vectors
fake = G.forward(z) # (16, 784) fake images in [-1, 1]
// Minimal MLP generator in plain JS (no framework)
function relu(x) { return x.map(v => Math.max(0, v)); }
function tanh(x) { return x.map(v => Math.tanh(v)); }
// matMulAdd: y = x @ W^T + b
// x: Float64Array length n, W: Float64Array length (m*n), b: Float64Array length m
function matMulAdd(x, W, b, m, n) {
const y = new Float64Array(m);
for (let i = 0; i < m; i++) {
let s = b[i];
for (let j = 0; j < n; j++) s += W[i * n + j] * x[j];
y[i] = s;
}
return y;
}
function generatorForward(z, W1, b1, W2, b2, hiddenDim, outputDim, latentDim) {
const h = relu(Array.from(matMulAdd(z, W1, b1, hiddenDim, latentDim)));
const x = tanh(Array.from(matMulAdd(h, W2, b2, outputDim, hiddenDim)));
return x;
}
#include <math.h>
#include <stdlib.h>
static void mat_mul_add(const float *x, const float *W, const float *b,
float *y, int m, int n) {
for (int i = 0; i < m; i++) {
float s = b[i];
for (int j = 0; j < n; j++) s += W[i * n + j] * x[j];
y[i] = s;
}
}
/* Generator forward: z (latent_dim) -> fake (output_dim) */
void generator_forward(const float *z,
const float *W1, const float *b1,
const float *W2, const float *b2,
float *hidden, float *out,
int latent, int hidden_dim, int out_dim) {
mat_mul_add(z, W1, b1, hidden, hidden_dim, latent);
for (int i = 0; i < hidden_dim; i++)
hidden[i] = hidden[i] > 0.f ? hidden[i] : 0.f; /* ReLU */
mat_mul_add(hidden, W2, b2, out, out_dim, hidden_dim);
for (int i = 0; i < out_dim; i++)
out[i] = tanhf(out[i]); /* Tanh */
}
#include <vector>
#include <cmath>
#include <numeric>
struct Generator {
int latent, hidden, output;
std::vector<float> W1, b1, W2, b2;
Generator(int latent, int hidden, int output)
: latent(latent), hidden(hidden), output(output),
W1(hidden * latent), b1(hidden),
W2(output * hidden), b2(output) { /* init weights here */ }
std::vector<float> forward(const std::vector<float> &z) const {
std::vector<float> h(hidden), x(output);
for (int i = 0; i < hidden; i++) {
float s = b1[i];
for (int j = 0; j < latent; j++) s += W1[i * latent + j] * z[j];
h[i] = s > 0.f ? s : 0.f; // ReLU
}
for (int i = 0; i < output; i++) {
float s = b2[i];
for (int j = 0; j < hidden; j++) s += W2[i * hidden + j] * h[j];
x[i] = std::tanh(s); // Tanh
}
return x;
}
};
public class Generator {
private final int latentDim, hiddenDim, outputDim;
private final float[] W1, b1, W2, b2;
public Generator(int latentDim, int hiddenDim, int outputDim) {
this.latentDim = latentDim;
this.hiddenDim = hiddenDim;
this.outputDim = outputDim;
W1 = new float[hiddenDim * latentDim];
b1 = new float[hiddenDim];
W2 = new float[outputDim * hiddenDim];
b2 = new float[outputDim];
// TODO: initialise weights (Xavier / He)
}
public float[] forward(float[] z) {
float[] h = new float[hiddenDim];
for (int i = 0; i < hiddenDim; i++) {
float s = b1[i];
for (int j = 0; j < latentDim; j++) s += W1[i * latentDim + j] * z[j];
h[i] = Math.max(0f, s); // ReLU
}
float[] x = new float[outputDim];
for (int i = 0; i < outputDim; i++) {
float s = b2[i];
for (int j = 0; j < hiddenDim; j++) s += W2[i * hiddenDim + j] * h[j];
x[i] = (float) Math.tanh(s); // Tanh
}
return x;
}
}
package main
import "math"
type Generator struct {
LatentDim, HiddenDim, OutputDim int
W1, b1, W2, b2 []float64
}
func (g *Generator) Forward(z []float64) []float64 {
h := make([]float64, g.HiddenDim)
for i := range h {
s := g.b1[i]
for j, zj := range z {
s += g.W1[i*g.LatentDim+j] * zj
}
if s < 0 {
s = 0 // ReLU
}
h[i] = s
}
x := make([]float64, g.OutputDim)
for i := range x {
s := g.b2[i]
for j, hj := range h {
s += g.W2[i*g.HiddenDim+j] * hj
}
x[i] = math.Tanh(s) // Tanh
}
return x
}
2. The adversarial framework
Two networks:
- Generator $G$: takes a random vector $z \sim p_z$ (drawn from a simple prior — typically a unit Gaussian) and outputs a "fake" sample $G(z)$ of the same shape as a real data point (e.g. a 64×64×3 image).
- Discriminator $D$: takes a sample $x$ (either real from the dataset or fake from $G$) and outputs a single number $D(x) \in [0, 1]$ — its estimate of the probability that $x$ is real.
They have opposing goals. $D$ wants to output 1 on real samples and 0 on fake ones. $G$ wants to make $D$ output 1 on its fakes — that is, $G$ wants to fool $D$. You train them alternately: a few steps improving $D$, a few steps improving $G$, repeat.
Source — GAN Discriminator
// MLP Discriminator: image in R^input_dim -> probability in [0,1]
// Architecture: Linear -> LeakyReLU -> Linear -> Sigmoid
function Discriminator(x, W1, b1, W2, b2):
h = leaky_relu(W1 @ x + b1, alpha=0.2)
logit = W2 @ h + b2 // scalar
prob = sigmoid(logit) // in (0, 1)
return prob
import numpy as np
def leaky_relu(x, alpha=0.2):
return np.where(x > 0, x, alpha * x)
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
class Discriminator:
def __init__(self, input_dim, hidden_dim):
scale1 = np.sqrt(2.0 / input_dim)
scale2 = np.sqrt(2.0 / hidden_dim)
self.W1 = np.random.randn(hidden_dim, input_dim) * scale1
self.b1 = np.zeros(hidden_dim)
self.W2 = np.random.randn(1, hidden_dim) * scale2
self.b2 = np.zeros(1)
def forward(self, x):
# x: (batch, input_dim)
h = leaky_relu(x @ self.W1.T + self.b1) # (batch, hidden_dim)
logit = h @ self.W2.T + self.b2 # (batch, 1)
return sigmoid(logit) # probability in (0, 1)
# Example
D = Discriminator(input_dim=784, hidden_dim=256)
images = np.random.randn(16, 784)
probs = D.forward(images) # (16, 1), each in (0, 1)
function leakyRelu(x, alpha = 0.2) {
return x.map(v => v > 0 ? v : alpha * v);
}
function sigmoid(x) {
return x.map(v => 1 / (1 + Math.exp(-v)));
}
function discriminatorForward(x, W1, b1, W2, b2, inputDim, hiddenDim) {
const h = leakyRelu(Array.from(matMulAdd(x, W1, b1, hiddenDim, inputDim)));
const lg = matMulAdd(h, W2, b2, 1, hiddenDim);
return sigmoid(Array.from(lg))[0]; // scalar probability
}
// matMulAdd defined in Generator snippet
#include <math.h>
/* Discriminator forward: x (input_dim) -> probability in (0,1) */
float discriminator_forward(const float *x,
const float *W1, const float *b1,
const float *W2, const float *b2,
float *hidden,
int input_dim, int hidden_dim) {
/* Layer 1: LeakyReLU */
for (int i = 0; i < hidden_dim; i++) {
float s = b1[i];
for (int j = 0; j < input_dim; j++) s += W1[i * input_dim + j] * x[j];
hidden[i] = s > 0.f ? s : 0.2f * s;
}
/* Layer 2: linear + sigmoid */
float logit = b2[0];
for (int j = 0; j < hidden_dim; j++) logit += W2[j] * hidden[j];
return 1.f / (1.f + expf(-logit));
}
#include <vector>
#include <cmath>
struct Discriminator {
int inputDim, hiddenDim;
std::vector<float> W1, b1, W2, b2;
Discriminator(int inputDim, int hiddenDim)
: inputDim(inputDim), hiddenDim(hiddenDim),
W1(hiddenDim * inputDim), b1(hiddenDim),
W2(hiddenDim), b2(1) { /* init weights */ }
float forward(const std::vector<float> &x) const {
std::vector<float> h(hiddenDim);
for (int i = 0; i < hiddenDim; i++) {
float s = b1[i];
for (int j = 0; j < inputDim; j++) s += W1[i * inputDim + j] * x[j];
h[i] = s > 0.f ? s : 0.2f * s; // Leaky ReLU
}
float logit = b2[0];
for (int j = 0; j < hiddenDim; j++) logit += W2[j] * h[j];
return 1.f / (1.f + std::exp(-logit)); // Sigmoid
}
};
public class Discriminator {
private final int inputDim, hiddenDim;
private final float[] W1, b1, W2, b2;
public Discriminator(int inputDim, int hiddenDim) {
this.inputDim = inputDim;
this.hiddenDim = hiddenDim;
W1 = new float[hiddenDim * inputDim];
b1 = new float[hiddenDim];
W2 = new float[hiddenDim];
b2 = new float[1];
}
public float forward(float[] x) {
float[] h = new float[hiddenDim];
for (int i = 0; i < hiddenDim; i++) {
float s = b1[i];
for (int j = 0; j < inputDim; j++) s += W1[i * inputDim + j] * x[j];
h[i] = s > 0f ? s : 0.2f * s; // Leaky ReLU
}
float logit = b2[0];
for (int j = 0; j < hiddenDim; j++) logit += W2[j] * h[j];
return (float)(1.0 / (1.0 + Math.exp(-logit))); // Sigmoid
}
}
package main
import "math"
type Discriminator struct {
InputDim, HiddenDim int
W1, b1, W2, b2 []float64
}
func (d *Discriminator) Forward(x []float64) float64 {
h := make([]float64, d.HiddenDim)
for i := range h {
s := d.b1[i]
for j, xj := range x {
s += d.W1[i*d.InputDim+j] * xj
}
if s < 0 {
s = 0.2 * s // Leaky ReLU
}
h[i] = s
}
logit := d.b2[0]
for j, hj := range h {
logit += d.W2[j] * hj
}
return 1.0 / (1.0 + math.Exp(-logit)) // Sigmoid
}
3. The minimax objective
Goodfellow formalized the game as a single saddle-point optimization:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Every piece of the objective
- $G$
- Generator network. Maps a noise vector $z$ to a sample. Usually built from transpose convolutions or an MLP.
- $D$
- Discriminator network. Maps a sample to a single scalar in $[0,1]$. Usually built from convolutions ending in a sigmoid.
- $p_{\text{data}}$
- The true data distribution — the thing you want to learn. You never see this explicitly; you only see samples from it (your training set).
- $p_z$
- A simple prior distribution over the noise input. Almost always $\mathcal{N}(0, I)$ or uniform $[-1, 1]^d$. It's "simple" because the generator does the hard work of pushing it into a complicated shape.
- $\mathbb{E}_{x \sim p_{\text{data}}}$
- Expectation over real data. In practice: average over a minibatch sampled from your training set.
- $\log D(x)$
- Discriminator's log-probability that a real sample is real. $D$ wants this near 0 (meaning $D(x) \approx 1$). Maximizing $\log D(x)$ pushes $D$'s output on real data toward 1.
- $\log(1 - D(G(z)))$
- Discriminator's log-probability that a fake sample is fake. $D$ wants $D(G(z)) \approx 0$, which makes $1 - D(G(z)) \approx 1$ and $\log(1 - D(G(z))) \approx 0$. $G$ wants the opposite — it wants $D(G(z)) \approx 1$, which makes this term very negative.
- $\min_G \max_D$
- $D$ plays first, maximizing $V$; $G$ plays second, minimizing the resulting value. In practice we just alternate gradient steps.
Analogy: Think of a courtroom. $D$ is the judge scoring how "real" each piece of evidence looks (first term = score real evidence high, second term = score forgeries low). $G$ is the forger, and gets a grade based on how badly $D$ ranks the forgeries. Maximizing $V$ makes the judge more discerning; minimizing $V$ makes the forger's output look more like evidence the judge accepts. The Nash equilibrium is the point where the forger's output is good enough that the best possible judge can only guess 50/50.
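The objective is just two averages, so it can be estimated directly from a minibatch of discriminator outputs. A minimal NumPy sketch (the discriminator outputs below are invented for illustration):

```python
import numpy as np

def value_fn(d_real, d_fake, eps=1e-8):
    """Monte-Carlo estimate of V(D, G) from minibatch discriminator outputs."""
    d_real = np.clip(d_real, eps, 1 - eps)   # D(x) on real samples
    d_fake = np.clip(d_fake, eps, 1 - eps)   # D(G(z)) on fake samples
    return np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake))

# A confident, correct discriminator drives V toward 0 from below ...
v_sharp = value_fn(np.array([0.99, 0.98]), np.array([0.01, 0.02]))
# ... while at the equilibrium D = 1/2 everywhere, V = -2 log 2 ~ -1.386
v_equilibrium = value_fn(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(v_sharp, v_equilibrium)
```

The gap between these two values is exactly what the discriminator step maximizes and the generator step shrinks.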
Goodfellow proved that at the global optimum, $p_g = p_{\text{data}}$ — that is, the distribution induced by pushing $p_z$ through $G$ matches the real data distribution exactly — and $D \equiv 1/2$ everywhere, meaning the discriminator literally cannot tell the difference. Beautiful theory. In practice, reaching that fixed point is extraordinarily finicky.
For a fixed $G$, the $D$ that maximizes $V(D, G)$ is:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
Substituting back, the generator ends up minimizing $2 \cdot \text{JSD}(p_{\text{data}} \| p_g) - 2\log 2$, where $\text{JSD}$ is the Jensen–Shannon divergence. So the generator is (in theory) doing gradient descent on a proper distribution-distance metric.
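Both facts — the closed-form optimal discriminator and the JSD identity — are easy to verify numerically for small discrete distributions. A NumPy sketch (the two categorical distributions are arbitrary illustrative choices):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Two arbitrary categorical distributions standing in for p_data and p_g
p_data = np.array([0.7, 0.2, 0.1])
p_g    = np.array([0.1, 0.3, 0.6])

# Optimal discriminator for fixed G: D*(x) = p_data(x) / (p_data(x) + p_g(x))
d_star = p_data / (p_data + p_g)

# V(D*, G) = E_{p_data}[log D*] + E_{p_g}[log(1 - D*)]
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Matches 2 * JSD(p_data || p_g) - 2 log 2, as the theory predicts
print(v, 2 * jsd(p_data, p_g) - 2 * np.log(2))
```

The two printed numbers agree to machine precision, which is the whole content of the "generator minimizes JSD" result.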
4. The training loop
The practical algorithm is a double loop:
for number_of_training_iterations:
# --- Discriminator phase ---
for k steps:
sample minibatch of m noise vectors z_1, ..., z_m ~ p_z
sample minibatch of m real samples x_1, ..., x_m ~ p_data
∇_θ_D (1/m) Σ [ log D(x_i) + log(1 - D(G(z_i))) ]
θ_D ← θ_D + lr * gradient # ascent on V
# --- Generator phase ---
sample minibatch of m noise vectors z_1, ..., z_m ~ p_z
∇_θ_G (1/m) Σ log(1 - D(G(z_i)))
θ_G ← θ_G - lr * gradient # descent on V
One subtlety: in practice, minimizing $\log(1 - D(G(z)))$ gives terrible gradients early in training. When $G$ is bad, $D(G(z)) = \sigma(a) \approx 0$ (where $a$ is the discriminator's logit), and the gradient of $\log(1 - \sigma(a))$ with respect to $a$ is $-\sigma(a) \approx 0$: the loss saturates exactly where the generator most needs a signal. The standard fix — also from Goodfellow's paper — is to instead maximize $\log D(G(z))$ for the generator step. It has the same fixed point, but its gradient with respect to $a$ is $1 - \sigma(a) \approx 1$, which stays healthy early on. This is called the non-saturating loss.
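The saturation is easiest to see by comparing the two generator losses' gradients with respect to the discriminator's logit. A small NumPy sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Generator-loss gradients w.r.t. the discriminator logit a, where D = sigmoid(a):
#   saturating      d/da  log(1 - sigmoid(a)) = -sigmoid(a)
#   non-saturating  d/da  log(sigmoid(a))     =  1 - sigmoid(a)
a = np.array([-6.0, -4.0, -2.0, 0.0])     # early training: D(G(z)) near 0
d = sigmoid(a)

grad_saturating = -d                      # vanishes as d -> 0
grad_non_saturating = 1.0 - d             # stays near 1 as d -> 0

for di, gs, gn in zip(d, grad_saturating, grad_non_saturating):
    print(f"D(G(z))={di:.4f}  saturating={gs:+.4f}  non-saturating={gn:+.4f}")
```

When the discriminator confidently rejects a fake, the saturating loss contributes almost nothing, while the non-saturating loss delivers a near-unit gradient.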
Interactive: 1-D GAN learning a bimodal distribution
Forget images — here's a GAN learning to match a 1-D distribution that's a mixture of two Gaussians. The generator is a single neural network mapping a scalar noise $z \sim \mathcal{N}(0, 1)$ to a scalar output. The histogram of its outputs (pink) should drift toward matching the real distribution (cyan) as you train. Press Step 100 → to run 100 training iterations. Watch how it often finds one mode before the other — a textbook mode collapse failure you can see happen live.
What to look for
Real distribution: $0.5 \cdot \mathcal{N}(-2, 0.5) + 0.5 \cdot \mathcal{N}(+2, 0.5)$. Two humps at $-2$ and $+2$.
At iter 0 the generator is a random MLP, so its outputs are a single blob near zero. Over the first few hundred iterations it should converge onto one of the two modes; covering both is harder. If you see it commit hard to one side only, that's mode collapse, the classic GAN failure mode.
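The demo's target distribution takes only a few lines to sample. A NumPy sketch (assuming the $0.5$ in $\mathcal{N}(\pm 2, 0.5)$ is the standard deviation of each component):

```python
import numpy as np

def sample_real(n, rng=None):
    """Sample the demo's target: 0.5 * N(-2, 0.5) + 0.5 * N(+2, 0.5).

    Treats the 0.5 inside N(., 0.5) as the standard deviation.
    """
    rng = rng or np.random.default_rng(0)
    modes = rng.choice([-2.0, 2.0], size=n)      # pick a hump, 50/50
    return modes + 0.5 * rng.standard_normal(n)  # Gaussian jitter around it

x = sample_real(10_000)
# Mixture mean is 0; mixture variance is 2^2 + 0.5^2 = 4.25, so std ~ 2.06
print(x.mean(), x.std())
```

A healthy generator's output histogram should match both humps of this sample; a collapsed one matches only the left or only the right.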
Source — GAN Training Loop
// Alternating D / G updates with binary cross-entropy loss
// BCE(y_hat, y) = -[ y*log(y_hat) + (1-y)*log(1-y_hat) ]
for epoch in 1..num_epochs:
for real_batch in dataloader:
// ---- Train Discriminator ----
z = sample_noise(batch_size, latent_dim) // z ~ N(0,I)
fake = G.forward(z) // generated samples
d_real = D.forward(real_batch)
d_fake = D.forward(fake) // detach G gradient
d_loss = BCE(d_real, ones) + BCE(d_fake, zeros)
D.backward(d_loss)
D.step()
// ---- Train Generator (non-saturating loss) ----
z = sample_noise(batch_size, latent_dim)
fake = G.forward(z)
d_fake = D.forward(fake)
g_loss = BCE(d_fake, ones) // fool D: label fakes as real
G.backward(g_loss)
G.step()
import numpy as np
# Assume Generator and Discriminator classes from above snippets.
# This is a pure-NumPy training loop (manual backprop sketch).
def bce(y_hat, y, eps=1e-8):
"""Binary cross-entropy, scalar."""
y_hat = np.clip(y_hat, eps, 1 - eps)
return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
def train_gan(G, D, data, num_epochs=100, batch_size=64,
latent_dim=100, lr=0.0002):
n = len(data)
for epoch in range(num_epochs):
idx = np.random.permutation(n)
for i in range(0, n - batch_size, batch_size):
real = data[idx[i:i + batch_size]] # (B, input_dim)
# --- Discriminator update ---
z = np.random.randn(batch_size, latent_dim)
fake = G.forward(z) # (B, input_dim) -- detach in autograd frameworks
d_real = D.forward(real) # (B, 1)
d_fake = D.forward(fake) # (B, 1)
d_loss = bce(d_real, np.ones_like(d_real)) \
+ bce(d_fake, np.zeros_like(d_fake))
# D.backward(d_loss); D.step() <-- framework-specific
# --- Generator update ---
z = np.random.randn(batch_size, latent_dim)
fake = G.forward(z)
d_fake = D.forward(fake)
g_loss = bce(d_fake, np.ones_like(d_fake)) # non-saturating
# G.backward(g_loss); G.step()
// Conceptual GAN training loop in JavaScript
// Assumes generatorForward / discriminatorForward from earlier snippets
function bce(yHat, y, eps = 1e-8) {
let loss = 0;
for (let i = 0; i < yHat.length; i++) {
const p = Math.min(Math.max(yHat[i], eps), 1 - eps);
loss -= y[i] * Math.log(p) + (1 - y[i]) * Math.log(1 - p);
}
return loss / yHat.length;
}
function trainGAN(G, D, data, epochs = 100, batchSize = 64, latentDim = 100) {
for (let epoch = 0; epoch < epochs; epoch++) {
for (let i = 0; i + batchSize <= data.length; i += batchSize) {
const real = data.slice(i, i + batchSize);
// Discriminator step
const z1 = sampleNoise(batchSize, latentDim);
const fake1 = z1.map(zi => G.forward(zi));
const dReal = real.map(x => D.forward(x));
const dFake = fake1.map(f => D.forward(f));
const dLoss = bce(dReal, Array(batchSize).fill(1))
+ bce(dFake, Array(batchSize).fill(0));
// D.backward(dLoss); D.step();
// Generator step (non-saturating)
const z2 = sampleNoise(batchSize, latentDim);
const fake2 = z2.map(zi => G.forward(zi));
const dFake2 = fake2.map(f => D.forward(f));
const gLoss = bce(dFake2, Array(batchSize).fill(1));
// G.backward(gLoss); G.step();
}
}
}
/* Conceptual GAN training loop in C (no autograd — shows structure only) */
#include <math.h>
#include <stdio.h>
float bce(const float *y_hat, const float *y, int n) {
float loss = 0.f;
for (int i = 0; i < n; i++) {
float p = y_hat[i] < 1e-7f ? 1e-7f : (y_hat[i] > 1-1e-7f ? 1-1e-7f : y_hat[i]);
loss -= y[i] * logf(p) + (1.f - y[i]) * logf(1.f - p);
}
return loss / n;
}
void train_step(/* G, D weights, gradients, optimiser state ... */
const float *real_batch, int batch, int latent, int input_dim) {
/* 1. Sample noise */
float z[batch * latent]; /* VLA or malloc in practice */
sample_normal(z, batch * latent);
/* 2. Generator forward -> fake */
float fake[batch * input_dim];
generator_forward_batch(z, fake, batch, latent, input_dim);
/* 3. Discriminator on real + fake */
float d_real[batch], d_fake[batch];
discriminator_forward_batch(real_batch, d_real, batch, input_dim);
discriminator_forward_batch(fake, d_fake, batch, input_dim);
/* 4. D loss and backward (framework-specific) */
float ones[batch], zeros[batch];
for (int i = 0; i < batch; i++) { ones[i] = 1.f; zeros[i] = 0.f; }
float d_loss = bce(d_real, ones, batch) + bce(d_fake, zeros, batch);
/* d_backward(d_loss); d_step(); */
/* 5. G step (non-saturating) */
/* generator_forward_batch(z_new, fake_new, ...); */
/* float g_loss = bce(d_fake_new, ones, batch); */
/* g_backward(g_loss); g_step(); */
}
#include <vector>
#include <cmath>
#include <algorithm>
float bce(const std::vector<float> &yHat,
const std::vector<float> &y, float eps = 1e-7f) {
float loss = 0.f;
for (size_t i = 0; i < yHat.size(); i++) {
float p = std::clamp(yHat[i], eps, 1.f - eps);
loss -= y[i] * std::log(p) + (1.f - y[i]) * std::log(1.f - p);
}
return loss / static_cast<float>(yHat.size());
}
void trainStep(Generator &G, Discriminator &D,
const std::vector<std::vector<float>> &realBatch,
int latentDim) {
int B = static_cast<int>(realBatch.size());
// Discriminator step
auto z = sampleNoise(B, latentDim);
auto fake = G.forwardBatch(z); // detach from G graph
auto dReal = D.forwardBatch(realBatch);
auto dFake = D.forwardBatch(fake);
float dLoss = bce(dReal, std::vector<float>(B, 1.f))
+ bce(dFake, std::vector<float>(B, 0.f));
// D.backward(dLoss); D.step();
// Generator step (non-saturating)
auto z2 = sampleNoise(B, latentDim);
auto fake2 = G.forwardBatch(z2);
auto dFake2 = D.forwardBatch(fake2);
float gLoss = bce(dFake2, std::vector<float>(B, 1.f));
// G.backward(gLoss); G.step();
}
public class GANTrainer {
static float bce(float[] yHat, float[] y) {
float loss = 0f;
for (int i = 0; i < yHat.length; i++) {
float p = Math.max(1e-7f, Math.min(1 - 1e-7f, yHat[i]));
loss -= y[i] * Math.log(p) + (1 - y[i]) * Math.log(1 - p);
}
return loss / yHat.length;
}
public static void trainStep(Generator G, Discriminator D,
float[][] realBatch, int latentDim) {
int B = realBatch.length;
float[][] z = sampleNoise(B, latentDim);
float[][] fake = G.forwardBatch(z); // (B, input_dim)
float[] dReal = D.forwardBatch(realBatch);
float[] dFake = D.forwardBatch(fake);
float[] ones = new float[B]; java.util.Arrays.fill(ones, 1f);
float[] zeros = new float[B]; java.util.Arrays.fill(zeros, 0f);
float dLoss = bce(dReal, ones) + bce(dFake, zeros);
// D.backward(dLoss); D.step();
float[][] z2 = sampleNoise(B, latentDim);
float[][] fake2 = G.forwardBatch(z2);
float[] dFake2 = D.forwardBatch(fake2);
float gLoss = bce(dFake2, ones);
// G.backward(gLoss); G.step();
}
}
package main
import (
"math"
"math/rand"
)
func bce(yHat, y []float64) float64 {
const eps = 1e-7
loss := 0.0
for i := range yHat {
p := math.Max(eps, math.Min(1-eps, yHat[i]))
loss -= y[i]*math.Log(p) + (1-y[i])*math.Log(1-p)
}
return loss / float64(len(yHat))
}
func trainStep(G *Generator, D *Discriminator,
realBatch [][]float64, latentDim int) {
B := len(realBatch)
ones := make([]float64, B); for i := range ones { ones[i] = 1 }
zeros := make([]float64, B)
// D step
z := sampleNoise(B, latentDim)
fake := G.ForwardBatch(z) // detach
dReal := D.ForwardBatch(realBatch)
dFake := D.ForwardBatch(fake)
dLoss := bce(dReal, ones) + bce(dFake, zeros)
_ = dLoss // D.Backward(dLoss); D.Step()
// G step
z2 := sampleNoise(B, latentDim)
fake2 := G.ForwardBatch(z2)
dFake2 := D.ForwardBatch(fake2)
gLoss := bce(dFake2, ones)
_ = gLoss // G.Backward(gLoss); G.Step()
}
func sampleNoise(B, dim int) [][]float64 {
z := make([][]float64, B)
for i := range z {
z[i] = make([]float64, dim)
for j := range z[i] { z[i][j] = rand.NormFloat64() }
}
return z
}
5. Why training is so hard
The GAN literature from 2015–2019 reads like a medical journal of chronic illnesses. Here are the main ones:
- Mode collapse. The generator finds one output (or a small set) that consistently fools $D$, and collapses all its noise inputs to that output. The discriminator eventually adapts and the generator jumps to another mode, and on and on. The generator never actually covers the full data distribution.
- Vanishing discriminator gradients. If $D$ gets too good too fast, $D(G(z))$ saturates at 0, and $\log(1 - D(G(z)))$ has zero slope. $G$ gets no learning signal and training stalls.
- Non-convergence / oscillation. The two networks chase each other around forever, neither improving. A classic symptom of minimax games: even simple games like $\min_x \max_y xy$ don't converge with gradient descent — they orbit.
- Hyperparameter fragility. A learning rate that worked yesterday may not work today. The ratio of $D$ to $G$ steps matters. The batch size matters. Whether you use batch norm matters. The dimension of $z$ matters.
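The orbiting behaviour on $\min_x \max_y xy$ mentioned above can be reproduced in a few lines. Under simultaneous gradient descent-ascent, each step multiplies the distance from the equilibrium $(0,0)$ by $\sqrt{1 + \text{lr}^2} > 1$, so the iterates spiral outward instead of converging:

```python
import math

# min_x max_y V(x, y) = x * y, the simplest saddle-point game.
# Simultaneous gradient descent-ascent:
#   x <- x - lr * dV/dx = x - lr * y   (descent for the minimiser)
#   y <- y + lr * dV/dy = y + lr * x   (ascent for the maximiser)
x, y, lr = 1.0, 1.0, 0.1
r0 = math.hypot(x, y)                  # distance from the equilibrium (0, 0)
for _ in range(200):
    x, y = x - lr * y, y + lr * x      # simultaneous update
r200 = math.hypot(x, y)
print(r0, r200)                        # the radius grows every step
```

Two-network GAN training inherits exactly this rotational dynamic, which is why tricks like alternating updates, momentum tuning, and two-timescale learning rates matter so much.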
6. Wasserstein GAN
Arjovsky, Chintala & Bottou (2017) traced many GAN pathologies back to one root cause: when $p_g$ and $p_{\text{data}}$ have disjoint support (which they almost always do in high dimensions — real image manifolds are very thin), the Jensen–Shannon divergence is a constant, so its gradient is zero everywhere $p_g$ could reasonably move. The loss literally gives no information about how to improve.
They proposed replacing JSD with the Wasserstein distance (Earth Mover's distance), which has a smooth gradient even when the two distributions don't overlap. By Kantorovich–Rubinstein duality:

$$W(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)]$$
Reading the WGAN objective
- $W$
- Wasserstein-1 distance between the two distributions — the minimum "cost" of transporting probability mass from $p_g$ to $p_{\text{data}}$, where cost is distance moved × mass moved.
- $f$
- The "critic" function. In WGAN, the discriminator is replaced by a 1-Lipschitz critic that outputs an unbounded real number (no sigmoid, no probability interpretation). A high $f$ score means "real-looking," a low score means "fake-looking."
- $\|f\|_L \le 1$
- 1-Lipschitz constraint: $|f(x) - f(y)| \le \|x - y\|$ for all $x, y$. Without this, $f$ could just output $+\infty$ on reals and $-\infty$ on fakes and the sup would be meaningless. The constraint bounds how fast the critic's opinion can change.
- $\sup_f$
- Supremum over all 1-Lipschitz functions. You can't enumerate them, so in practice you parameterize $f$ as a neural network and enforce Lipschitz-ness via weight clipping (original WGAN) or a gradient penalty (WGAN-GP).
Analogy: Jensen–Shannon divergence asks "how often can I tell these distributions apart?" — a yes/no question that's useless when the answer is "always." Wasserstein asks "how much dirt do I need to move to turn pile A into pile B?" — which always has a finite answer even when the piles don't overlap. That's what gives WGAN its smooth, informative gradient.
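The contrast is easy to demonstrate numerically for two point masses on a 1-D grid (a sketch; the grid and mass positions are arbitrary):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1(p, q, grid):
    """1-D Wasserstein-1 distance: integral of |CDF_p - CDF_q| over the grid."""
    dx = grid[1] - grid[0]
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

grid = np.linspace(0, 10, 1001)
p = np.zeros_like(grid)
p[0] = 1.0                                   # point mass at x = 0
results = []
for shift in [100, 300, 500]:                # point mass at x = 1, 3, 5
    q = np.zeros_like(grid)
    q[shift] = 1.0
    results.append((grid[shift], jsd(p, q), w1(p, q, grid)))
    print(results[-1])
# JSD stays pinned at log 2 ~ 0.693 no matter how far apart the masses are,
# while W1 grows linearly with the distance -- a usable gradient signal.
```

Moving the fake mass closer to the real one leaves JSD unchanged but shrinks W1, which is exactly the training signal the critic provides.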
WGAN and its successors (WGAN-GP, spectral-norm GAN, StyleGAN) stabilized training enough that GANs powered the first wave of photorealistic face generation. By 2018, StyleGAN could generate faces that fooled humans at first glance. By 2019, non-existent people on thispersondoesnotexist.com had become a meme.
Source — Wasserstein Loss (WGAN)
// Wasserstein loss (critic has no sigmoid; outputs unbounded reals)
// Critic loss = E[D(fake)] - E[D(real)] (minimising this maximises E[D(real)] - E[D(fake)])
// Generator loss = -E[D(fake)] (generator minimises critic score on fakes)
//
// Enforce 1-Lipschitz via weight clipping (original WGAN):
// for each param p in critic: p = clip(p, -c, c)
//
// Or via gradient penalty (WGAN-GP):
// x_hat = alpha*real + (1-alpha)*fake, alpha ~ Uniform(0,1)
// gp = (||grad_x_hat D(x_hat)||_2 - 1)^2
// critic_loss = E[D(fake)] - E[D(real)] + lambda * gp
import numpy as np
# --- Weight-clipping WGAN ---
def wgan_critic_loss(d_real, d_fake):
"""d_real, d_fake: arrays of critic scores (no sigmoid)."""
return np.mean(d_fake) - np.mean(d_real) # critic wants to maximise gap
def wgan_generator_loss(d_fake):
return -np.mean(d_fake) # G wants critic to score fakes high
def clip_weights(params, c=0.01):
return [np.clip(p, -c, c) for p in params]
# --- Gradient penalty (WGAN-GP, NumPy sketch) ---
def gradient_penalty(D, real, fake, lam=10.0):
batch = real.shape[0]
alpha = np.random.uniform(0, 1, (batch, 1))
x_hat = alpha * real + (1 - alpha) * fake # interpolated samples
# In a real framework you'd compute grad of D(x_hat) w.r.t. x_hat
# Here we sketch the formula:
# grad = autograd_grad(D(x_hat), x_hat)
# grad_norm = np.linalg.norm(grad, axis=1)
# gp = lam * np.mean((grad_norm - 1) ** 2)
# return gp
    return 0.0  # placeholder so callers run; a real implementation returns gp
# Training step (WGAN-GP)
def wgan_gp_train_step(G, D, real_batch, latent_dim, n_critic=5, lam=10.0, lr=1e-4):
B = real_batch.shape[0]
# Train critic n_critic times per G step
for _ in range(n_critic):
z = np.random.randn(B, latent_dim)
fake = G.forward(z)
c_loss = wgan_critic_loss(D.forward(real_batch), D.forward(fake))
gp = gradient_penalty(D, real_batch, fake, lam)
# D.backward(c_loss + gp); D.step()
# Train generator once
z = np.random.randn(B, latent_dim)
fake = G.forward(z)
g_loss = wgan_generator_loss(D.forward(fake))
# G.backward(g_loss); G.step()
// WGAN loss functions in JavaScript (no autograd)
function wganCriticLoss(dReal, dFake) {
const meanReal = dReal.reduce((a, b) => a + b, 0) / dReal.length;
const meanFake = dFake.reduce((a, b) => a + b, 0) / dFake.length;
return meanFake - meanReal; // critic wants to maximise; optimiser minimises
}
function wganGeneratorLoss(dFake) {
return -dFake.reduce((a, b) => a + b, 0) / dFake.length;
}
function clipWeights(params, c = 0.01) {
return params.map(p => Math.min(Math.max(p, -c), c));
}
// Gradient penalty sketch (needs autograd in practice):
// function gradientPenalty(D, real, fake, lambda = 10) {
// const alpha = Math.random();
// const xHat = real.map((r, i) => alpha * r + (1 - alpha) * fake[i]);
// const grad = autograd(D, xHat); // framework-specific
// const norm = euclideanNorm(grad);
// return lambda * (norm - 1) ** 2;
// }
#include <math.h>
#include <stdlib.h>
float wgan_critic_loss(const float *d_real, const float *d_fake, int n) {
float sum_real = 0.f, sum_fake = 0.f;
for (int i = 0; i < n; i++) { sum_real += d_real[i]; sum_fake += d_fake[i]; }
return sum_fake / n - sum_real / n;
}
float wgan_generator_loss(const float *d_fake, int n) {
float s = 0.f;
for (int i = 0; i < n; i++) s += d_fake[i];
return -s / n;
}
/* Weight clipping after each critic update */
void clip_weights(float *params, int n, float c) {
for (int i = 0; i < n; i++) {
if (params[i] > c) params[i] = c;
else if (params[i] < -c) params[i] = -c;
}
}
#include <vector>
#include <numeric>
#include <algorithm>
float wganCriticLoss(const std::vector<float> &dReal,
const std::vector<float> &dFake) {
float mReal = std::accumulate(dReal.begin(), dReal.end(), 0.f) / dReal.size();
float mFake = std::accumulate(dFake.begin(), dFake.end(), 0.f) / dFake.size();
return mFake - mReal;
}
float wganGeneratorLoss(const std::vector<float> &dFake) {
return -std::accumulate(dFake.begin(), dFake.end(), 0.f) / dFake.size();
}
void clipWeights(std::vector<float> ¶ms, float c = 0.01f) {
for (auto &p : params) p = std::clamp(p, -c, c);
}
public class WGAN {
public static float criticLoss(float[] dReal, float[] dFake) {
float sumReal = 0f, sumFake = 0f;
for (float v : dReal) sumReal += v;
for (float v : dFake) sumFake += v;
return sumFake / dFake.length - sumReal / dReal.length;
}
public static float generatorLoss(float[] dFake) {
float sum = 0f;
for (float v : dFake) sum += v;
return -sum / dFake.length;
}
public static void clipWeights(float[] params, float c) {
for (int i = 0; i < params.length; i++) {
if (params[i] > c) params[i] = c;
else if (params[i] < -c) params[i] = -c;
}
}
}
package main
func wganCriticLoss(dReal, dFake []float64) float64 {
sumReal, sumFake := 0.0, 0.0
for _, v := range dReal { sumReal += v }
for _, v := range dFake { sumFake += v }
return sumFake/float64(len(dFake)) - sumReal/float64(len(dReal))
}
func wganGeneratorLoss(dFake []float64) float64 {
sum := 0.0
for _, v := range dFake { sum += v }
return -sum / float64(len(dFake))
}
func clipWeights(params []float64, c float64) {
for i, p := range params {
if p > c {
params[i] = c
} else if p < -c {
params[i] = -c
}
}
}
7. Source code
A minimal GAN training step in three frameworks: PyTorch, TensorFlow, and JAX. These omit boilerplate (model definitions, dataset loading, checkpointing) to focus on the core loss computation.
Source — GAN Training Step (PyTorch / TF / JAX)
import torch, torch.nn as nn, torch.nn.functional as F

def train_step(G, D, real_batch, opt_G, opt_D, z_dim):
    B = real_batch.size(0)
    device = real_batch.device
    real_lbl = torch.ones (B, 1, device=device)
    fake_lbl = torch.zeros(B, 1, device=device)

    # ---------- Discriminator step ----------
    opt_D.zero_grad()
    z = torch.randn(B, z_dim, device=device)
    fake = G(z).detach()  # don't backprop through G here
    d_loss_real = F.binary_cross_entropy_with_logits(D(real_batch), real_lbl)
    d_loss_fake = F.binary_cross_entropy_with_logits(D(fake), fake_lbl)
    d_loss = d_loss_real + d_loss_fake
    d_loss.backward()
    opt_D.step()

    # ---------- Generator step (non-saturating) ----------
    opt_G.zero_grad()
    z = torch.randn(B, z_dim, device=device)
    fake = G(z)
    g_loss = F.binary_cross_entropy_with_logits(D(fake), real_lbl)  # flip labels!
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(G, D, real_batch, opt_G, opt_D, z_dim):
    B = tf.shape(real_batch)[0]
    z = tf.random.normal([B, z_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = G(z, training=True)
        d_real = D(real_batch, training=True)
        d_fake = D(fake, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + \
                 bce(tf.zeros_like(d_fake), d_fake)
        g_loss = bce(tf.ones_like(d_fake), d_fake)  # non-saturating
    d_grad = d_tape.gradient(d_loss, D.trainable_variables)
    g_grad = g_tape.gradient(g_loss, G.trainable_variables)
    opt_D.apply_gradients(zip(d_grad, D.trainable_variables))
    opt_G.apply_gradients(zip(g_grad, G.trainable_variables))
    return d_loss, g_loss
import jax, jax.numpy as jnp
import optax
from flax import linen as nn

def bce_from_logits(logits, targets):
    return optax.sigmoid_binary_cross_entropy(logits, targets).mean()

def d_loss_fn(d_params, g_params, real, z):
    fake = G.apply(g_params, z)
    d_real = D.apply(d_params, real)
    d_fake = D.apply(d_params, jax.lax.stop_gradient(fake))
    return bce_from_logits(d_real, jnp.ones_like (d_real)) + \
           bce_from_logits(d_fake, jnp.zeros_like(d_fake))

def g_loss_fn(g_params, d_params, z):
    fake = G.apply(g_params, z)
    d_fake = D.apply(d_params, fake)
    return bce_from_logits(d_fake, jnp.ones_like(d_fake))

@jax.jit
def train_step(g_params, d_params, g_opt_state, d_opt_state, real, key):
    z = jax.random.normal(key, (real.shape[0], z_dim))
    d_grads = jax.grad(d_loss_fn)(d_params, g_params, real, z)
    d_updates, d_opt_state = d_opt.update(d_grads, d_opt_state)
    d_params = optax.apply_updates(d_params, d_updates)
    g_grads = jax.grad(g_loss_fn)(g_params, d_params, z)
    g_updates, g_opt_state = g_opt.update(g_grads, g_opt_state)
    g_params = optax.apply_updates(g_params, g_updates)
    return g_params, d_params, g_opt_state, d_opt_state
Source — GAN Evaluation (FID Sketch)
// Frechet Inception Distance (FID) — simplified sketch
// 1. Extract feature vectors from a pre-trained Inception-v3
// for N real images and N generated images.
// 2. Compute mean mu and covariance Sigma of each set.
// 3. FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2*sqrtm(Sigma_r @ Sigma_g))
//
// Lower FID = better quality and diversity.
function fid(real_features, fake_features):
    mu_r, Sigma_r = mean(real_features), cov(real_features)
    mu_g, Sigma_g = mean(fake_features), cov(fake_features)
    diff = mu_r - mu_g
    covmean = sqrtm(Sigma_r @ Sigma_g)   // matrix square root
    return dot(diff, diff) + trace(Sigma_r + Sigma_g - 2 * covmean)
import numpy as np
from scipy import linalg

def compute_fid(real_features, fake_features):
    """
    real_features, fake_features: (N, D) arrays of Inception feature vectors.
    Returns scalar FID score (lower is better).
    """
    mu_r = np.mean(real_features, axis=0)
    mu_g = np.mean(fake_features, axis=0)
    Sigma_r = np.cov(real_features, rowvar=False)
    Sigma_g = np.cov(fake_features, rowvar=False)
    diff = mu_r - mu_g
    # Matrix square root of Sigma_r @ Sigma_g. The product is not symmetric
    # in general, so scipy.linalg.sqrtm is used; a symmetric eigendecomposition
    # (np.linalg.eigh) would give the wrong answer here.
    covmean = linalg.sqrtm(Sigma_r @ Sigma_g)
    covmean = np.real(covmean)  # discard tiny imaginary parts from round-off
    fid = diff @ diff + np.trace(Sigma_r + Sigma_g - 2 * covmean)
    return float(fid)

# Usage (replace extract_features with real Inception-v3 call):
# real_feats = extract_features(real_images)    # (N, 2048)
# fake_feats = extract_features(generated_imgs) # (N, 2048)
# score = compute_fid(real_feats, fake_feats)
// Simplified FID in JavaScript (no Inception; illustrates the formula)
function mean(features) {
  const D = features[0].length, N = features.length;
  const mu = new Float64Array(D);
  for (const f of features) for (let d = 0; d < D; d++) mu[d] += f[d] / N;
  return mu;
}

function covariance(features, mu) {
  const D = mu.length, N = features.length;
  const cov = Array.from({ length: D }, () => new Float64Array(D));
  for (const f of features) {
    for (let i = 0; i < D; i++) for (let j = 0; j < D; j++)
      cov[i][j] += (f[i] - mu[i]) * (f[j] - mu[j]) / (N - 1);
  }
  return cov;
}

function fid(realFeatures, fakeFeatures) {
  const muR = mean(realFeatures), muG = mean(fakeFeatures);
  const sigR = covariance(realFeatures, muR);
  const sigG = covariance(fakeFeatures, muG);
  // Full FID requires matrix sqrt (sqrtm) — omitted here.
  // Diagonal approximation for illustration:
  let score = 0;
  for (let d = 0; d < muR.length; d++) {
    const diffSq = (muR[d] - muG[d]) ** 2;
    const trTerm = sigR[d][d] + sigG[d][d] - 2 * Math.sqrt(sigR[d][d] * sigG[d][d]);
    score += diffSq + trTerm;
  }
  return score;
}
#include <math.h>
#include <stdlib.h>

/* Diagonal approximation of FID (avoids matrix sqrt) */
float fid_diagonal(const float *mu_r, const float *mu_g,
                   const float *var_r, const float *var_g, int D) {
    float score = 0.f;
    for (int d = 0; d < D; d++) {
        float diff = mu_r[d] - mu_g[d];
        float cov_term = var_r[d] + var_g[d]
                       - 2.f * sqrtf(var_r[d] * var_g[d]);
        score += diff * diff + cov_term;
    }
    return score;
}

/* Compute per-dimension mean and variance from feature matrix (N x D) */
void mean_var(const float *features, int N, int D,
              float *mu, float *var) {
    for (int d = 0; d < D; d++) mu[d] = 0.f;
    for (int i = 0; i < N; i++)
        for (int d = 0; d < D; d++) mu[d] += features[i * D + d] / N;
    for (int d = 0; d < D; d++) var[d] = 0.f;
    for (int i = 0; i < N; i++)
        for (int d = 0; d < D; d++) {
            float diff = features[i * D + d] - mu[d];
            var[d] += diff * diff / (N - 1);
        }
}
#include <vector>
#include <cmath>
#include <numeric>
#include <utility>

// Diagonal FID approximation
float fidDiagonal(const std::vector<float> &muR,
                  const std::vector<float> &muG,
                  const std::vector<float> &varR,
                  const std::vector<float> &varG) {
    float score = 0.f;
    for (size_t d = 0; d < muR.size(); d++) {
        float diff = muR[d] - muG[d];
        float cov = varR[d] + varG[d] - 2.f * std::sqrt(varR[d] * varG[d]);
        score += diff * diff + cov;
    }
    return score;
}

std::pair<std::vector<float>, std::vector<float>>
meanVar(const std::vector<std::vector<float>> &feats) {
    int N = feats.size(), D = feats[0].size();
    std::vector<float> mu(D, 0.f), var(D, 0.f);
    for (auto &f : feats) for (int d = 0; d < D; d++) mu[d] += f[d] / N;
    for (auto &f : feats) for (int d = 0; d < D; d++) {
        float diff = f[d] - mu[d];
        var[d] += diff * diff / (N - 1);
    }
    return {mu, var};
}
public class FID {
    public static float[] mean(float[][] features) {
        int N = features.length, D = features[0].length;
        float[] mu = new float[D];
        for (float[] f : features)
            for (int d = 0; d < D; d++) mu[d] += f[d] / N;
        return mu;
    }

    public static float[] variance(float[][] features, float[] mu) {
        int N = features.length, D = mu.length;
        float[] var = new float[D];
        for (float[] f : features)
            for (int d = 0; d < D; d++) {
                float diff = f[d] - mu[d];
                var[d] += diff * diff / (N - 1);
            }
        return var;
    }

    // Diagonal FID approximation
    public static float fidDiagonal(float[] muR, float[] muG,
                                    float[] varR, float[] varG) {
        float score = 0f;
        for (int d = 0; d < muR.length; d++) {
            float diff = muR[d] - muG[d];
            float cov = varR[d] + varG[d]
                      - 2f * (float) Math.sqrt(varR[d] * varG[d]);
            score += diff * diff + cov;
        }
        return score;
    }
}
package main

import "math"

func meanVec(feats [][]float64) []float64 {
    D, N := len(feats[0]), float64(len(feats))
    mu := make([]float64, D)
    for _, f := range feats {
        for d, v := range f { mu[d] += v / N }
    }
    return mu
}

func varVec(feats [][]float64, mu []float64) []float64 {
    D := len(mu)
    N := float64(len(feats) - 1)
    v := make([]float64, D)
    for _, f := range feats {
        for d := range f {
            diff := f[d] - mu[d]
            v[d] += diff * diff / N
        }
    }
    return v
}

// Diagonal FID approximation
func fidDiagonal(muR, muG, varR, varG []float64) float64 {
    score := 0.0
    for d := range muR {
        diff := muR[d] - muG[d]
        cov := varR[d] + varG[d] - 2*math.Sqrt(varR[d]*varG[d])
        score += diff*diff + cov
    }
    return score
}
8. What GANs left behind
From roughly 2015 through 2020, GANs were the thing to do in generative computer vision. StyleGAN, StyleGAN2, and StyleGAN3 produced faces that were essentially indistinguishable from real photographs. BigGAN did class-conditional ImageNet generation. CycleGAN did unpaired image-to-image translation (horse ↔ zebra, photo ↔ Monet). Progressive GAN showed you could scale generation up to megapixel images by growing both networks during training.
Then, in 2020, Ho et al. published Denoising Diffusion Probabilistic Models, and in 2021–2022 diffusion models (DDPMs, DALL·E 2, Imagen, Stable Diffusion) quickly overtook GANs on almost every image-generation benchmark. The reasons are simple: diffusion training is stable (no minimax game, just a regression loss), diffusion models cover the full distribution (no mode collapse), and they scale more gracefully. Modern image generators are almost all diffusion-based.
GANs still win in a few niches — they're much faster at inference time (a single forward pass vs. diffusion's dozens of denoising steps), so they remain popular for real-time applications like face-swapping, super-resolution, and video effects. And the intuition of "adversarial training" — using one network as a loss for another — lives on in a thousand other places, from domain adaptation to robustness training. But the main line of image generation has moved on.
9. Summary
- A GAN pairs a generator (maps noise to fakes) with a discriminator (tries to distinguish real from fake). They train each other.
- The minimax objective $\min_G \max_D \, \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$ has its global optimum at $p_g = p_{\text{data}}$, where $D \equiv 1/2$.
- In practice, use the non-saturating generator loss (maximize $\log D(G(z))$) for gradient health.
- Training is unstable: mode collapse, vanishing gradients, oscillation, hyperparameter fragility.
- WGAN replaces JSD with Wasserstein distance, giving smooth gradients even when distributions don't overlap.
- GANs dominated generative CV 2015–2020. Diffusion models displaced them for most image tasks by 2022, though GANs remain fastest at inference.
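The optimum in the second bullet can be checked numerically: for a fixed $G$, the inner maximum is attained at $D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_g(x))$, and the resulting value equals $2\,\mathrm{JSD}(p_{\text{data}} \| p_g) - \log 4$, which collapses to $-\log 4$ exactly when $p_g = p_{\text{data}}$. A small sketch on toy discrete distributions (the two distributions below are arbitrary stand-ins, not anything from the chapter):

```python
import numpy as np

# Toy stand-ins for p_data and p_g over four outcomes.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

# Optimal discriminator for a fixed generator.
d_star = p_data / (p_data + p_g)

# Value of the inner max: E_data[log D*] + E_g[log(1 - D*)].
value = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# Identity: value = 2 * JSD(p_data || p_g) - log 4.
m = 0.5 * (p_data + p_g)
kl = lambda p, q: np.sum(p * np.log(p / q))
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)
assert np.isclose(value, 2 * jsd - np.log(4))

# At p_g == p_data, D* is 1/2 everywhere and the value is exactly -log 4.
assert np.allclose(p_data / (p_data + p_data), 0.5)
```

Because the JSD is non-negative, the value sits at or above $-\log 4$, touching it only when the generator has matched the data distribution.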
Further reading
- Goodfellow et al. (2014) — Generative Adversarial Nets.
- Radford, Metz & Chintala (2016) — DCGAN: Unsupervised Representation Learning with Deep Convolutional GANs.
- Arjovsky, Chintala & Bottou (2017) — Wasserstein GAN.
- Karras, Laine & Aila (2019) — StyleGAN: A Style-Based Generator Architecture.
- Brock, Donahue & Simonyan (2018) — Large Scale GAN Training (BigGAN).
- Zhu et al. (2017) — CycleGAN.