Backpropagation

The algorithm that actually trains neural networks — derived carefully, worked through with real numbers, and traced step-by-step in an interactive tiny network you can run in your browser.

Prereqs: calculus (chain rule), basic linear algebra · Time to read: ~30 min · Interactive figures: 3

1. Why we need it

A neural network is just a function. You hand it an input $x$, it produces a prediction $\hat{y}$. That function has knobs — millions or billions of them — called weights. The magic of training is finding the setting of those knobs that makes the network's predictions match the correct answers on some dataset.

Here's the problem in one sentence: given a scalar loss $L$ (how wrong the network was), how do we figure out — for every single weight $w$ in the network — which direction to nudge it to make $L$ smaller?

Mathematically, we need the partial derivative $\partial L / \partial w$ for every weight. With that, we can update each weight using gradient descent:

$$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$$

where $\eta$ is a small positive number called the learning rate. The key word in that equation is every. A modern model has hundreds of billions of weights. Naively computing each gradient by finite differences would require a separate forward pass per weight — that's a hundred billion forward passes per training step. Obviously hopeless.
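To feel that cost concretely, here is a minimal sketch of the finite-difference approach (toy quadratic loss; the function names are ours, not from any library): it needs one extra evaluation of the network per parameter, so the work scales linearly with the parameter count.

```python
import numpy as np

def finite_difference_grad(f, w, eps=1e-6):
    """Estimate dL/dw_i for every parameter by perturbing one at a time.

    Needs len(w) extra evaluations of f -- one forward pass per parameter.
    Backprop gets all the same numbers from a single backward pass.
    """
    base = f(w)
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_pert = w.copy()
        w_pert[i] += eps
        grad[i] = (f(w_pert) - base) / eps
    return grad

# Toy "loss": a quadratic bowl whose true gradient is exactly 2*w.
loss = lambda w: np.sum(w ** 2)
w = np.array([0.5, -1.0, 2.0])
print(finite_difference_grad(loss, w))  # ~[1.0, -2.0, 4.0]
```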

THE INSIGHT

Backpropagation computes all partial derivatives of a scalar output with respect to all inputs to a computation graph in a single backward pass whose cost is roughly equal to one forward pass. One forward + one backward = two passes, regardless of how many billion parameters you have. That is the reason deep learning works at scale.

Backprop was popularized in 1986 by Rumelhart, Hinton, and Williams, though variants had been discovered independently several times before. It's just the chain rule from calculus, applied cleverly. Let's build it from scratch.

2. The chain rule refresher

If you have a composition $f(g(x))$, the chain rule tells you how to differentiate it:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$

For a longer chain $f(g(h(x)))$:

$$\frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$

You multiply the local derivatives along the chain. That's the entire conceptual content of backpropagation. A neural network is a very long composition of functions — each layer is a function applied to the previous layer's output — so computing $\partial L / \partial w$ for some weight deep in the network is just multiplying local derivatives from the loss back to that weight.

Warm-up example

Let $f(x) = \sin(x^2 + 1)$. Set $u = x^2 + 1$, so $f = \sin(u)$.

$\frac{df}{du} = \cos(u)$, $\quad \frac{du}{dx} = 2x$

$\frac{df}{dx} = \cos(x^2+1) \cdot 2x$
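A quick numerical sanity check (a sketch we're adding, not part of the derivation) confirms the analytic result at an arbitrary point:

```python
import math

def f(x):
    return math.sin(x * x + 1)

def df_analytic(x):
    # Chain rule result from above: cos(x^2 + 1) * 2x
    return math.cos(x * x + 1) * 2 * x

x, eps = 0.7, 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)  # central difference
print(df_analytic(x), numeric)  # agree to roughly six decimal places
```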

Backpropagation does this same multiplication, but through a computation graph with thousands of nodes instead of two.

3. Computation graphs

Any expression can be drawn as a directed graph where nodes are operations and edges carry intermediate values. For $f(x,y) = (x+y) \cdot y$:

$$x, y \to a = x + y \to f = a \cdot y$$

To compute gradients we mark each edge with the local partial derivative and then trace backward from $f$ to whatever variable we want. The chain rule says: the derivative of $f$ with respect to a variable is the sum over all paths from that variable to $f$ of the product of local derivatives along the path.
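For $f(x,y) = (x+y)\cdot y$ the rule is easy to check by hand. Two paths lead from $y$ to $f$ — one through $a$, one directly into the product — and their contributions add:

$$\frac{\partial f}{\partial y} = \underbrace{\frac{\partial f}{\partial a}\cdot\frac{\partial a}{\partial y}}_{\text{path through } a} + \underbrace{\frac{\partial f}{\partial y}\bigg|_{a\ \text{fixed}}}_{\text{direct path}} = y \cdot 1 + a = x + 2y$$

which matches expanding $f = xy + y^2$ and differentiating directly.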

Neural networks, even billion-parameter ones, are just large computation graphs. Backpropagation = "sum over paths" implemented efficiently by traversing the graph once in reverse.

4. Our running network

To keep everything concrete, we'll use a minimal network with 2 inputs, 2 hidden units, and 1 output — exactly what you need to learn XOR. Every equation we derive generalizes directly to larger networks.

[Figure: the 2-2-1 network — inputs x₁, x₂; hidden units h₁ = σ(z¹₁) and h₂ = σ(z¹₂); output ŷ = σ(z²); one labeled weight per connection.]
A 2-2-1 fully-connected network. Two input features, two hidden units, one output. Every line is a learnable weight; every hidden/output neuron also has an implicit bias.

Notation

Superscripts denote layer: $(1)$ is the hidden layer, $(2)$ is the output layer. $z$ denotes the pre-activation (the linear combination $Wx + b$), $h$ or $\hat{y}$ denotes the activation (the output of $\sigma(z)$). A weight subscript reads "into, from": $w^{(1)}_{jk}$ connects input $x_k$ to hidden unit $h_j$, matching the matrix convention $z^{(1)} = W^{(1)} x + b^{(1)}$ used throughout.

$$\begin{aligned} z^{(1)}_1 &= w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + b^{(1)}_1 & h_1 &= \sigma(z^{(1)}_1) \\ z^{(1)}_2 &= w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + b^{(1)}_2 & h_2 &= \sigma(z^{(1)}_2) \\ z^{(2)} &= w^{(2)}_{1} h_1 + w^{(2)}_{2} h_2 + b^{(2)} & \hat{y} &= \sigma(z^{(2)}) \end{aligned}$$

For the activation we'll use the classic logistic sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

The fact that $\sigma'$ has such a clean form in terms of $\sigma$ itself is part of why it was the default activation for decades. Modern networks mostly use ReLU-family activations — we'll note how ReLU changes things at the end.
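If you haven't seen it, that clean derivative is a two-line computation. Writing $\sigma(z) = (1 + e^{-z})^{-1}$ and differentiating:

$$\sigma'(z) = \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

using $\frac{e^{-z}}{1+e^{-z}} = 1 - \sigma(z)$.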

5. Forward pass

Computing $\hat{y}$ from $x$ is just evaluating the equations above top to bottom. No calculus yet — it's plain arithmetic.

Running numbers

We'll use these inputs and initial weights throughout the page (you can change them in the interactive figure below):

$x = [0.5,\ 0.8]$,    target $y = 1$

$W^{(1)} = \begin{bmatrix} 0.2 & 0.3 \\ 0.4 & 0.1 \end{bmatrix}$,    $b^{(1)} = [0.1,\ 0.1]$    (row $j$ of $W^{(1)}$ holds the weights into hidden unit $h_j$)

$w^{(2)} = [0.5,\ 0.6]$,    $b^{(2)} = 0.2$

Interactive: step through the forward pass

Press Step → to compute each value one at a time. Edit any number in the parameter panel and the demo will restart so you can watch how the forward pass changes.


Source — Neural Net Forward Pass

Pseudocode

function forward(X, W1, b1, W2, b2):
    # Layer 1
    Z1 = W1 * X + b1          # shape: (hidden, 1)
    H1 = sigmoid(Z1)          # element-wise activation

    # Layer 2 (output)
    Z2 = W2 * H1 + b2         # shape: (1, 1)
    Y_hat = sigmoid(Z2)       # scalar output

    return Y_hat, H1, Z1, Z2

function sigmoid(z):
    return 1 / (1 + exp(-z))

Python

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """
    X:  (n_in,)  input vector
    W1: (n_hid, n_in)  first-layer weights
    b1: (n_hid,)       first-layer biases
    W2: (n_out, n_hid) second-layer weights
    b2: (n_out,)       second-layer biases
    """
    Z1 = W1 @ X + b1          # (n_hid,)
    H1 = sigmoid(Z1)           # (n_hid,)
    Z2 = W2 @ H1 + b2          # (n_out,)
    Y_hat = sigmoid(Z2)        # (n_out,)
    return Y_hat, H1, Z1, Z2

# Example: 2-2-1 network
W1 = np.array([[0.2, 0.3], [0.4, 0.1]])   # row i: weights into hidden unit i
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([0.2])
X  = np.array([0.5, 0.8])

Y_hat, H1, Z1, Z2 = forward(X, W1, b1, W2, b2)
print(f"H1={H1}, Y_hat={Y_hat}")

JavaScript

function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Matrix-vector multiply: A (rows x cols) times v (cols,) -> out (rows,)
function matVec(A, v) {
  return A.map(row => row.reduce((s, a, j) => s + a * v[j], 0));
}

function forward(X, W1, b1, W2, b2) {
  const Z1 = matVec(W1, X).map((z, i) => z + b1[i]);
  const H1 = Z1.map(sigmoid);
  const Z2 = matVec(W2, H1).map((z, i) => z + b2[i]);
  const Yhat = Z2.map(sigmoid);
  return { Yhat, H1, Z1, Z2 };
}

// Example: 2-2-1 network
const W1 = [[0.2, 0.3], [0.4, 0.1]];  // row i: weights into hidden unit i
const b1 = [0.1, 0.1];
const W2 = [[0.5, 0.6]];
const b2 = [0.2];
const X  = [0.5, 0.8];

const { Yhat, H1 } = forward(X, W1, b1, W2, b2);
console.log("H1:", H1, "Yhat:", Yhat);

C

#include <stdio.h>
#include <math.h>

#define N_IN  2
#define N_HID 2
#define N_OUT 1

double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

void forward(
    double X[N_IN],
    double W1[N_HID][N_IN], double b1[N_HID],
    double W2[N_OUT][N_HID], double b2[N_OUT],
    double Z1[N_HID], double H1[N_HID],
    double Z2[N_OUT], double Yhat[N_OUT])
{
    for (int i = 0; i < N_HID; i++) {
        Z1[i] = b1[i];
        for (int j = 0; j < N_IN; j++) Z1[i] += W1[i][j] * X[j];
        H1[i] = sigmoid(Z1[i]);
    }
    for (int i = 0; i < N_OUT; i++) {
        Z2[i] = b2[i];
        for (int j = 0; j < N_HID; j++) Z2[i] += W2[i][j] * H1[j];
        Yhat[i] = sigmoid(Z2[i]);
    }
}

int main(void) {
    double X[N_IN]           = {0.5, 0.8};
    double W1[N_HID][N_IN]   = {{0.2, 0.3}, {0.4, 0.1}};  /* row i: weights into h_i */
    double b1[N_HID]         = {0.1, 0.1};
    double W2[N_OUT][N_HID]  = {{0.5, 0.6}};
    double b2[N_OUT]         = {0.2};
    double Z1[N_HID], H1[N_HID], Z2[N_OUT], Yhat[N_OUT];

    forward(X, W1, b1, W2, b2, Z1, H1, Z2, Yhat);
    printf("H1=[%.4f, %.4f]  Yhat=%.4f\n", H1[0], H1[1], Yhat[0]);
    return 0;
}

C++

#include <cmath>
#include <vector>
#include <iostream>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

struct ForwardResult {
    std::vector<double> Yhat, H1, Z1, Z2;
};

ForwardResult forward(
    const std::vector<double>& X,
    const std::vector<std::vector<double>>& W1,
    const std::vector<double>& b1,
    const std::vector<std::vector<double>>& W2,
    const std::vector<double>& b2)
{
    int nHid = W1.size(), nIn = X.size(), nOut = W2.size();
    std::vector<double> Z1(nHid), H1(nHid), Z2(nOut), Yhat(nOut);

    for (int i = 0; i < nHid; i++) {
        Z1[i] = b1[i];
        for (int j = 0; j < nIn; j++) Z1[i] += W1[i][j] * X[j];
        H1[i] = sigmoid(Z1[i]);
    }
    for (int i = 0; i < nOut; i++) {
        Z2[i] = b2[i];
        for (int j = 0; j < nHid; j++) Z2[i] += W2[i][j] * H1[j];
        Yhat[i] = sigmoid(Z2[i]);
    }
    return {Yhat, H1, Z1, Z2};
}

int main() {
    auto [Yhat, H1, Z1, Z2] = forward(
        {0.5, 0.8},
        {{0.2, 0.3}, {0.4, 0.1}}, {0.1, 0.1},   // row i: weights into h_i
        {{0.5, 0.6}}, {0.2}
    );
    std::cout << "H1=[" << H1[0] << ", " << H1[1] << "]  Yhat=" << Yhat[0] << "\n";
}

Java

public class ForwardPass {

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Returns [Yhat, H1, Z1, Z2] flattened for simplicity
    static double[][] forward(double[] X,
                               double[][] W1, double[] b1,
                               double[][] W2, double[] b2) {
        int nHid = W1.length, nIn = X.length, nOut = W2.length;
        double[] Z1 = new double[nHid], H1 = new double[nHid];
        double[] Z2 = new double[nOut], Yhat = new double[nOut];

        for (int i = 0; i < nHid; i++) {
            Z1[i] = b1[i];
            for (int j = 0; j < nIn; j++) Z1[i] += W1[i][j] * X[j];
            H1[i] = sigmoid(Z1[i]);
        }
        for (int i = 0; i < nOut; i++) {
            Z2[i] = b2[i];
            for (int j = 0; j < nHid; j++) Z2[i] += W2[i][j] * H1[j];
            Yhat[i] = sigmoid(Z2[i]);
        }
        return new double[][] {Yhat, H1, Z1, Z2};
    }

    public static void main(String[] args) {
        double[] X  = {0.5, 0.8};
        double[][] W1 = {{0.2, 0.3}, {0.4, 0.1}};  // row i: weights into h_i
        double[] b1 = {0.1, 0.1};
        double[][] W2 = {{0.5, 0.6}};
        double[] b2 = {0.2};

        double[][] result = forward(X, W1, b1, W2, b2);
        System.out.printf("H1=[%.4f, %.4f]  Yhat=%.4f%n",
            result[1][0], result[1][1], result[0][0]);
    }
}

Go

package main

import (
    "fmt"
    "math"
)

func sigmoid(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }

type ForwardResult struct {
    Yhat, H1, Z1, Z2 []float64
}

func forward(X []float64,
    W1 [][]float64, b1 []float64,
    W2 [][]float64, b2 []float64) ForwardResult {

    nHid, nIn, nOut := len(W1), len(X), len(W2)
    Z1 := make([]float64, nHid)
    H1 := make([]float64, nHid)
    Z2 := make([]float64, nOut)
    Yhat := make([]float64, nOut)

    for i := 0; i < nHid; i++ {
        Z1[i] = b1[i]
        for j := 0; j < nIn; j++ {
            Z1[i] += W1[i][j] * X[j]
        }
        H1[i] = sigmoid(Z1[i])
    }
    for i := 0; i < nOut; i++ {
        Z2[i] = b2[i]
        for j := 0; j < nHid; j++ {
            Z2[i] += W2[i][j] * H1[j]
        }
        Yhat[i] = sigmoid(Z2[i])
    }
    return ForwardResult{Yhat, H1, Z1, Z2}
}

func main() {
    r := forward(
        []float64{0.5, 0.8},
        [][]float64{{0.2, 0.3}, {0.4, 0.1}}, []float64{0.1, 0.1}, // row i: weights into h_i
        [][]float64{{0.5, 0.6}}, []float64{0.2},
    )
    fmt.Printf("H1=%v  Yhat=%v\n", r.H1, r.Yhat)
}

6. The loss function

Once we have a prediction $\hat{y}$, we measure how wrong it is with a loss function. For pedagogical clarity we'll use the squared error, the standard choice for regression:

$$L = \frac{1}{2}(\hat{y} - y)^2$$

The factor of $\frac{1}{2}$ is there purely so the derivative comes out clean: $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$.

For classification you'd use cross-entropy loss instead, but the backward-pass machinery is identical — only the formula for $\frac{\partial L}{\partial \hat{y}}$ changes.

With our numbers

After the forward pass we computed $\hat{y} \approx 0.703$. With target $y = 1$:

$L = \frac{1}{2}(0.703 - 1)^2 = \frac{1}{2}(0.0882) \approx 0.0441$

The network got $\hat{y} = 0.703$ but wanted $y = 1$. Our job now is to adjust every weight so that the next forward pass produces a $\hat{y}$ slightly closer to 1 — and therefore a slightly smaller loss.

Source — Loss Functions
Pseudocode

function mse_loss(y_pred, y_true):
    # Mean squared error (with 1/2 factor for clean derivative)
    return 0.5 * (y_pred - y_true)^2

function mse_loss_grad(y_pred, y_true):
    # dL/dy_pred = y_pred - y_true
    return y_pred - y_true

function cross_entropy_loss(logits, y_true):
    # logits: raw pre-softmax scores, y_true: one-hot or class index
    probs = softmax(logits)
    return -sum(y_true * log(probs))   # sum over classes

function softmax(z):
    e = exp(z - max(z))                # subtract max for numerical stability
    return e / sum(e)

Python

import numpy as np

def mse_loss(y_pred, y_true):
    """Scalar MSE with 1/2 factor. y_pred, y_true are scalars or arrays."""
    return 0.5 * np.sum((y_pred - y_true) ** 2)

def mse_loss_grad(y_pred, y_true):
    """dL/dy_pred for MSE. Same shape as y_pred."""
    return y_pred - y_true

def softmax(logits):
    """Numerically stable softmax. logits: (n_classes,)"""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def cross_entropy_loss(logits, y_true):
    """
    logits: (n_classes,) raw scores
    y_true: (n_classes,) one-hot vector  OR  scalar class index
    """
    probs = softmax(logits)
    if np.isscalar(y_true) or y_true.ndim == 0:
        # class index form
        return -np.log(probs[int(y_true)] + 1e-15)
    # one-hot form
    return -np.sum(y_true * np.log(probs + 1e-15))

def cross_entropy_grad(logits, y_true):
    """
    Gradient of cross-entropy + softmax w.r.t. logits.
    Simplifies to: softmax(logits) - y_true  (famous cancellation).
    """
    probs = softmax(logits)
    if np.isscalar(y_true) or y_true.ndim == 0:
        one_hot = np.zeros_like(probs)
        one_hot[int(y_true)] = 1.0
        return probs - one_hot
    return probs - y_true

# --- Demonstrations ---
y_pred, y_true = 0.703, 1.0
print(f"MSE loss:      {mse_loss(y_pred, y_true):.4f}")
print(f"MSE gradient:  {mse_loss_grad(y_pred, y_true):.4f}")

logits = np.array([2.0, 1.0, 0.5])
y_cls  = 0  # true class is index 0
print(f"CE loss:       {cross_entropy_loss(logits, y_cls):.4f}")
print(f"CE gradient:   {cross_entropy_grad(logits, y_cls)}")

JavaScript

function mseLoss(yPred, yTrue) {
  // Scalar MSE with 1/2 factor
  const diff = yPred - yTrue;
  return 0.5 * diff * diff;
}

function mseLossGrad(yPred, yTrue) {
  return yPred - yTrue;
}

function softmax(logits) {
  const maxL = Math.max(...logits);
  const e = logits.map(x => Math.exp(x - maxL));
  const sum = e.reduce((a, b) => a + b, 0);
  return e.map(x => x / sum);
}

function crossEntropyLoss(logits, yTrue) {
  // yTrue: one-hot array or class index
  const probs = softmax(logits);
  if (typeof yTrue === 'number') {
    return -Math.log(probs[yTrue] + 1e-15);
  }
  return -yTrue.reduce((s, t, i) => s + t * Math.log(probs[i] + 1e-15), 0);
}

function crossEntropyGrad(logits, yTrue) {
  const probs = softmax(logits);
  if (typeof yTrue === 'number') {
    return probs.map((p, i) => p - (i === yTrue ? 1 : 0));
  }
  return probs.map((p, i) => p - yTrue[i]);
}

console.log("MSE loss:", mseLoss(0.703, 1.0).toFixed(4));
console.log("CE loss:", crossEntropyLoss([2.0, 1.0, 0.5], 0).toFixed(4));

C

#include <stdio.h>
#include <math.h>

double mse_loss(double y_pred, double y_true) {
    double d = y_pred - y_true;
    return 0.5 * d * d;
}

double mse_loss_grad(double y_pred, double y_true) {
    return y_pred - y_true;
}

/* Softmax in-place; n = number of classes */
void softmax(double *z, int n) {
    double max = z[0];
    for (int i = 1; i < n; i++) if (z[i] > max) max = z[i];
    double sum = 0.0;
    for (int i = 0; i < n; i++) { z[i] = exp(z[i] - max); sum += z[i]; }
    for (int i = 0; i < n; i++) z[i] /= sum;
}

/* Cross-entropy: one-hot y_true as array */
double cross_entropy_loss(double *logits, double *y_true, int n) {
    double probs[n];
    for (int i = 0; i < n; i++) probs[i] = logits[i];
    softmax(probs, n);
    double loss = 0.0;
    for (int i = 0; i < n; i++) loss -= y_true[i] * log(probs[i] + 1e-15);
    return loss;
}

int main(void) {
    printf("MSE loss: %.4f\n", mse_loss(0.703, 1.0));
    printf("MSE grad: %.4f\n", mse_loss_grad(0.703, 1.0));
    double logits[] = {2.0, 1.0, 0.5};
    double y_true[] = {1.0, 0.0, 0.0};
    printf("CE loss:  %.4f\n", cross_entropy_loss(logits, y_true, 3));
    return 0;
}

C++

#include <cmath>
#include <vector>
#include <algorithm>
#include <iostream>

double mseLoss(double yPred, double yTrue) {
    double d = yPred - yTrue;
    return 0.5 * d * d;
}

std::vector<double> softmax(const std::vector<double>& logits) {
    double maxL = *std::max_element(logits.begin(), logits.end());
    std::vector<double> e(logits.size());
    double sum = 0;
    for (size_t i = 0; i < logits.size(); i++) {
        e[i] = std::exp(logits[i] - maxL);
        sum += e[i];
    }
    for (auto& v : e) v /= sum;
    return e;
}

double crossEntropyLoss(const std::vector<double>& logits,
                         const std::vector<double>& yTrue) {
    auto probs = softmax(logits);
    double loss = 0;
    for (size_t i = 0; i < probs.size(); i++)
        loss -= yTrue[i] * std::log(probs[i] + 1e-15);
    return loss;
}

int main() {
    std::cout << "MSE loss: " << mseLoss(0.703, 1.0) << "\n";
    std::vector<double> logits{2.0, 1.0, 0.5};
    std::vector<double> yTrue{1.0, 0.0, 0.0};
    std::cout << "CE loss:  " << crossEntropyLoss(logits, yTrue) << "\n";
}

Java

public class LossFunctions {

    static double mseLoss(double yPred, double yTrue) {
        double d = yPred - yTrue;
        return 0.5 * d * d;
    }

    static double mseLossGrad(double yPred, double yTrue) {
        return yPred - yTrue;
    }

    static double[] softmax(double[] logits) {
        double max = logits[0];
        for (double v : logits) if (v > max) max = v;
        double[] e = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            e[i] = Math.exp(logits[i] - max);
            sum += e[i];
        }
        for (int i = 0; i < e.length; i++) e[i] /= sum;
        return e;
    }

    static double crossEntropyLoss(double[] logits, double[] yTrue) {
        double[] probs = softmax(logits);
        double loss = 0;
        for (int i = 0; i < probs.length; i++)
            loss -= yTrue[i] * Math.log(probs[i] + 1e-15);
        return loss;
    }

    public static void main(String[] args) {
        System.out.printf("MSE loss: %.4f%n", mseLoss(0.703, 1.0));
        double[] logits = {2.0, 1.0, 0.5};
        double[] yTrue  = {1.0, 0.0, 0.0};
        System.out.printf("CE loss:  %.4f%n", crossEntropyLoss(logits, yTrue));
    }
}

Go

package main

import (
    "fmt"
    "math"
)

func mseLoss(yPred, yTrue float64) float64 {
    d := yPred - yTrue
    return 0.5 * d * d
}

func mseLossGrad(yPred, yTrue float64) float64 {
    return yPred - yTrue
}

func softmax(logits []float64) []float64 {
    maxL := logits[0]
    for _, v := range logits {
        if v > maxL { maxL = v }
    }
    e := make([]float64, len(logits))
    sum := 0.0
    for i, v := range logits {
        e[i] = math.Exp(v - maxL)
        sum += e[i]
    }
    for i := range e { e[i] /= sum }
    return e
}

func crossEntropyLoss(logits, yTrue []float64) float64 {
    probs := softmax(logits)
    loss := 0.0
    for i := range probs {
        loss -= yTrue[i] * math.Log(probs[i]+1e-15)
    }
    return loss
}

func main() {
    fmt.Printf("MSE loss: %.4f\n", mseLoss(0.703, 1.0))
    fmt.Printf("CE loss:  %.4f\n", crossEntropyLoss(
        []float64{2.0, 1.0, 0.5},
        []float64{1.0, 0.0, 0.0},
    ))
}

7. The four fundamental equations

Backpropagation for a feedforward network can be stated as four equations. We'll derive them below; here they are stated cleanly. Let $L$ denote the index of the output layer (for our network, $L = 2$), let $\delta^{(l)}_j = \partial L / \partial z^{(l)}_j$ be the "error signal" at neuron $j$ of layer $l$, and let $\odot$ denote the element-wise product.

The four BP equations
$$\begin{aligned} \text{(BP1) Output error:} \quad & \delta^{(L)} = \nabla_{\hat{y}} L \odot \sigma'(z^{(L)}) \\ \text{(BP2) Backprop through a layer:} \quad & \delta^{(l)} = \left((W^{(l+1)})^\top \delta^{(l+1)}\right) \odot \sigma'(z^{(l)}) \\ \text{(BP3) Bias gradient:} \quad & \frac{\partial L}{\partial b^{(l)}_j} = \delta^{(l)}_j \\ \text{(BP4) Weight gradient:} \quad & \frac{\partial L}{\partial w^{(l)}_{jk}} = h^{(l-1)}_k \cdot \delta^{(l)}_j \end{aligned}$$

The four equations, every symbol decoded

$l$
Layer index. $l = 1$ is the first hidden layer, $l = L$ is the output layer.
$z^{(l)}$
Pre-activation of layer $l$ — the weighted sum before the activation function is applied. $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$.
$h^{(l)}$
Post-activation output of layer $l$: $h^{(l)} = \sigma(z^{(l)})$. Fed forward into layer $l+1$.
$\hat{y}$
Network's final prediction, equal to $h^{(L)}$.
$y$
The ground-truth target we're trying to match.
$L$
The loss — a scalar function of $\hat{y}$ and $y$ (e.g. MSE $\frac{1}{2}(\hat{y}-y)^2$ or cross-entropy). Careful: we reuse $L$ for "loss" and "total number of layers." Context always disambiguates.
$\delta^{(l)}_j$
"Error signal" at neuron $j$ of layer $l$. Defined as $\partial L / \partial z^{(l)}_j$. "If I nudged this neuron's pre-activation, how much would the loss change?" That's all a delta is.
$\sigma'(z)$
Derivative of the activation function, evaluated at the pre-activation. For sigmoid: $\sigma'(z) = \sigma(z)(1-\sigma(z))$. For ReLU: 1 if $z>0$ else 0.
$\odot$
Elementwise (Hadamard) product — multiply vectors coordinate by coordinate. Different from matrix multiply.
$W^{(l+1)}$
Weight matrix of the next layer, shape (neurons in layer l+1) × (neurons in layer l).
$(W^{(l+1)})^\top \delta^{(l+1)}$
The key matrix-vector product of backprop. It routes each downstream error back to the upstream neurons that contributed to it, weighted by the connection strength. This is literally the forward matrix multiply, run in reverse. Backprop isn't a new algorithm — it's the forward pass with the same matrices transposed.
$\nabla_{\hat{y}} L$
Gradient of the loss with respect to the final prediction. For MSE, this is simply $\hat{y} - y$. For cross-entropy with softmax, it simplifies to $\hat{y} - y$ as well (a famous cancellation).

Analogy Imagine a factory with four production lines in a row. A defect shows up in the final product (that's $\nabla_{\hat{y}} L$). You want to know which machine to blame. BP1 says: divide the final defect by how sensitive the last machine is to its input — that's your first "blame score." BP2 says: to blame the second-to-last machine, look at how much it fed into the last one (that's $W$), weighted by how much each of its outputs mattered (that's the local derivative). Multiply. Repeat all the way back to machine one. BP3 and BP4 convert blame scores into concrete knob-turn prescriptions: "twist this knob by exactly this much to reduce the defect." That's the entire algorithm.

Read them slowly. BP1 says: to get the error signal at the output layer, take the gradient of the loss with respect to the final activation and multiply by the derivative of the activation function. BP2 says: to propagate the error one layer backward, multiply by the transposed weight matrix and then by $\sigma'$. BP3 and BP4 convert the error signal into actual weight and bias gradients.

Everything else in deep learning — optimizers, regularization, dropout — builds on top of these four lines.

8. Derivation from the chain rule

8.1 BP1: the output layer

We want $\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}}$. The path from $L$ to $z^{(2)}$ goes through $\hat{y} = \sigma(z^{(2)})$. By the chain rule:

$$\delta^{(2)} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}}$$

Since $L = \frac{1}{2}(\hat{y} - y)^2$, we have $\partial L / \partial \hat{y} = \hat{y} - y$. And $\partial \hat{y} / \partial z^{(2)} = \sigma'(z^{(2)}) = \hat{y}(1 - \hat{y})$. So:

$$\delta^{(2)} = (\hat{y} - y)\,\hat{y}(1 - \hat{y})$$

8.2 BP3 and BP4 at the output layer

The output pre-activation is $z^{(2)} = w^{(2)}_1 h_1 + w^{(2)}_2 h_2 + b^{(2)}$. Differentiating with respect to $w^{(2)}_1$:

$$\frac{\partial L}{\partial w^{(2)}_1} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial w^{(2)}_1} = \delta^{(2)} \cdot h_1$$

And similarly $\partial L / \partial w^{(2)}_2 = \delta^{(2)} \cdot h_2$ and $\partial L / \partial b^{(2)} = \delta^{(2)}$ (because $\partial z^{(2)} / \partial b^{(2)} = 1$).

Notice the pattern: to compute the gradient for weight $w$ entering neuron $j$, you multiply the error signal $\delta_j$ by whatever value is flowing into that weight from the previous layer. That's BP4 in words.

8.3 BP2: propagating error backward

Now we want $\delta^{(1)}_1 = \frac{\partial L}{\partial z^{(1)}_1}$. What paths lead from $z^{(1)}_1$ to $L$? Only one — via $h_1$, then through the output neuron:

$$z^{(1)}_1 \to h_1 \to z^{(2)} \to \hat{y} \to L$$

By the chain rule:

$$\delta^{(1)}_1 = \frac{\partial L}{\partial z^{(1)}_1} = \frac{\partial L}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial h_1} \cdot \frac{\partial h_1}{\partial z^{(1)}_1}$$

Identifying the terms: $\partial L / \partial z^{(2)} = \delta^{(2)}$; $\partial z^{(2)} / \partial h_1 = w^{(2)}_1$ (because $z^{(2)}$ is linear in $h_1$); and $\partial h_1 / \partial z^{(1)}_1 = \sigma'(z^{(1)}_1)$. So:

$$\delta^{(1)}_1 = \delta^{(2)} \cdot w^{(2)}_1 \cdot \sigma'(z^{(1)}_1)$$

In a network with more than one output neuron, hidden unit $h_1$ would feed multiple output pre-activations $z^{(2)}_1, z^{(2)}_2, \dots$, so there would be multiple paths from $z^{(1)}_1$ to $L$. The chain rule says you sum over them:

$$\delta^{(l)}_k = \sigma'(z^{(l)}_k) \sum_j w^{(l+1)}_{jk} \, \delta^{(l+1)}_j$$

In matrix form, that sum is exactly the transposed weight matrix times the next layer's error — which is (BP2).

8.4 BP4 for the hidden layer

Once we have $\delta^{(1)}_1$, the weight gradients in the first layer follow the same pattern as before — the error signal times the value flowing into the weight from below:

$$\frac{\partial L}{\partial w^{(1)}_{11}} = \delta^{(1)}_1 \cdot x_1, \qquad \frac{\partial L}{\partial w^{(1)}_{12}} = \delta^{(1)}_1 \cdot x_2$$

And that's all four BP equations, derived from nothing more than the chain rule.

Why this is a "backward pass"

Observe that BP2 computes $\delta^{(l)}$ from $\delta^{(l+1)}$. That is, we need to know the error signal at layer $l+1$ before we can compute it at layer $l$. So we march from the output layer back toward the input, one layer at a time, reusing the $\delta$ we just computed. This is why we call it backpropagation — the arithmetic literally proceeds backward through the network.
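The standard way to convince yourself a backward pass is correct is a gradient check: nudge one weight, rerun the forward pass, and compare the loss change against the analytic gradient. A minimal sketch for our 2-2-1 network (function names are ours; rows of W1 hold the weights into each hidden unit, so the numbers match the worked example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, W1, b1, W2, b2):
    H1 = sigmoid(W1 @ X + b1)
    y_hat = sigmoid(W2 @ H1 + b2)[0]
    return 0.5 * (y_hat - y) ** 2

X, y = np.array([0.5, 0.8]), 1.0
W1 = np.array([[0.2, 0.3], [0.4, 0.1]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]]);             b2 = np.array([0.2])

# Analytic gradient of w^(1)_{11} via BP1, BP2, BP4
H1 = sigmoid(W1 @ X + b1)
y_hat = sigmoid(W2 @ H1 + b2)[0]
delta2 = (y_hat - y) * y_hat * (1 - y_hat)   # BP1
delta1 = (W2[0] * delta2) * H1 * (1 - H1)    # BP2
dW1 = np.outer(delta1, X)                    # BP4

# Finite-difference check on the same weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
numeric = (loss(X, y, W1p, b1, W2, b2) - loss(X, y, W1, b1, W2, b2)) / eps
print(dW1[0, 0], numeric)  # both ~ -0.0037
```

If the two numbers disagree beyond a few decimal places, the backward pass has a bug — this check catches index and transpose mistakes almost immediately.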

9. Backward pass

Now let's plug our running numbers through BP1 → BP4 and compute every gradient.

Values cached from the forward pass

$z^{(1)}_1 = 0.44, \quad h_1 = \sigma(0.44) \approx 0.6083$

$z^{(1)}_2 = 0.38, \quad h_2 = \sigma(0.38) \approx 0.5939$

$z^{(2)} \approx 0.8605, \quad \hat{y} = \sigma(0.8605) \approx 0.7029$

$L \approx 0.0441$

Step 1 — BP1 at the output

$$\delta^{(2)} = (\hat{y} - y) \cdot \hat{y}(1-\hat{y}) = (0.7029 - 1)(0.7029)(0.2971) \approx -0.0620$$

Step 2 — BP3/BP4 at the output

$$\frac{\partial L}{\partial w^{(2)}_1} = \delta^{(2)} \cdot h_1 = (-0.0620)(0.6083) \approx -0.0377$$
$$\frac{\partial L}{\partial w^{(2)}_2} = \delta^{(2)} \cdot h_2 = (-0.0620)(0.5939) \approx -0.0368$$
$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)} \approx -0.0620$$

Step 3 — BP2 for the hidden layer

$$\sigma'(z^{(1)}_1) = h_1(1-h_1) = (0.6083)(0.3917) \approx 0.2383$$
$$\delta^{(1)}_1 = \delta^{(2)} \cdot w^{(2)}_1 \cdot \sigma'(z^{(1)}_1) = (-0.0620)(0.5)(0.2383) \approx -0.00739$$
$$\sigma'(z^{(1)}_2) = h_2(1-h_2) \approx 0.2412$$
$$\delta^{(1)}_2 = (-0.0620)(0.6)(0.2412) \approx -0.00897$$

Step 4 — BP3/BP4 for the hidden layer

$$\begin{aligned} \partial L / \partial w^{(1)}_{11} &= \delta^{(1)}_1 \cdot x_1 \approx -0.00370 \\ \partial L / \partial w^{(1)}_{12} &= \delta^{(1)}_1 \cdot x_2 \approx -0.00591 \\ \partial L / \partial w^{(1)}_{21} &= \delta^{(1)}_2 \cdot x_1 \approx -0.00449 \\ \partial L / \partial w^{(1)}_{22} &= \delta^{(1)}_2 \cdot x_2 \approx -0.00717 \\ \partial L / \partial b^{(1)}_1 &= \delta^{(1)}_1 \approx -0.00739 \\ \partial L / \partial b^{(1)}_2 &= \delta^{(1)}_2 \approx -0.00897 \end{aligned}$$

All gradients are negative. That means gradient descent will increase each weight (since $w \leftarrow w - \eta \cdot (\text{negative}) $ adds a positive amount). Intuitively: the network's output fell short of the target $y = 1$, so every weight feeding the output should grow.
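We can close the loop by actually taking the gradient-descent step from Section 1. A small sketch (the learning rate $\eta = 0.5$ is our arbitrary choice): after one update, re-running the forward pass on the same example gives a slightly smaller loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_loss(X, y, W1, b1, W2, b2):
    H1 = sigmoid(W1 @ X + b1)
    y_hat = sigmoid(W2 @ H1 + b2)[0]
    return 0.5 * (y_hat - y) ** 2, H1, y_hat

X, y = np.array([0.5, 0.8]), 1.0
W1 = np.array([[0.2, 0.3], [0.4, 0.1]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]]);             b2 = np.array([0.2])

L0, H1, y_hat = forward_loss(X, y, W1, b1, W2, b2)

# Gradients from BP1-BP4 (all negative on this example)
delta2 = (y_hat - y) * y_hat * (1 - y_hat)
delta1 = (W2[0] * delta2) * H1 * (1 - H1)

# One gradient-descent step: w <- w - eta * dL/dw
eta = 0.5
W2 = W2 - eta * delta2 * H1[None, :]
b2 = b2 - eta * delta2
W1 = W1 - eta * np.outer(delta1, X)
b1 = b1 - eta * delta1

L1, _, _ = forward_loss(X, y, W1, b1, W2, b2)
print(f"loss before: {L0:.4f}  after: {L1:.4f}")  # loss goes down
```

Repeating this forward/backward/update cycle over a dataset is, in essence, the whole training loop.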

Interactive: step through the backward pass

This figure uses the same network state as the forward demo above (they share parameters). Run the forward pass first, then step through backward to watch each gradient appear.


Source — Backpropagation (Chain Rule)
Pseudocode

# Assumes forward pass already ran, storing Z1, H1, Z2, Yhat
# Loss: MSE  L = 0.5*(Yhat - y)^2

function backward(X, y, W1, b1, W2, b2, Z1, H1, Z2, Yhat):
    # BP1: output error
    dL_dYhat = Yhat - y
    dYhat_dZ2 = Yhat * (1 - Yhat)        # sigmoid derivative
    delta2 = dL_dYhat * dYhat_dZ2        # scalar (1,)

    # BP4 / BP3 at output layer
    dL_dW2 = delta2 * H1                 # (n_hid,)
    dL_db2 = delta2                      # scalar

    # BP2: propagate error to hidden layer
    dH1_dZ1 = H1 * (1 - H1)              # sigmoid derivative, element-wise
    delta1 = (transpose(W2) * delta2) * dH1_dZ1   # (n_hid,)

    # BP4 / BP3 at hidden layer
    dL_dW1 = outer(delta1, X)            # (n_hid, n_in)
    dL_db1 = delta1                      # (n_hid,)

    return dL_dW2, dL_db2, dL_dW1, dL_db1

Python

import numpy as np

def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def sig_prime(a):   return a * (1.0 - a)   # a = sigmoid(z)

def backward(X, y, W2, H1, Yhat):
    """
    Gradients for a 2-layer MLP with sigmoid activations and MSE loss.
    X:    (n_in,)  input
    y:    scalar   target
    W2:   (1, n_hid) output weights
    H1:   (n_hid,) hidden activations from forward pass
    Yhat: (1,)     output activation from forward pass
    Returns: dW2, db2, dW1, db1
    """
    # BP1
    delta2 = (Yhat - y) * sig_prime(Yhat)    # (1,)

    # BP4 / BP3 for output layer
    dW2 = delta2[:, None] * H1[None, :]      # (1, n_hid)
    db2 = delta2                              # (1,)

    # BP2: propagate to hidden layer
    delta1 = (W2.T @ delta2) * sig_prime(H1) # (n_hid,)

    # BP4 / BP3 for hidden layer
    dW1 = np.outer(delta1, X)                # (n_hid, n_in)
    db1 = delta1                             # (n_hid,)

    return dW2, db2, dW1, db1

# --- Full example ---
W1 = np.array([[0.2, 0.3], [0.4, 0.1]])   # rows = hidden units; gives H1 ≈ [0.6083, 0.5939]
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]])
b2 = np.array([0.2])
X  = np.array([0.5, 0.8])
y  = 1.0

Z1   = W1 @ X + b1
H1   = sigmoid(Z1)
Z2   = W2 @ H1 + b2
Yhat = sigmoid(Z2)

dW2, db2, dW1, db1 = backward(X, y, W2, H1, Yhat)
print("dW2:", dW2)
print("dW1:", dW1)

JavaScript
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }
function sigPrime(a) { return a * (1 - a); }

function backward(X, y, W2, H1, Yhat) {
  // BP1: output error (scalar for 1-output network)
  const delta2 = (Yhat[0] - y) * sigPrime(Yhat[0]);

  // BP4 / BP3 for output layer
  const dW2 = H1.map(h => [delta2 * h]);     // [[dw2_1], [dw2_2]] -> shape (n_hid,1)
  const db2 = [delta2];

  // BP2: propagate to hidden (W2 is [[w1, w2]])
  const delta1 = H1.map((h, j) => W2[0][j] * delta2 * sigPrime(h));

  // BP4 / BP3 for hidden layer
  const dW1 = delta1.map(d => X.map(x => d * x));  // (n_hid, n_in)
  const db1 = delta1;

  return { dW2, db2, dW1, db1 };
}

// Example
const W1 = [[0.2, 0.3], [0.4, 0.1]];  // rows = hidden units; matches the worked example
const b1 = [0.1, 0.1];
const W2 = [[0.5, 0.6]];
const b2 = [0.2];
const X  = [0.5, 0.8];
const y  = 1.0;

const Z1   = W1.map(row => row.reduce((s, w, j) => s + w * X[j], 0)).map((z, i) => z + b1[i]);
const H1   = Z1.map(sigmoid);
const Z2   = W2.map(row => row.reduce((s, w, j) => s + w * H1[j], 0)).map((z, i) => z + b2[i]);
const Yhat = Z2.map(sigmoid);

const { dW2, dW1 } = backward(X, y, W2, H1, Yhat);
console.log("dW2:", dW2, "dW1:", dW1);

C
#include <stdio.h>
#include <math.h>

#define N_IN  2
#define N_HID 2
#define N_OUT 1

double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }
double sig_prime(double a) { return a * (1.0 - a); }

void backward(
    double X[N_IN],
    double y,
    double W2[N_OUT][N_HID],
    double H1[N_HID],
    double Yhat[N_OUT],
    double dW2[N_OUT][N_HID], double db2[N_OUT],
    double dW1[N_HID][N_IN],  double db1[N_HID])
{
    double delta2[N_OUT];
    for (int i = 0; i < N_OUT; i++)
        delta2[i] = (Yhat[i] - y) * sig_prime(Yhat[i]);

    for (int i = 0; i < N_OUT; i++) {
        for (int j = 0; j < N_HID; j++) dW2[i][j] = delta2[i] * H1[j];
        db2[i] = delta2[i];
    }

    double delta1[N_HID];
    for (int j = 0; j < N_HID; j++) {
        double s = 0.0;
        for (int i = 0; i < N_OUT; i++) s += W2[i][j] * delta2[i];
        delta1[j] = s * sig_prime(H1[j]);
    }

    for (int j = 0; j < N_HID; j++) {
        for (int k = 0; k < N_IN; k++) dW1[j][k] = delta1[j] * X[k];
        db1[j] = delta1[j];
    }
}

int main(void) {
    double X[N_IN]           = {0.5, 0.8};
    double W1[N_HID][N_IN]   = {{0.2, 0.3}, {0.4, 0.1}};  /* rows = hidden units */
    double b1[N_HID]         = {0.1, 0.1};
    double W2[N_OUT][N_HID]  = {{0.5, 0.6}};
    double b2[N_OUT]         = {0.2};
    double Z1[N_HID], H1[N_HID], Z2[N_OUT], Yhat[N_OUT];

    for (int i = 0; i < N_HID; i++) {
        Z1[i] = b1[i];
        for (int j = 0; j < N_IN; j++) Z1[i] += W1[i][j] * X[j];
        H1[i] = sigmoid(Z1[i]);
    }
    for (int i = 0; i < N_OUT; i++) {
        Z2[i] = b2[i];
        for (int j = 0; j < N_HID; j++) Z2[i] += W2[i][j] * H1[j];
        Yhat[i] = sigmoid(Z2[i]);
    }

    double dW2[N_OUT][N_HID], db2[N_OUT], dW1[N_HID][N_IN], db1[N_HID];
    backward(X, 1.0, W2, H1, Yhat, dW2, db2, dW1, db1);
    printf("dW2 = [%.5f, %.5f]\n", dW2[0][0], dW2[0][1]);
    printf("dW1 = [[%.5f, %.5f], [%.5f, %.5f]]\n",
           dW1[0][0], dW1[0][1], dW1[1][0], dW1[1][1]);
    return 0;
}

C++
#include <cmath>
#include <vector>
#include <iostream>

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }
double sigPrime(double a) { return a * (1.0 - a); }

struct Grads {
    std::vector<std::vector<double>> dW2, dW1;
    std::vector<double> db2, db1;
};

Grads backward(
    const std::vector<double>& X, double y,
    const std::vector<std::vector<double>>& W2,
    const std::vector<double>& H1,
    const std::vector<double>& Yhat)
{
    int nOut = Yhat.size(), nHid = H1.size(), nIn = X.size();
    std::vector<double> delta2(nOut), delta1(nHid);
    Grads g;

    for (int i = 0; i < nOut; i++)
        delta2[i] = (Yhat[i] - y) * sigPrime(Yhat[i]);

    g.dW2.resize(nOut, std::vector<double>(nHid));
    g.db2.resize(nOut);
    for (int i = 0; i < nOut; i++) {
        for (int j = 0; j < nHid; j++) g.dW2[i][j] = delta2[i] * H1[j];
        g.db2[i] = delta2[i];
    }

    for (int j = 0; j < nHid; j++) {
        double s = 0;
        for (int i = 0; i < nOut; i++) s += W2[i][j] * delta2[i];
        delta1[j] = s * sigPrime(H1[j]);
    }

    g.dW1.resize(nHid, std::vector<double>(nIn));
    g.db1.resize(nHid);
    for (int j = 0; j < nHid; j++) {
        for (int k = 0; k < nIn; k++) g.dW1[j][k] = delta1[j] * X[k];
        g.db1[j] = delta1[j];
    }
    return g;
}

int main() {
    // Forward pass first
    std::vector<double> X{0.5, 0.8};
    std::vector<std::vector<double>> W1{{0.2, 0.3}, {0.4, 0.1}};  // rows = hidden units
    std::vector<double> b1{0.1, 0.1};
    std::vector<std::vector<double>> W2{{0.5, 0.6}};
    std::vector<double> b2{0.2};

    std::vector<double> H1(2), Yhat(1);
    for (int i = 0; i < 2; i++) {
        double z = b1[i]; for (int j = 0; j < 2; j++) z += W1[i][j]*X[j];
        H1[i] = sigmoid(z);
    }
    { double z = b2[0]; for (int j = 0; j < 2; j++) z += W2[0][j]*H1[j]; Yhat[0] = sigmoid(z); }

    auto g = backward(X, 1.0, W2, H1, Yhat);
    std::cout << "dW2=[" << g.dW2[0][0] << ", " << g.dW2[0][1] << "]\n";
}

Java
public class Backprop {

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }
    static double sigPrime(double a) { return a * (1.0 - a); }

    static double[][][] backward(
            double[] X, double y,
            double[][] W2, double[] H1, double[] Yhat) {
        int nOut = Yhat.length, nHid = H1.length, nIn = X.length;
        double[] delta2 = new double[nOut];
        double[] delta1 = new double[nHid];
        double[][] dW2 = new double[nOut][nHid];
        double[] db2 = new double[nOut];
        double[][] dW1 = new double[nHid][nIn];
        double[] db1 = new double[nHid];

        for (int i = 0; i < nOut; i++)
            delta2[i] = (Yhat[i] - y) * sigPrime(Yhat[i]);

        for (int i = 0; i < nOut; i++) {
            for (int j = 0; j < nHid; j++) dW2[i][j] = delta2[i] * H1[j];
            db2[i] = delta2[i];
        }

        for (int j = 0; j < nHid; j++) {
            double s = 0;
            for (int i = 0; i < nOut; i++) s += W2[i][j] * delta2[i];
            delta1[j] = s * sigPrime(H1[j]);
        }

        for (int j = 0; j < nHid; j++) {
            for (int k = 0; k < nIn; k++) dW1[j][k] = delta1[j] * X[k];
            db1[j] = delta1[j];
        }
        // Pack the weight gradients for return (bias gradients omitted for brevity)
        return new double[][][] {dW2, dW1};
    }

    public static void main(String[] args) {
        double[] X    = {0.5, 0.8};
        double[][] W1 = {{0.2, 0.3}, {0.4, 0.1}};  // rows = hidden units
        double[] b1   = {0.1, 0.1};
        double[][] W2 = {{0.5, 0.6}};
        double[] b2   = {0.2};

        // Forward
        double[] H1   = new double[2];
        double[] Yhat = new double[1];
        for (int i = 0; i < 2; i++) {
            double z = b1[i];
            for (int j = 0; j < 2; j++) z += W1[i][j] * X[j];
            H1[i] = sigmoid(z);
        }
        double z = b2[0];
        for (int j = 0; j < 2; j++) z += W2[0][j] * H1[j];
        Yhat[0] = sigmoid(z);

        double[][][] grads = backward(X, 1.0, W2, H1, Yhat);
        System.out.printf("dW2=[%.5f, %.5f]%n", grads[0][0][0], grads[0][0][1]);
    }
}

Go
package main

import (
    "fmt"
    "math"
)

func sigmoid(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }
func sigPrime(a float64) float64 { return a * (1.0 - a) }

type Grads struct {
    DW2, DW1     [][]float64
    Db2, Db1     []float64
}

func backward(X []float64, y float64,
    W2 [][]float64, H1 []float64, Yhat []float64) Grads {

    nOut, nHid, nIn := len(Yhat), len(H1), len(X)
    delta2 := make([]float64, nOut)
    delta1 := make([]float64, nHid)

    for i := 0; i < nOut; i++ {
        delta2[i] = (Yhat[i] - y) * sigPrime(Yhat[i])
    }

    dW2 := make([][]float64, nOut)
    db2 := make([]float64, nOut)
    for i := 0; i < nOut; i++ {
        dW2[i] = make([]float64, nHid)
        for j := 0; j < nHid; j++ { dW2[i][j] = delta2[i] * H1[j] }
        db2[i] = delta2[i]
    }

    for j := 0; j < nHid; j++ {
        s := 0.0
        for i := 0; i < nOut; i++ { s += W2[i][j] * delta2[i] }
        delta1[j] = s * sigPrime(H1[j])
    }

    dW1 := make([][]float64, nHid)
    db1 := make([]float64, nHid)
    for j := 0; j < nHid; j++ {
        dW1[j] = make([]float64, nIn)
        for k := 0; k < nIn; k++ { dW1[j][k] = delta1[j] * X[k] }
        db1[j] = delta1[j]
    }
    return Grads{dW2, dW1, db2, db1}
}

func main() {
    // Forward pass values (from running forward() first)
    H1   := []float64{0.6083, 0.5939}
    Yhat := []float64{0.7029}
    W2   := [][]float64{{0.5, 0.6}}
    X    := []float64{0.5, 0.8}

    g := backward(X, 1.0, W2, H1, Yhat)
    fmt.Printf("dW2=%v\ndW1=%v\n", g.DW2, g.DW1)
}

10. Gradient descent update

Now that we have every gradient, we apply the update rule to each weight and bias simultaneously. With learning rate $\eta = 0.5$:

$$\begin{aligned} w^{(2)}_1 &\leftarrow 0.5 - 0.5 \cdot (-0.0377) = 0.5189 \\ w^{(2)}_2 &\leftarrow 0.6 - 0.5 \cdot (-0.0368) = 0.6184 \\ b^{(2)} &\leftarrow 0.2 - 0.5 \cdot (-0.0620) = 0.2310 \end{aligned}$$

And similarly for the hidden layer. If we now run a new forward pass with these updated weights on the same input $x = [0.5, 0.8]$, we'll find $\hat{y}$ has moved slightly closer to the target $y = 1$. Repeating this for thousands of (input, target) pairs over many epochs drives the loss down across the whole dataset.
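To make the update concrete, here is a minimal NumPy sketch of one full gradient-descent step on the worked network (assuming the layer-1 weight matrix is stored with rows as hidden units, so the forward pass reproduces $H_1 \approx [0.6083, 0.5939]$). It confirms that $\hat{y}$ moves toward the target after the step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Worked-example parameters (W1 rows = hidden units) and input
W1 = np.array([[0.2, 0.3], [0.4, 0.1]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]]);             b2 = np.array([0.2])
X, y, eta = np.array([0.5, 0.8]), 1.0, 0.5

def predict(W1, b1, W2, b2):
    H1 = sigmoid(W1 @ X + b1)
    return H1, sigmoid(W2 @ H1 + b2)

H1, Yhat = predict(W1, b1, W2, b2)
before = float(Yhat[0])

# One gradient-descent step using the BP equations
delta2 = (Yhat - y) * Yhat * (1 - Yhat)
delta1 = (W2.T @ delta2) * H1 * (1 - H1)
W2 = W2 - eta * delta2[:, None] * H1[None, :]
b2 = b2 - eta * delta2
W1 = W1 - eta * np.outer(delta1, X)
b1 = b1 - eta * delta1

_, Yhat2 = predict(W1, b1, W2, b2)
after = float(Yhat2[0])
print(f"yhat before={before:.4f}  after={after:.4f}")  # after > before: moved toward y = 1
```

Every gradient is negative here, so every parameter grows and the prediction rises, exactly as the signs predicted.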

11. The full algorithm

BACKPROPAGATION (pseudocode)
# Given: network with L layers, loss L, learning rate η
# Given: training pair (x, y)

# --- Forward pass ---
h[0] = x
for l in 1..L:
    z[l] = W[l] @ h[l-1] + b[l]
    h[l] = activation(z[l])
loss = L(h[L], y)

# --- Backward pass ---
delta[L] = grad_output_loss(h[L], y) * activation_deriv(z[L])   # BP1
for l in L..1 (reverse):
    grad_W[l] = delta[l] @ h[l-1].T                              # BP4
    grad_b[l] = delta[l]                                         # BP3
    if l > 1:
        delta[l-1] = (W[l].T @ delta[l]) * activation_deriv(z[l-1])  # BP2

# --- Update ---
for l in 1..L:
    W[l] -= η * grad_W[l]
    b[l] -= η * grad_b[l]

Every deep learning framework (PyTorch, JAX, TensorFlow) implements a generalized version of this via reverse-mode automatic differentiation. You write the forward pass; the framework records the computation graph and then traverses it backward, computing local derivatives as it goes. That's why you never have to write backward passes by hand anymore — but you still have to understand what's happening when gradients vanish, explode, or otherwise misbehave.
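A cheap way to trust any backward pass, hand-written or autograd-generated, is a gradient check: compare the analytic derivative against a central finite difference. A sketch for the single output weight $w^{(2)}_1$ of the worked network (parameters as in the examples above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W2, W1, b1, b2, X, y):
    H1 = sigmoid(W1 @ X + b1)
    Yhat = sigmoid(W2 @ H1 + b2)
    return 0.5 * float((Yhat[0] - y) ** 2)

W1 = np.array([[0.2, 0.3], [0.4, 0.1]]); b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5, 0.6]]);             b2 = np.array([0.2])
X, y = np.array([0.5, 0.8]), 1.0

# Analytic gradient of L w.r.t. w2_1, via BP1 + BP4
H1 = sigmoid(W1 @ X + b1)
Yhat = sigmoid(W2 @ H1 + b2)
delta2 = (Yhat - y) * Yhat * (1 - Yhat)
analytic = float(delta2[0] * H1[0])

# Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
eps = 1e-6
Wp, Wm = W2.copy(), W2.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss(Wp, W1, b1, b2, X, y) - loss(Wm, W1, b1, b2, X, y)) / (2 * eps)

print(f"analytic={analytic:.6f}  numeric={numeric:.6f}")  # both ≈ -0.0377
```

If the two numbers disagree beyond roughly $\varepsilon^2$, the backward pass has a bug. Frameworks ship exactly this check (e.g. PyTorch's gradcheck utility).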

Source — Full Training Loop
Pseudocode
function train(dataset, W1, b1, W2, b2, lr, n_epochs, batch_size):
    for epoch in 1..n_epochs:
        shuffle(dataset)
        total_loss = 0
        for batch in mini_batches(dataset, batch_size):
            # Accumulate gradients over the batch
            dW1, db1, dW2, db2 = zeros_like(W1, b1, W2, b2)
            for (x, y) in batch:
                Yhat, H1, Z1, Z2 = forward(x, W1, b1, W2, b2)
                total_loss += mse_loss(Yhat, y)
                gW2, gb2, gW1, gb1 = backward(x, y, W2, H1, Yhat)
                dW2 += gW2
                db2 += gb2
                dW1 += gW1
                db1 += gb1

            # Average gradients and update
            m = len(batch)
            W2 -= lr * dW2 / m
            b2 -= lr * db2 / m
            W1 -= lr * dW1 / m
            b1 -= lr * db1 / m

        print(f"Epoch {epoch}  avg_loss={total_loss/len(dataset):.4f}")

Python
import numpy as np

def sigmoid(z):   return 1.0 / (1.0 + np.exp(-z))
def sig_prime(a): return a * (1.0 - a)

def forward(X, W1, b1, W2, b2):
    Z1   = W1 @ X + b1
    H1   = sigmoid(Z1)
    Z2   = W2 @ H1 + b2
    Yhat = sigmoid(Z2)
    return Yhat, H1, Z1, Z2

def backward(X, y, W2, H1, Yhat):
    delta2 = (Yhat - y) * sig_prime(Yhat)
    dW2    = delta2[:, None] * H1[None, :]
    db2    = delta2
    delta1 = (W2.T @ delta2) * sig_prime(H1)
    dW1    = np.outer(delta1, X)
    db1    = delta1
    return dW2, db2, dW1, db1

def train(X_data, y_data, W1, b1, W2, b2,
          lr=0.5, n_epochs=100, batch_size=4):
    m = len(X_data)
    for epoch in range(n_epochs):
        idx   = np.random.permutation(m)
        total_loss = 0.0
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            dW1_acc = np.zeros_like(W1)
            db1_acc = np.zeros_like(b1)
            dW2_acc = np.zeros_like(W2)
            db2_acc = np.zeros_like(b2)
            for i in batch:
                Yhat, H1, _, _ = forward(X_data[i], W1, b1, W2, b2)
                total_loss     += 0.5 * float((Yhat - y_data[i]) ** 2)
                dW2, db2, dW1, db1 = backward(X_data[i], y_data[i], W2, H1, Yhat)
                dW2_acc += dW2; db2_acc += db2
                dW1_acc += dW1; db1_acc += db1
            k = len(batch)
            W2 -= lr * dW2_acc / k;  b2 -= lr * db2_acc / k
            W1 -= lr * dW1_acc / k;  b1 -= lr * db1_acc / k
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d}  loss={total_loss/m:.5f}")
    return W1, b1, W2, b2

# Tiny demo: learn XOR-like single-output
rng = np.random.default_rng(42)
W1 = rng.normal(0, 0.5, (4, 2))
b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (1, 4))
b2 = np.zeros(1)

X_data = np.array([[0.,0.],[0.,1.],[1.,0.],[1.,1.]])
y_data = np.array([[0.],[1.],[1.],[0.]])

W1, b1, W2, b2 = train(X_data, y_data, W1, b1, W2, b2, lr=1.0, n_epochs=500)

JavaScript
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }
function sigPrime(a) { return a * (1 - a); }

function matVec(A, v) {
  return A.map(row => row.reduce((s, w, j) => s + w * v[j], 0));
}

function forward(X, W1, b1, W2, b2) {
  const Z1   = matVec(W1, X).map((z, i) => z + b1[i]);
  const H1   = Z1.map(sigmoid);
  const Z2   = matVec(W2, H1).map((z, i) => z + b2[i]);
  const Yhat = Z2.map(sigmoid);
  return { Yhat, H1 };
}

function backward(X, y, W2, H1, Yhat) {
  const delta2 = (Yhat[0] - y) * sigPrime(Yhat[0]);
  const dW2    = H1.map(h => [delta2 * h]);
  const db2    = [delta2];
  const delta1 = H1.map((h, j) => W2[0][j] * delta2 * sigPrime(h));
  const dW1    = delta1.map(d => X.map(x => d * x));
  const db1    = delta1;
  return { dW2, db2, dW1, db1 };
}

function train(Xdata, ydata, W1, b1, W2, b2, lr = 0.5, nEpochs = 200) {
  for (let ep = 0; ep < nEpochs; ep++) {
    let totalLoss = 0;
    for (let i = 0; i < Xdata.length; i++) {
      const { Yhat, H1 } = forward(Xdata[i], W1, b1, W2, b2);
      totalLoss += 0.5 * (Yhat[0] - ydata[i]) ** 2;
      const g = backward(Xdata[i], ydata[i], W2, H1, Yhat);
      // update output layer
      W2 = W2.map((row, i) => row.map((w, j) => w - lr * g.dW2[j][0]));
      b2 = b2.map((b, i) => b - lr * g.db2[i]);
      // update hidden layer
      W1 = W1.map((row, i) => row.map((w, j) => w - lr * g.dW1[i][j]));
      b1 = b1.map((b, i) => b - lr * g.db1[i]);
    }
    if ((ep + 1) % 50 === 0)
      console.log(`Epoch ${ep + 1}  loss=${(totalLoss / Xdata.length).toFixed(5)}`);
  }
  return { W1, b1, W2, b2 };
}

// XOR demo
const Xdata = [[0,0],[0,1],[1,0],[1,1]];
const ydata = [0, 1, 1, 0];
let W1 = [[0.5,-0.3],[0.1,0.8],[-0.4,0.6],[0.2,-0.5]];
let b1 = [0,0,0,0];
let W2 = [[0.3,-0.2,0.5,-0.1]];
let b2 = [0];
train(Xdata, ydata, W1, b1, W2, b2, 1.0, 500);

C
/* Compact training loop for a 2-2-1 MLP with MSE loss */
#include <stdio.h>
#include <math.h>
#include <string.h>

#define NIN 2
#define NHID 2
#define NOUT 1
#define LR 0.5

double sigmoid(double z)    { return 1.0 / (1.0 + exp(-z)); }
double sig_prime(double a)  { return a * (1.0 - a); }

void train_step(
    double X[NIN], double y,
    double W1[NHID][NIN], double b1[NHID],
    double W2[NOUT][NHID], double b2[NOUT])
{
    /* Forward */
    double Z1[NHID], H1[NHID], Z2[NOUT], Yhat[NOUT];
    for (int i = 0; i < NHID; i++) {
        Z1[i] = b1[i];
        for (int j = 0; j < NIN; j++) Z1[i] += W1[i][j]*X[j];
        H1[i] = sigmoid(Z1[i]);
    }
    for (int i = 0; i < NOUT; i++) {
        Z2[i] = b2[i];
        for (int j = 0; j < NHID; j++) Z2[i] += W2[i][j]*H1[j];
        Yhat[i] = sigmoid(Z2[i]);
    }
    /* Backward */
    double delta2[NOUT], delta1[NHID];
    for (int i = 0; i < NOUT; i++)
        delta2[i] = (Yhat[i] - y) * sig_prime(Yhat[i]);
    for (int j = 0; j < NHID; j++) {
        double s = 0;
        for (int i = 0; i < NOUT; i++) s += W2[i][j]*delta2[i];
        delta1[j] = s * sig_prime(H1[j]);
    }
    /* Update */
    for (int i = 0; i < NOUT; i++) {
        for (int j = 0; j < NHID; j++) W2[i][j] -= LR * delta2[i] * H1[j];
        b2[i] -= LR * delta2[i];
    }
    for (int j = 0; j < NHID; j++) {
        for (int k = 0; k < NIN; k++) W1[j][k] -= LR * delta1[j] * X[k];
        b1[j] -= LR * delta1[j];
    }
}

int main(void) {
    double W1[NHID][NIN] = {{0.2,0.3},{0.4,0.1}};  /* rows = hidden units */
    double b1[NHID]      = {0.1,0.1};
    double W2[NOUT][NHID]= {{0.5,0.6}};
    double b2[NOUT]      = {0.2};
    double X[NIN] = {0.5, 0.8};
    double y = 1.0;

    for (int ep = 0; ep < 200; ep++) {
        train_step(X, y, W1, b1, W2, b2);
        if ((ep+1) % 20 == 0) {
            double h1[NHID], yh = b2[0];
            for (int i=0;i<NHID;i++){double z=b1[i];for(int j=0;j<NIN;j++)z+=W1[i][j]*X[j];h1[i]=sigmoid(z);}
            for (int j=0;j<NHID;j++) yh+=W2[0][j]*h1[j];
            yh=sigmoid(yh);
            printf("Epoch %3d  yhat=%.4f  loss=%.5f\n", ep+1, yh, 0.5*(yh-y)*(yh-y));
        }
    }
    return 0;
}

C++
#include <cmath>
#include <vector>
#include <iostream>
#include <random>

double sigmoid(double z)   { return 1.0 / (1.0 + std::exp(-z)); }
double sigPrime(double a)  { return a * (1.0 - a); }

struct MLP {
    std::vector<std::vector<double>> W1, W2;
    std::vector<double> b1, b2;

    MLP(int nIn, int nHid, int nOut) {
        std::mt19937 rng(42);
        std::normal_distribution<double> nd(0, 0.5);
        W1.assign(nHid, std::vector<double>(nIn));
        b1.assign(nHid, 0);
        W2.assign(nOut, std::vector<double>(nHid));
        b2.assign(nOut, 0);
        for (auto& row : W1) for (auto& v : row) v = nd(rng);
        for (auto& row : W2) for (auto& v : row) v = nd(rng);
    }

    std::pair<std::vector<double>, std::vector<double>>
    forward(const std::vector<double>& X) {
        std::vector<double> H1(W1.size()), Yhat(W2.size());
        for (size_t i = 0; i < W1.size(); i++) {
            double z = b1[i];
            for (size_t j = 0; j < X.size(); j++) z += W1[i][j]*X[j];
            H1[i] = sigmoid(z);
        }
        for (size_t i = 0; i < W2.size(); i++) {
            double z = b2[i];
            for (size_t j = 0; j < H1.size(); j++) z += W2[i][j]*H1[j];
            Yhat[i] = sigmoid(z);
        }
        return {H1, Yhat};
    }

    void trainStep(const std::vector<double>& X, double y, double lr) {
        auto [H1, Yhat] = forward(X);
        // delta2
        std::vector<double> delta2(W2.size());
        for (size_t i=0;i<W2.size();i++) delta2[i]=(Yhat[i]-y)*sigPrime(Yhat[i]);
        // update W2, b2
        for (size_t i=0;i<W2.size();i++){
            for (size_t j=0;j<H1.size();j++) W2[i][j]-=lr*delta2[i]*H1[j];
            b2[i]-=lr*delta2[i];
        }
        // delta1
        std::vector<double> delta1(W1.size());
        for (size_t j=0;j<W1.size();j++){
            double s=0;
            for (size_t i=0;i<W2.size();i++) s+=W2[i][j]*delta2[i];
            delta1[j]=s*sigPrime(H1[j]);
        }
        // update W1, b1
        for (size_t j=0;j<W1.size();j++){
            for (size_t k=0;k<X.size();k++) W1[j][k]-=lr*delta1[j]*X[k];
            b1[j]-=lr*delta1[j];
        }
    }
};

int main() {
    MLP net(2, 4, 1);
    std::vector<std::vector<double>> Xs{{0,0},{0,1},{1,0},{1,1}};
    std::vector<double> ys{0, 1, 1, 0};
    for (int ep = 0; ep < 500; ep++) {
        double loss = 0;
        for (size_t i = 0; i < Xs.size(); i++) {
            net.trainStep(Xs[i], ys[i], 1.0);
            auto [H1, Yhat] = net.forward(Xs[i]);
            double d = Yhat[0] - ys[i]; loss += 0.5*d*d;
        }
        if ((ep+1) % 100 == 0)
            std::cout << "Epoch " << ep+1 << "  loss=" << loss/4 << "\n";
    }
}

Java
public class TrainingLoop {

    static double sigmoid(double z)  { return 1.0 / (1.0 + Math.exp(-z)); }
    static double sigPrime(double a) { return a * (1.0 - a); }

    static double[] forward1(double[] X, double[][] W, double[] b) {
        double[] out = new double[W.length];
        for (int i = 0; i < W.length; i++) {
            out[i] = b[i];
            for (int j = 0; j < X.length; j++) out[i] += W[i][j] * X[j];
            out[i] = sigmoid(out[i]);
        }
        return out;
    }

    static void trainStep(
            double[] X, double y,
            double[][] W1, double[] b1,
            double[][] W2, double[] b2, double lr) {
        double[] H1   = forward1(X, W1, b1);
        double[] Yhat = forward1(H1, W2, b2);

        double[] delta2 = new double[Yhat.length];
        for (int i = 0; i < Yhat.length; i++)
            delta2[i] = (Yhat[i] - y) * sigPrime(Yhat[i]);

        for (int i = 0; i < W2.length; i++) {
            for (int j = 0; j < H1.length; j++) W2[i][j] -= lr * delta2[i] * H1[j];
            b2[i] -= lr * delta2[i];
        }

        double[] delta1 = new double[H1.length];
        for (int j = 0; j < H1.length; j++) {
            double s = 0;
            for (int i = 0; i < W2.length; i++) s += W2[i][j] * delta2[i];
            delta1[j] = s * sigPrime(H1[j]);
        }

        for (int j = 0; j < W1.length; j++) {
            for (int k = 0; k < X.length; k++) W1[j][k] -= lr * delta1[j] * X[k];
            b1[j] -= lr * delta1[j];
        }
    }

    public static void main(String[] args) {
        double[][] W1 = {{0.2,0.3},{0.4,0.1}};  // rows = hidden units
        double[] b1   = {0.1, 0.1};
        double[][] W2 = {{0.5, 0.6}};
        double[] b2   = {0.2};
        double[] X    = {0.5, 0.8};
        double y      = 1.0;

        for (int ep = 1; ep <= 200; ep++) {
            trainStep(X, y, W1, b1, W2, b2, 0.5);
            if (ep % 20 == 0) {
                double[] H1   = forward1(X, W1, b1);
                double[] Yhat = forward1(H1, W2, b2);
                System.out.printf("Epoch %3d  yhat=%.4f%n", ep, Yhat[0]);
            }
        }
    }
}

Go
package main

import (
    "fmt"
    "math"
)

func sigmoid(z float64) float64  { return 1.0 / (1.0 + math.Exp(-z)) }
func sigPrime(a float64) float64 { return a * (1.0 - a) }

func forwardLayer(X []float64, W [][]float64, b []float64) []float64 {
    out := make([]float64, len(W))
    for i, row := range W {
        out[i] = b[i]
        for j, w := range row { out[i] += w * X[j] }
        out[i] = sigmoid(out[i])
    }
    return out
}

func trainStep(X []float64, y float64,
    W1 [][]float64, b1 []float64,
    W2 [][]float64, b2 []float64, lr float64) {

    H1   := forwardLayer(X, W1, b1)
    Yhat := forwardLayer(H1, W2, b2)

    // delta2
    delta2 := make([]float64, len(Yhat))
    for i := range Yhat { delta2[i] = (Yhat[i] - y) * sigPrime(Yhat[i]) }

    // update W2, b2
    for i := range W2 {
        for j := range H1 { W2[i][j] -= lr * delta2[i] * H1[j] }
        b2[i] -= lr * delta2[i]
    }

    // delta1
    delta1 := make([]float64, len(H1))
    for j := range H1 {
        s := 0.0
        for i := range W2 { s += W2[i][j] * delta2[i] }
        delta1[j] = s * sigPrime(H1[j])
    }

    // update W1, b1
    for j := range W1 {
        for k := range X { W1[j][k] -= lr * delta1[j] * X[k] }
        b1[j] -= lr * delta1[j]
    }
}

func main() {
    W1 := [][]float64{{0.2, 0.3}, {0.4, 0.1}} // rows = hidden units
    b1 := []float64{0.1, 0.1}
    W2 := [][]float64{{0.5, 0.6}}
    b2 := []float64{0.2}
    X  := []float64{0.5, 0.8}
    y  := 1.0

    for ep := 1; ep <= 200; ep++ {
        trainStep(X, y, W1, b1, W2, b2, 0.5)
        if ep%20 == 0 {
            H1   := forwardLayer(X, W1, b1)
            Yhat := forwardLayer(H1, W2, b2)
            fmt.Printf("Epoch %3d  yhat=%.4f\n", ep, Yhat[0])
        }
    }
}

12. Multi-iteration training demo

Let's put it all together. This figure performs full training iterations (forward + backward + update) on the same network, showing how the loss decreases and the output $\hat{y}$ drifts toward the target. Each "Step" is one complete iteration.

▸ Training — watch the loss decrease (plots the output ŷ and the loss $L$ over iterations)

Press Step → to run one full training iteration. The loss graph shows how $L$ evolves. Try different learning rates to see oscillation, slow convergence, or divergence.

13. Pitfalls

Vanishing gradients

Notice that every backward step multiplies by $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which is at most $\frac{1}{4}$ — and is much smaller when $z$ is far from 0. In a deep network, the gradient at the first layer is a product of many such small numbers, and it shrinks exponentially with depth. This is why deep networks famously didn't train for decades. Fixes: ReLU-family activations (whose derivative is 1 on the active side), residual connections, careful weight initialization, and normalization layers.
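A sketch of how fast this compounds: even in the best case ($z = 0$, where $\sigma'(z)$ hits its maximum of $1/4$), a stack of $D$ sigmoid layers attenuates the gradient by at least $4^{-D}$:

```python
# Best-case attenuation: each sigmoid layer multiplies the backward
# signal by sigma'(z) <= 0.25, so D layers shrink it by at least 4**-D.
for depth in (5, 10, 20, 50):
    print(f"depth {depth:2d}: gradient factor <= {0.25 ** depth:.2e}")
```

At depth 20 the best-case factor is already below $10^{-12}$; in practice activations saturate and the true factors are far smaller still.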

Exploding gradients

The opposite failure: the product of backward multipliers is too large and gradients grow without bound, causing NaNs. Fixes: gradient clipping ("if $\|\nabla\| > c$, rescale to $c$"), smaller learning rates, better initialization.
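Gradient clipping itself is only a few lines. A sketch of the global-norm variant quoted above, assuming the gradients arrive as a list of NumPy arrays:

```python
import numpy as np

def clip_by_global_norm(grads, c):
    """If the global norm of all gradients exceeds c, rescale them to have norm c."""
    norm = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if norm > c:
        grads = [g * (c / norm) for g in grads]
    return grads, norm

# A gradient with global norm 5 gets rescaled to norm 1
clipped, raw_norm = clip_by_global_norm([np.array([3.0, 4.0])], c=1.0)
print(raw_norm, clipped[0])  # 5.0 [0.6 0.8]
```

Note the norm is computed over all parameters jointly, so the direction of the update is preserved; only its length is capped.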

Dead ReLUs

If a ReLU's input goes strongly negative, its gradient is zero, and the weights feeding it receive no update — forever. The neuron is "dead." Leaky ReLU, GELU, and SiLU/Swish activations mitigate this by keeping a small nonzero gradient in the negative region.
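The difference is visible directly in the derivatives. A toy comparison (the Leaky ReLU slope α = 0.01 here is a common but arbitrary choice):

```python
def relu_grad(z):
    """Derivative of max(0, z)."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: small slope alpha for negative inputs."""
    return 1.0 if z > 0 else alpha

z = -2.3  # strongly negative pre-activation
print(relu_grad(z), leaky_relu_grad(z))  # 0.0 0.01 (ReLU passes nothing back; Leaky ReLU keeps a trickle)
```

That trickle of gradient is what lets a "dead" neuron drift back into the active region over subsequent updates.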

You shouldn't need to debug backprop itself

Every modern framework implements autograd. You almost never write a backward pass by hand. What you do debug is the effects of backprop: loss curves, gradient norms, activation statistics. The goal of understanding backprop is to read those signals.

14. Summary

ONE-PARAGRAPH SUMMARY

Backpropagation computes $\partial L / \partial w$ for every weight in a network in a single pass that costs about as much as one forward pass. It works by recognizing that the network is a composition of functions, so the chain rule applies: if you start at the loss and march backward, multiplying local derivatives as you go, you end up with the gradient of the loss with respect to every intermediate quantity, including every weight. The four BP equations (output error, backprop through a layer, bias gradient, weight gradient) are the recipe. Everything else in deep learning — optimizers, architectures, regularization — builds on top of that primitive.

Where to go next