RNNs & LSTMs
How do you fit a variable-length sequence into a fixed-size neural network? The answer that held from 1986 until the Transformer dethroned it in 2017 was: keep a running state, and update it one step at a time.
1. Why sequences need a different model
A feedforward network takes a fixed-size vector in and produces a fixed-size vector out. That is fine for a 28×28 MNIST digit (always 784 pixels) or a tabular row (always the same columns), but most of what humans actually produce — speech, language, music, DNA, stock prices — arrives as a sequence of unknown length. You can't just flatten a sentence into a vector: a three-word sentence and a thirty-word sentence have genuinely different shapes, and padding everything to the longest sentence in the corpus wastes enormous amounts of compute on blank space.
The insight behind recurrent networks is that a sequence can be processed one element at a time if you carry a hidden state forward from step to step. At each step, the network sees the current input plus the hidden state produced by the previous step, and produces both an output and a new hidden state. The hidden state is a summary of everything the network has seen so far, compressed into a fixed-size vector.
Unroll a loop. A recurrent network is a single small neural network, applied over and over with shared weights, passing a running summary forward. The "depth" of the computation grows with the length of the sequence, but the number of learnable parameters stays constant.
2. The vanilla RNN
The simplest recurrent network is due to Elman (1990), though the idea had been floating around since Rumelhart, Hinton & Williams (1986). At every time step $t$, it computes:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
What each symbol means
- $x_t$
- Input vector at time step $t$. For text, this is usually a word or token embedding; for audio, a frame of features. The sequence's "next thing to look at" — one element of the input stream.
- $h_t$
- Hidden state at time $t$. A fixed-size vector (size is a hyperparameter, e.g. 256 or 1024). The network's running memory of what it has seen so far.
- $h_{t-1}$
- Hidden state from the previous step. At $t=0$ this is usually the zero vector. The baton being handed forward through time.
- $W_{hh}$
- Square matrix that mixes the old hidden state into the new one. Shape: hidden × hidden. "How does what I already remember influence what I now think?"
- $W_{xh}$
- Matrix that projects the input into the hidden space. Shape: hidden × input. "How does this new token change my memory?"
- $W_{hy}$
- Matrix that reads the hidden state out into an output. Shape: output × hidden. "Given my current state, what is the prediction?"
- $b_h, b_y$
- Bias vectors — a per-unit constant offset, so a neuron can fire even with zero input.
- $\tanh$
- The hyperbolic tangent, applied elementwise. Squashes each coordinate into $(-1, 1)$. Keeps the hidden state bounded so it can't explode over hundreds of steps.
Analogy Imagine reading a novel with a notepad. $x_t$ is the next sentence. $h_{t-1}$ is your notes so far. $W_{xh} x_t$ says "here is what this new sentence adds." $W_{hh} h_{t-1}$ says "here is how my existing notes get reshuffled." You add them, pass through $\tanh$ ("keep everything on a sensible scale"), and that's your updated notepad $h_t$. To guess what happens next in the plot, you look at your notes through the lens $W_{hy}$.
Critically, the same matrices $W_{hh}, W_{xh}, W_{hy}$ are used at every time step. That's weight sharing across time, and it's the reason a single RNN can handle sequences of any length: the number of parameters doesn't grow with the length of the input.
Source — RNN Forward Pass
// RNN cell: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b_h)
// y_t = W_y * h_t + b_y
function rnn_cell(x_t, h_prev, W_x, W_h, b_h):
return tanh(matmul(W_h, h_prev) + matmul(W_x, x_t) + b_h)
function rnn_forward(xs, h0, W_x, W_h, W_y, b_h, b_y):
h = h0
ys = []
hs = []
for x_t in xs:
h = rnn_cell(x_t, h, W_x, W_h, b_h)
y = matmul(W_y, h) + b_y
ys.append(y)
hs.append(h)
return ys, hs

Python

import math
def tanh_vec(v):
return [math.tanh(z) for z in v]
def matvec(W, x):
return [sum(Wij * xj for Wij, xj in zip(row, x)) for row in W]
def vec_add(*vecs):
return [sum(vals) for vals in zip(*vecs)]
def rnn_cell(x_t, h_prev, Wxh, Whh, bh):
"""Single RNN step. Returns new hidden state h_t."""
return tanh_vec(vec_add(matvec(Whh, h_prev), matvec(Wxh, x_t), bh))
def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
"""Unrolled RNN forward pass over a sequence of inputs.
xs: list of input vectors
h0: initial hidden state (zeros usually)
Returns (list of outputs, list of hidden states).
"""
h = h0
ys, hs = [], []
for x_t in xs:
h = rnn_cell(x_t, h, Wxh, Whh, bh)
y = vec_add(matvec(Why, h), by)
ys.append(y)
hs.append(h)
return ys, hs

JavaScript

function matvec(W, x) {
return W.map(row => row.reduce((s, w, j) => s + w * x[j], 0));
}
function vecAdd(a, b) { return a.map((v, i) => v + b[i]); }
function rnnCell(xt, hPrev, Wxh, Whh, bh) {
const z = vecAdd(vecAdd(matvec(Whh, hPrev), matvec(Wxh, xt)), bh);
return z.map(v => Math.tanh(v));
}
function rnnForward(xs, h0, Wxh, Whh, Why, bh, by) {
let h = h0.slice();
const ys = [], hs = [];
for (const xt of xs) {
h = rnnCell(xt, h, Wxh, Whh, bh);
ys.push(vecAdd(matvec(Why, h), by));
hs.push(h.slice());
}
return { ys, hs };
}

C

#include <math.h>
#include <string.h>
/* Single RNN step. I=input dim, H=hidden dim.
Wxh[H*I], Whh[H*H], bh[H], h_out[H]. */
void rnn_cell(const float *x, const float *h_prev,
const float *Wxh, const float *Whh, const float *bh,
int I, int H, float *h_out) {
for (int k = 0; k < H; k++) {
float z = bh[k];
for (int j = 0; j < H; j++) z += Whh[k*H + j] * h_prev[j];
for (int j = 0; j < I; j++) z += Wxh[k*I + j] * x[j];
h_out[k] = tanhf(z);
}
}
/* Full forward pass. xs[T*I], ys_out[T*O], hs_out[T*H]. */
void rnn_forward(const float *xs, int T, int I,
const float *Wxh, const float *Whh,
const float *Why, const float *bh, const float *by,
int H, int O,
float *ys_out, float *hs_out) {
float h[H];
memset(h, 0, sizeof(float) * H);
for (int t = 0; t < T; t++) {
float h_new[H];
rnn_cell(xs + t*I, h, Wxh, Whh, bh, I, H, h_new);
memcpy(h, h_new, sizeof(float) * H);
memcpy(hs_out + t*H, h, sizeof(float) * H);
for (int k = 0; k < O; k++) {
float z = by[k];
for (int j = 0; j < H; j++) z += Why[k*H + j] * h[j];
ys_out[t*O + k] = z;
}
}
}

C++

#include <vector>
#include <cmath>
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;
Vec matvec(const Mat& W, const Vec& x) {
Vec out(W.size(), 0.0f);
for (size_t i = 0; i < W.size(); i++)
for (size_t j = 0; j < x.size(); j++)
out[i] += W[i][j] * x[j];
return out;
}
Vec rnnCell(const Vec& xt, const Vec& hPrev,
const Mat& Wxh, const Mat& Whh, const Vec& bh) {
Vec z(bh.size());
auto a = matvec(Whh, hPrev), b = matvec(Wxh, xt);
for (size_t i = 0; i < z.size(); i++)
z[i] = std::tanh(a[i] + b[i] + bh[i]);
return z;
}
std::pair<std::vector<Vec>, std::vector<Vec>>
rnnForward(const std::vector<Vec>& xs, Vec h,
const Mat& Wxh, const Mat& Whh,
const Mat& Why, const Vec& bh, const Vec& by) {
std::vector<Vec> ys, hs;
for (const auto& xt : xs) {
h = rnnCell(xt, h, Wxh, Whh, bh);
auto y = matvec(Why, h);
for (size_t i = 0; i < y.size(); i++) y[i] += by[i];
ys.push_back(y);
hs.push_back(h);
}
return {ys, hs};
}

Java

public class VanillaRNN {
private static double[] matvec(double[][] W, double[] x) {
double[] out = new double[W.length];
for (int i = 0; i < W.length; i++)
for (int j = 0; j < x.length; j++)
out[i] += W[i][j] * x[j];
return out;
}
public static double[] rnnCell(double[] xt, double[] hPrev,
double[][] Wxh, double[][] Whh, double[] bh) {
double[] a = matvec(Whh, hPrev), b = matvec(Wxh, xt);
double[] h = new double[bh.length];
for (int i = 0; i < h.length; i++)
h[i] = Math.tanh(a[i] + b[i] + bh[i]);
return h;
}
public static double[][][] forward(double[][] xs, double[] h0,
double[][] Wxh, double[][] Whh,
double[][] Why, double[] bh, double[] by) {
double[] h = h0.clone();
double[][] ys = new double[xs.length][], hs = new double[xs.length][];
for (int t = 0; t < xs.length; t++) {
h = rnnCell(xs[t], h, Wxh, Whh, bh);
hs[t] = h.clone();
double[] y = matvec(Why, h);
for (int i = 0; i < y.length; i++) y[i] += by[i];
ys[t] = y;
}
return new double[][][] {ys, hs};
}
}

Go

import "math"
func matvec(W [][]float64, x []float64) []float64 {
out := make([]float64, len(W))
for i, row := range W {
for j, wij := range row {
out[i] += wij * x[j]
}
}
return out
}
func rnnCell(xt, hPrev []float64, Wxh, Whh [][]float64, bh []float64) []float64 {
a, b := matvec(Whh, hPrev), matvec(Wxh, xt)
h := make([]float64, len(bh))
for i := range h {
h[i] = math.Tanh(a[i] + b[i] + bh[i])
}
return h
}
func RNNForward(xs [][]float64, h0 []float64,
Wxh, Whh, Why [][]float64, bh, by []float64) ([][]float64, [][]float64) {
h := append([]float64{}, h0...)
var ys, hs [][]float64
for _, xt := range xs {
h = rnnCell(xt, h, Wxh, Whh, bh)
y := matvec(Why, h)
for i := range y { y[i] += by[i] }
ys = append(ys, y)
hs = append(hs, append([]float64{}, h...))
}
return ys, hs
}

3. Backprop through time (BPTT)
How do you train such a thing? The same way you train any network: compute a loss, take gradients, update the weights. The wrinkle is that a parameter like $W_{hh}$ is used at every step, so when you differentiate the total loss, you have to sum contributions across every step it touched. This is called backpropagation through time.
Let $\mathcal{L} = \sum_t \ell_t(y_t, y_t^*)$ be the total loss over the sequence. Then:

$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial h_k}{\partial W_{hh}}$$
Reading the BPTT formula
- $T$
- The length of the sequence.
- $\ell_t$
- Per-step loss (e.g. cross-entropy between $y_t$ and the true label $y_t^*$).
- $\dfrac{\partial \ell_t}{\partial h_t}$
- "If I nudged the hidden state at step $t$, how much would the step-$t$ loss change?"
- $\dfrac{\partial h_j}{\partial h_{j-1}}$
- The Jacobian of one step of the recurrence — how the hidden state at step $j$ depends on the hidden state at step $j-1$. For a vanilla RNN, this equals $\text{diag}(\tanh'(z_j)) \, W_{hh}$, where $z_j$ is the pre-activation; its transpose, $W_{hh}^{\top} \, \text{diag}(\tanh'(z_j))$, is the factor applied to the gradient in the backward pass.
- $\prod_{j=k+1}^{t} \dfrac{\partial h_j}{\partial h_{j-1}}$
- A product of $t-k$ Jacobians — one for every step the error has to "travel" from $t$ back to $k$. This is the term that blows up or collapses. Keep reading.
Analogy Think of gradient signal like a whisper passed down a line of people. Each person garbles the message a bit (multiplies it by their local Jacobian). After 50 people the original whisper is either deafening (exploded) or inaudible (vanished). BPTT is the whisper game, and the Jacobian product is why the message rarely survives intact over long distances.
Source — BPTT
// Truncated BPTT for a sequence of length T.
// Loss: mean squared error between y_t and targets[t].
forward_pass(xs, h0) -> (ys, hs) // store all hidden states
dWhh = 0, dWxh = 0, dWhy = 0
dbh = 0, dby = 0
dh_next = zeros(H)
for t = T-1 downto 0:
dy = 2 * (ys[t] - targets[t]) / T // dL/dy_t
dWhy += outer(dy, hs[t])
dby += dy
dh = matmul(Why.T, dy) + dh_next // gradient into hidden state
dh_raw = dh * (1 - hs[t]^2) // tanh derivative
dbh += dh_raw
dWxh += outer(dh_raw, xs[t])
dWhh += outer(dh_raw, hs[t-1]) // hs[-1] = h0
dh_next = matmul(Whh.T, dh_raw)
clip gradients to [-clip, clip] // prevent explosion
update: W -= lr * dW for each W

Python

import math
def clip(v, threshold=5.0):
return max(-threshold, min(threshold, v))
def bptt(xs, targets, Wxh, Whh, Why, bh, by, h0, lr=1e-2):
"""Minimal truncated BPTT for a vanilla RNN.
xs, targets: lists of vectors (same length T).
Returns updated (Wxh, Whh, Why, bh, by).
"""
T = len(xs)
H = len(bh)
O = len(by)
# Forward pass — store hidden states
hs = [None] * (T + 1)
ys = [None] * T
hs[-1] = h0[:]
for t in range(T):
z = [sum(Whh[k][j] * hs[t-1][j] for j in range(H)) +
sum(Wxh[k][j] * xs[t][j] for j in range(len(xs[t]))) +
bh[k] for k in range(H)]
hs[t] = [math.tanh(z[k]) for k in range(H)]
ys[t] = [sum(Why[k][j] * hs[t][j] for j in range(H)) + by[k]
for k in range(O)]
# Backward pass
dWhh = [[0.0]*H for _ in range(H)]
dWxh = [[0.0]*len(xs[0]) for _ in range(H)]
dWhy = [[0.0]*H for _ in range(O)]
dbh = [0.0] * H
dby = [0.0] * O
dh_next = [0.0] * H
for t in reversed(range(T)):
dy = [(ys[t][k] - targets[t][k]) * 2.0 / T for k in range(O)]
for k in range(O):
dby[k] += dy[k]
for j in range(H):
dWhy[k][j] += dy[k] * hs[t][j]
dh = [sum(Why[k][j] * dy[k] for k in range(O)) + dh_next[j]
for j in range(H)]
dh_raw = [dh[k] * (1.0 - hs[t][k] ** 2) for k in range(H)]
for k in range(H):
dbh[k] += dh_raw[k]
for j in range(len(xs[t])):
dWxh[k][j] += dh_raw[k] * xs[t][j]
for j in range(H):
dWhh[k][j] += dh_raw[k] * hs[t-1][j]
dh_next = [sum(Whh[j][k] * dh_raw[j] for j in range(H))
for k in range(H)]
# Gradient clipping + SGD update (in-place)
for W, dW in [(Whh, dWhh), (Wxh, dWxh), (Why, dWhy)]:
for i in range(len(W)):
for j in range(len(W[i])):
W[i][j] -= lr * clip(dW[i][j])
for b, db in [(bh, dbh), (by, dby)]:
for i in range(len(b)):
b[i] -= lr * clip(db[i])
return Wxh, Whh, Why, bh, by

JavaScript

function bptt(xs, targets, params, lr = 1e-2, clip = 5) {
const { Wxh, Whh, Why, bh, by } = params;
const T = xs.length, H = bh.length, O = by.length;
// Forward
const hs = [new Array(H).fill(0)];
const ys = [];
for (let t = 0; t < T; t++) {
const z = bh.map((b, k) =>
Whh[k].reduce((s, w, j) => s + w * hs[t][j], 0) +
Wxh[k].reduce((s, w, j) => s + w * xs[t][j], 0) + b);
hs.push(z.map(Math.tanh));
ys.push(Why.map((row, k) => row.reduce((s, w, j) => s + w * hs[t+1][j], 0) + by[k]));
}
// Backward
const dWhh = Whh.map(r => r.map(() => 0));
const dWxh = Wxh.map(r => r.map(() => 0));
const dWhy = Why.map(r => r.map(() => 0));
const dbh = new Array(H).fill(0), dby = new Array(O).fill(0);
let dhNext = new Array(H).fill(0);
for (let t = T - 1; t >= 0; t--) {
const dy = ys[t].map((y, k) => 2 * (y - targets[t][k]) / T);
dy.forEach((d, k) => {
dby[k] += d;
Why[k].forEach((_, j) => { dWhy[k][j] += d * hs[t+1][j]; });
});
const dh = hs[t+1].map((_, j) =>
Why.reduce((s, row, k) => s + row[j] * dy[k], 0) + dhNext[j]);
const dhRaw = dh.map((d, k) => d * (1 - hs[t+1][k] ** 2));
dhRaw.forEach((d, k) => {
dbh[k] += d;
xs[t].forEach((x, j) => { dWxh[k][j] += d * x; });
hs[t].forEach((h, j) => { dWhh[k][j] += d * h; });
});
dhNext = hs[t].map((_, j) =>
Whh.reduce((s, row, k) => s + row[j] * dhRaw[k], 0));
}
const c = v => Math.max(-clip, Math.min(clip, v));
const update = (W, dW) => W.forEach((r, i) => r.forEach((_, j) => { W[i][j] -= lr * c(dW[i][j]); }));
  update(Whh, dWhh); update(Wxh, dWxh); update(Why, dWhy);
  const updateVec = (b, db) => { for (let j = 0; j < b.length; j++) b[j] -= lr * c(db[j]); };
  updateVec(bh, dbh); updateVec(by, dby);
return params;
}

C

/* Truncated BPTT — conceptual C sketch.
For brevity, matrix allocs use VLAs; clip applied per gradient. */
#include <math.h>
#include <string.h>
static float clamp(float v, float t) {
return v > t ? t : (v < -t ? -t : v);
}
/* Assumes forward pass has already stored hs[0..T] and ys[0..T-1].
Updates Whh, Wxh, Why, bh, by in-place. */
void bptt_update(
const float *xs, /* [T*I] */
const float *targets, /* [T*O] */
const float *hs, /* [(T+1)*H] — hs[0] = h0 */
const float *ys, /* [T*O] */
float *Wxh, float *Whh, float *Why,
float *bh, float *by,
int T, int I, int H, int O,
float lr, float clip_val) {
float dWhh[H*H], dWxh[H*I], dWhy[O*H], dbh[H], dby[O];
memset(dWhh, 0, sizeof dWhh); memset(dWxh, 0, sizeof dWxh);
memset(dWhy, 0, sizeof dWhy); memset(dbh, 0, sizeof dbh);
memset(dby, 0, sizeof dby);
float dh_next[H];
memset(dh_next, 0, sizeof dh_next);
for (int t = T-1; t >= 0; t--) {
const float *ht = hs + (t+1)*H;
const float *htm = hs + t*H;
const float *xt = xs + t*I;
const float *tgt = targets + t*O;
const float *yt = ys + t*O;
float dy[O];
for (int k = 0; k < O; k++) {
dy[k] = 2.0f * (yt[k] - tgt[k]) / T;
dby[k] += dy[k];
for (int j = 0; j < H; j++) dWhy[k*H+j] += dy[k] * ht[j];
}
float dh[H];
for (int j = 0; j < H; j++) {
dh[j] = dh_next[j];
for (int k = 0; k < O; k++) dh[j] += Why[k*H+j] * dy[k];
}
float dh_raw[H];
for (int k = 0; k < H; k++) {
dh_raw[k] = dh[k] * (1.0f - ht[k]*ht[k]);
dbh[k] += dh_raw[k];
for (int j = 0; j < I; j++) dWxh[k*I+j] += dh_raw[k] * xt[j];
for (int j = 0; j < H; j++) dWhh[k*H+j] += dh_raw[k] * htm[j];
}
for (int j = 0; j < H; j++) {
dh_next[j] = 0;
for (int k = 0; k < H; k++) dh_next[j] += Whh[k*H+j] * dh_raw[k];
}
}
/* SGD update with gradient clipping */
for (int i = 0; i < H*H; i++) Whh[i] -= lr * clamp(dWhh[i], clip_val);
for (int i = 0; i < H*I; i++) Wxh[i] -= lr * clamp(dWxh[i], clip_val);
for (int i = 0; i < O*H; i++) Why[i] -= lr * clamp(dWhy[i], clip_val);
for (int i = 0; i < H; i++) bh[i] -= lr * clamp(dbh[i], clip_val);
for (int i = 0; i < O; i++) by[i] -= lr * clamp(dby[i], clip_val);
}

C++

// C++ BPTT — mirrors the pseudocode closely using std::vector.
// Forward pass is assumed; this function takes stored hs and ys.
#include <vector>
#include <cmath>
#include <algorithm>
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;
static float clamp(float v, float t) { return std::max(-t, std::min(t, v)); }
void bptt(const std::vector<Vec>& xs, const std::vector<Vec>& targets,
const std::vector<Vec>& hs, // hs[0]=h0, hs[1..T]=step outputs
const std::vector<Vec>& ys,
Mat& Wxh, Mat& Whh, Mat& Why,
Vec& bh, Vec& by,
float lr = 1e-2f, float clipVal = 5.0f) {
int T = xs.size(), H = bh.size(), O = by.size();
Mat dWhh(H, Vec(H, 0)), dWxh(H, Vec(xs[0].size(), 0)), dWhy(O, Vec(H, 0));
Vec dbh(H, 0), dby(O, 0), dhNext(H, 0);
for (int t = T-1; t >= 0; t--) {
Vec dy(O);
for (int k = 0; k < O; k++) {
dy[k] = 2.0f * (ys[t][k] - targets[t][k]) / T;
dby[k] += dy[k];
for (int j = 0; j < H; j++) dWhy[k][j] += dy[k] * hs[t+1][j];
}
Vec dh(H, 0);
for (int j = 0; j < H; j++) {
dh[j] = dhNext[j];
for (int k = 0; k < O; k++) dh[j] += Why[k][j] * dy[k];
}
Vec dhRaw(H);
for (int k = 0; k < H; k++) {
dhRaw[k] = dh[k] * (1.0f - hs[t+1][k] * hs[t+1][k]);
dbh[k] += dhRaw[k];
for (size_t j = 0; j < xs[t].size(); j++) dWxh[k][j] += dhRaw[k] * xs[t][j];
for (int j = 0; j < H; j++) dWhh[k][j] += dhRaw[k] * hs[t][j];
}
std::fill(dhNext.begin(), dhNext.end(), 0);
for (int j = 0; j < H; j++)
for (int k = 0; k < H; k++)
dhNext[j] += Whh[k][j] * dhRaw[k];
}
auto applyUpdate = [&](Mat& W, const Mat& dW) {
for (size_t i = 0; i < W.size(); i++)
for (size_t j = 0; j < W[i].size(); j++)
W[i][j] -= lr * clamp(dW[i][j], clipVal);
};
applyUpdate(Whh, dWhh); applyUpdate(Wxh, dWxh); applyUpdate(Why, dWhy);
for (int i = 0; i < H; i++) bh[i] -= lr * clamp(dbh[i], clipVal);
for (int i = 0; i < O; i++) by[i] -= lr * clamp(dby[i], clipVal);
}

Java

/** Truncated BPTT for a vanilla RNN.
* Assumes rnnForward has already been called to produce hs and ys. */
public class BPTT {
static float clamp(float v, float t) { return Math.max(-t, Math.min(t, v)); }
public static void update(float[][] xs, float[][] targets,
float[][] hs, // hs[0]=h0 .. hs[T]=h_T
float[][] ys,
float[][] Wxh, float[][] Whh, float[][] Why,
float[] bh, float[] by,
float lr, float clipVal) {
int T = xs.length, H = bh.length, O = by.length, I = xs[0].length;
float[][] dWhh = new float[H][H], dWxh = new float[H][I],
dWhy = new float[O][H];
float[] dbh = new float[H], dby = new float[O], dhNext = new float[H];
for (int t = T-1; t >= 0; t--) {
float[] dy = new float[O];
for (int k = 0; k < O; k++) {
dy[k] = 2f * (ys[t][k] - targets[t][k]) / T;
dby[k] += dy[k];
for (int j = 0; j < H; j++) dWhy[k][j] += dy[k] * hs[t+1][j];
}
float[] dh = new float[H];
for (int j = 0; j < H; j++) {
dh[j] = dhNext[j];
for (int k = 0; k < O; k++) dh[j] += Why[k][j] * dy[k];
}
float[] dhRaw = new float[H];
for (int k = 0; k < H; k++) {
dhRaw[k] = dh[k] * (1f - hs[t+1][k] * hs[t+1][k]);
dbh[k] += dhRaw[k];
for (int j = 0; j < I; j++) dWxh[k][j] += dhRaw[k] * xs[t][j];
for (int j = 0; j < H; j++) dWhh[k][j] += dhRaw[k] * hs[t][j];
}
dhNext = new float[H];
for (int j = 0; j < H; j++)
for (int k = 0; k < H; k++)
dhNext[j] += Whh[k][j] * dhRaw[k];
}
for (int i = 0; i < H; i++) for (int j = 0; j < H; j++) Whh[i][j] -= lr * clamp(dWhh[i][j], clipVal);
for (int i = 0; i < H; i++) for (int j = 0; j < I; j++) Wxh[i][j] -= lr * clamp(dWxh[i][j], clipVal);
for (int i = 0; i < O; i++) for (int j = 0; j < H; j++) Why[i][j] -= lr * clamp(dWhy[i][j], clipVal);
for (int i = 0; i < H; i++) bh[i] -= lr * clamp(dbh[i], clipVal);
for (int i = 0; i < O; i++) by[i] -= lr * clamp(dby[i], clipVal);
}
}

Go

import (
"math"
)
func clamp(v, t float64) float64 {
if v > t { return t }
if v < -t { return -t }
return v
}
// BPTTUpdate performs one truncated-BPTT update step.
// hs has length T+1 (hs[0] = h0).
func BPTTUpdate(xs, targets [][]float64,
hs, ys [][]float64,
Wxh, Whh, Why [][]float64,
bh, by []float64,
lr, clipVal float64) {
T, H, O := len(xs), len(bh), len(by)
I := len(xs[0])
dWhh := make2D(H, H); dWxh := make2D(H, I); dWhy := make2D(O, H)
dbh := make([]float64, H); dby := make([]float64, O)
dhNext := make([]float64, H)
for t := T - 1; t >= 0; t-- {
dy := make([]float64, O)
for k := 0; k < O; k++ {
dy[k] = 2 * (ys[t][k] - targets[t][k]) / float64(T)
dby[k] += dy[k]
for j := 0; j < H; j++ { dWhy[k][j] += dy[k] * hs[t+1][j] }
}
dh := make([]float64, H)
for j := 0; j < H; j++ {
dh[j] = dhNext[j]
for k := 0; k < O; k++ { dh[j] += Why[k][j] * dy[k] }
}
dhRaw := make([]float64, H)
for k := 0; k < H; k++ {
dhRaw[k] = dh[k] * (1 - hs[t+1][k]*hs[t+1][k])
dbh[k] += dhRaw[k]
for j := 0; j < I; j++ { dWxh[k][j] += dhRaw[k] * xs[t][j] }
for j := 0; j < H; j++ { dWhh[k][j] += dhRaw[k] * hs[t][j] }
}
dhNext = make([]float64, H)
for j := 0; j < H; j++ {
for k := 0; k < H; k++ { dhNext[j] += Whh[k][j] * dhRaw[k] }
}
}
applyUpdate(Whh, dWhh, lr, clipVal)
applyUpdate(Wxh, dWxh, lr, clipVal)
applyUpdate(Why, dWhy, lr, clipVal)
for i := range bh { bh[i] -= lr * clamp(dbh[i], clipVal) }
for i := range by { by[i] -= lr * clamp(dby[i], clipVal) }
}
func make2D(r, c int) [][]float64 {
m := make([][]float64, r)
for i := range m { m[i] = make([]float64, c) }
return m
}
func applyUpdate(W, dW [][]float64, lr, clipVal float64) {
for i := range W {
for j := range W[i] { W[i][j] -= lr * clamp(dW[i][j], clipVal) }
}
}
var _ = math.Tanh // keep the math import used by rnnCell

4. The vanishing-gradient problem
The product of Jacobians is the trouble. If each Jacobian has spectral norm (largest singular value) $\sigma$, then the product of $n$ of them has norm roughly $\sigma^n$. For $\sigma < 1$, that tends to $0$ exponentially fast — the gradient vanishes. For $\sigma > 1$, it blows up exponentially — the gradient explodes. Either way, any dependency that spans more than a handful of steps is effectively invisible to the optimizer.
Bengio, Simard & Frasconi published the definitive analysis of this in 1994 ("Learning Long-Term Dependencies with Gradient Descent Is Difficult"). Their conclusion: vanilla RNNs, trained by gradient descent, simply cannot learn dependencies that span more than about 10–20 time steps. If you want a model to connect "The cats that the young girl with the green dress had been hoping to adopt… were from the shelter," you need a different architecture.
In a recurrent network, gradients flowing back through time pass through a product of Jacobians. Unless those Jacobians hover very near the identity, the product shrinks (or grows) exponentially with distance. Long-range dependencies therefore receive no useful learning signal.
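The demos that follow use a scalar recurrence; the same exponential behavior holds in the matrix case. Here is a minimal pure-Python sketch (the two 2×2 matrices are arbitrary illustrative choices) that tracks the Frobenius norm, an upper bound on the spectral norm, of the repeated Jacobian product:

```python
import math

def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def fro_norm(A):
    """Frobenius norm, an upper bound on the spectral norm."""
    return math.sqrt(sum(v * v for row in A for v in row))

def jacobian_product_demo(W, T=50):
    """Track ||W^t|| as t grows: the matrix analogue of |w|^T."""
    P = [[1.0, 0.0], [0.0, 1.0]]  # start from the identity
    for t in range(1, T + 1):
        P = matmul2(P, W)
        if t % 10 == 0:
            print(f"t = {t:3d}  ||W^t|| = {fro_norm(P):.3g}")

jacobian_product_demo([[0.6, 0.2], [0.1, 0.5]])  # largest eigenvalue 0.7: vanishes
jacobian_product_demo([[1.1, 0.2], [0.1, 1.0]])  # largest eigenvalue 1.2: explodes
```

With the largest eigenvalue below 1 the product norm decays like $0.7^t$; above 1 it grows like $1.2^t$, which is exactly the $\sigma^n$ behavior described above.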
Source — Vanishing Gradient Demo
// Show how gradient magnitude shrinks (or grows) over T steps
// for the scalar recurrence h_t = w * h_{t-1} (simplest RNN).
// The gradient of h_T w.r.t. h_0 is exactly w^T.
function vanishing_grad_demo(w, T):
grad = 1.0
for t in 1..T:
grad = grad * w // dh_T / dh_0 = w^T
print(t, abs(grad))
// With |w| < 1 → grad → 0 (vanishing)
// With |w| > 1 → grad → ∞ (exploding)

Python

def vanishing_gradient_demo(w=0.9, T=50):
"""
Demonstrate gradient magnitude for the scalar RNN h_t = w * h_{t-1}.
dL/dh_0 = dL/dh_T * w^T.
"""
print(f"w = {w}, T = {T}")
print(f"{'step':>5} {'|grad|':>12} {'note':<20}")
grad = 1.0
for t in range(1, T + 1):
grad *= w
note = ''
if abs(grad) < 1e-4:
note = '<-- near zero'
elif abs(grad) > 1e4:
note = '<-- exploded!'
if t <= 10 or t % 10 == 0:
print(f"{t:>5} {abs(grad):>12.6f} {note}")
# Try different w values:
vanishing_gradient_demo(w=0.9, T=50) # vanishes
vanishing_gradient_demo(w=1.05, T=50) # explodes
vanishing_gradient_demo(w=1.0, T=50) # constant — perfect transmission

JavaScript

function vanishingGradDemo(w = 0.9, T = 50) {
console.log(`w = ${w}, T = ${T}`);
let grad = 1.0;
for (let t = 1; t <= T; t++) {
grad *= w;
if (t <= 10 || t % 10 === 0) {
const note = Math.abs(grad) < 1e-4 ? ' <-- near zero'
: Math.abs(grad) > 1e4 ? ' <-- exploded!'
: '';
console.log(`step ${String(t).padStart(3)}: |grad| = ${Math.abs(grad).toFixed(6)}${note}`);
}
}
}
vanishingGradDemo(0.9, 50);
vanishingGradDemo(1.05, 50);
vanishingGradDemo(1.0, 50);

C

#include <stdio.h>
#include <math.h>
void vanishing_grad_demo(float w, int T) {
printf("w = %.3f, T = %d\n", w, T);
float grad = 1.0f;
for (int t = 1; t <= T; t++) {
grad *= w;
if (t <= 10 || t % 10 == 0) {
const char *note = fabsf(grad) < 1e-4f ? " <-- near zero"
: fabsf(grad) > 1e4f ? " <-- exploded!"
: "";
printf(" step %3d: |grad| = %12.6f%s\n", t, fabsf(grad), note);
}
}
}
int main(void) {
vanishing_grad_demo(0.90f, 50);
vanishing_grad_demo(1.05f, 50);
vanishing_grad_demo(1.00f, 50);
return 0;
}

C++

#include <iostream>
#include <iomanip>
#include <cmath>
void vanishingGradDemo(float w, int T) {
std::cout << "w = " << w << ", T = " << T << "\n";
float grad = 1.0f;
for (int t = 1; t <= T; t++) {
grad *= w;
if (t <= 10 || t % 10 == 0) {
std::string note;
if (std::abs(grad) < 1e-4f) note = " <-- near zero";
else if (std::abs(grad) > 1e4f) note = " <-- exploded!";
std::cout << " step " << std::setw(3) << t
<< ": |grad| = " << std::fixed << std::setprecision(6)
<< std::abs(grad) << note << "\n";
}
}
}
int main() {
vanishingGradDemo(0.90f, 50);
vanishingGradDemo(1.05f, 50);
vanishingGradDemo(1.00f, 50);
}

Java

public class VanishingGrad {
public static void demo(double w, int T) {
System.out.printf("w = %.3f, T = %d%n", w, T);
double grad = 1.0;
for (int t = 1; t <= T; t++) {
grad *= w;
if (t <= 10 || t % 10 == 0) {
String note = Math.abs(grad) < 1e-4 ? " <-- near zero"
: Math.abs(grad) > 1e4 ? " <-- exploded!" : "";
System.out.printf(" step %3d: |grad| = %12.6f%s%n",
t, Math.abs(grad), note);
}
}
}
public static void main(String[] args) {
demo(0.90, 50);
demo(1.05, 50);
demo(1.00, 50);
}
}

Go

package main
import (
"fmt"
"math"
)
func vanishingGradDemo(w float64, T int) {
fmt.Printf("w = %.3f, T = %d\n", w, T)
grad := 1.0
for t := 1; t <= T; t++ {
grad *= w
if t <= 10 || t%10 == 0 {
note := ""
if math.Abs(grad) < 1e-4 {
note = " <-- near zero"
} else if math.Abs(grad) > 1e4 {
note = " <-- exploded!"
}
fmt.Printf(" step %3d: |grad| = %12.6f%s\n", t, math.Abs(grad), note)
}
}
}
func main() {
vanishingGradDemo(0.90, 50)
vanishingGradDemo(1.05, 50)
vanishingGradDemo(1.00, 50)
}5. The LSTM cell
Sepp Hochreiter and Jürgen Schmidhuber's 1997 Long Short-Term Memory paper proposed a fix: build the cell so that, by default, information doesn't get multiplied as it passes through time. Instead of a single hidden state, an LSTM maintains two things:
- A cell state $c_t$: a "conveyor belt" that runs through time with only minor linear changes at each step. Because it's updated additively (not multiplicatively), gradients can flow through it without shrinking.
- A hidden state $h_t$: a filtered view of the cell state, used as the cell's actual output and as the input to the next step's gate computations.
Gates, intuitively
Access to the cell state is controlled by three learnable gates. Each gate is a small sigmoid network that outputs a vector of numbers in $[0, 1]$. A $0$ means "block completely," a $1$ means "let through completely," and anything in between is a soft partial pass.
- Forget gate $f_t$ — "Which parts of the old cell state should I erase?"
- Input gate $i_t$ — "Which parts of the new candidate values should I write into the cell state?"
- Output gate $o_t$ — "Which parts of the (possibly updated) cell state should I expose as the hidden state this step?"
Separating "what gets stored" from "what gets read" is the trick. The cell state is a kind of private memory; the gates decide how it's written, retained, and exposed.
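As a toy illustration of soft, per-slot gating (the numbers are arbitrary):

```python
memory = [3.0, -1.5, 0.8]   # a slice of cell state
gate   = [1.0, 0.5, 0.0]    # per-slot pass fractions, as a sigmoid would emit
gated  = [g * m for g, m in zip(gate, memory)]
print(gated)  # [3.0, -0.75, 0.0]
```

Slot 0 passes untouched, slot 1 is halved, slot 2 is blocked entirely; this elementwise product is what $\odot$ denotes in the LSTM equations.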
Gates, precisely

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
Every symbol, unpacked
- $\sigma$
- The logistic sigmoid $\sigma(z) = 1/(1+e^{-z})$, applied elementwise. Squashes any real number into $(0, 1)$. Perfect for a gate: the output is naturally a "how much should I let through" fraction.
- $\odot$
- Elementwise (Hadamard) product. $(a \odot b)_i = a_i b_i$. This is how a gate vector "multiplies" onto another vector — each gate coordinate controls its own memory slot.
- $[h_{t-1}, x_t]$
- Concatenation of the previous hidden state and the current input into one long vector. The shorthand lets each gate look at both "what was I thinking?" and "what's happening now?" in a single matrix multiply.
- $c_t$
- Cell state at time $t$. This is the long-term memory conveyor belt. Note how it's updated: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. No $W$ multiplied into the old state — just an elementwise scaling by a number in $[0,1]$. That's why gradients can ride this line a long way.
- $\tilde{c}_t$
- "Candidate" new memory — the values the network would like to write in if the input gate lets it.
- $f_t$
- Forget gate. $f_t \approx 1$ means "keep the old cell state almost unchanged"; $f_t \approx 0$ means "wipe it."
- $i_t$
- Input gate. Decides how strongly $\tilde{c}_t$ is written into $c_t$.
- $o_t$
- Output gate. Controls which parts of $\tanh(c_t)$ become visible to the rest of the network.
- $W_f, W_i, W_c, W_o$
- The four learnable weight matrices (one per gate/candidate). Each has shape hidden × (hidden + input). An LSTM has roughly 4× as many parameters as a vanilla RNN of the same hidden size.
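The rough 4× figure is easy to verify by counting matrix entries. A quick sketch (the hidden and input sizes are arbitrary; output-layer weights are excluded on both sides):

```python
def rnn_cell_params(H, I):
    """Vanilla RNN cell: W_hh (H x H) + W_xh (H x I) + b_h (H)."""
    return H * H + H * I + H

def lstm_cell_params(H, I):
    """LSTM cell: four H x (H+I) matrices plus four H-length biases."""
    return 4 * (H * (H + I) + H)

H, I = 256, 128
print(rnn_cell_params(H, I))                           # 98560
print(lstm_cell_params(H, I))                          # 394240
print(lstm_cell_params(H, I) / rnn_cell_params(H, I))  # 4.0
```

Counting only the recurrent cell, the ratio is exactly 4; it is "roughly" 4× in a full model because the output layer is the same size in both.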
Analogy Imagine a note-taking assistant with a long scrolling notepad (the cell state $c$). At every new sentence, the assistant has three decisions, each made one slot at a time:
(a) Forget gate: for each line on the pad, keep it or cross it out.
(b) Input gate: for each line, decide how much of the new sentence's draft ($\tilde{c}$) to write into that slot.
(c) Output gate: for each line, decide whether to show it to the reader right now, or keep it private for later.
The pad persists across the whole story, and because updates are additive rather than multiplicative, something you wrote down on page 1 can still be legible on page 100.
The key equation to stare at is the cell-state update: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. If the forget gate learns to be close to 1 for a particular memory slot, information in that slot gets copied forward unchanged, step after step. Gradients flowing backward through that slot are multiplied by roughly 1 at every step — no vanishing. This is the "constant error carousel" that Hochreiter & Schmidhuber were building toward.
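The carousel can be seen in a one-line scalar sketch. Assume (for illustration only) that the forget gate holds a constant value f at one slot; then the gradient of $c_T$ with respect to $c_0$ along the cell-state line is just $f^T$:

```python
def carousel_gradient(f, T):
    """Gradient of c_T w.r.t. c_0 through c_t = f * c_{t-1} + (gated input):
    each step contributes a factor f, so the result is f ** T."""
    grad = 1.0
    for _ in range(T):
        grad *= f
    return grad

print(carousel_gradient(1.00, 100))  # 1.0: perfect transmission
print(carousel_gradient(0.99, 100))  # ~0.37: still usable after 100 steps
print(carousel_gradient(0.50, 100))  # ~7.9e-31: effectively gone
```

Contrast this with the vanilla RNN, where every step also multiplies by $W_{hh}$ and a tanh derivative; a learned gate near 1 is what keeps the factor pinned to 1.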
Source — LSTM Cell
// LSTM cell equations (Hochreiter & Schmidhuber, 1997)
// concat = [h_{t-1}, x_t] (concatenate vectors)
f_t = sigmoid(W_f * concat + b_f) // forget gate
i_t = sigmoid(W_i * concat + b_i) // input gate
g_t = tanh(W_g * concat + b_g) // candidate cell values
o_t = sigmoid(W_o * concat + b_o) // output gate
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t // cell state update (⊙ = elementwise)
h_t = o_t ⊙ tanh(c_t)             // hidden state output

Python

import math
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))
def matvec(W, x):
return [sum(w * xi for w, xi in zip(row, x)) for row in W]
def lstm_cell(x_t, h_prev, c_prev,
Wf, bf, Wi, bi, Wg, bg, Wo, bo):
"""Single LSTM step.
x_t: input vector
h_prev: previous hidden state
c_prev: previous cell state
W*: weight matrices (hidden+input rows)
b*: bias vectors
Returns (h_t, c_t).
"""
concat = h_prev + x_t # concatenate vectors
    f = [sigmoid(z + b) for z, b in zip(matvec(Wf, concat), bf)]
    i = [sigmoid(z + b) for z, b in zip(matvec(Wi, concat), bi)]
    g = [math.tanh(z + b) for z, b in zip(matvec(Wg, concat), bg)]
    o = [sigmoid(z + b) for z, b in zip(matvec(Wo, concat), bo)]
c_t = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
h_t = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c_t)]
return h_t, c_tconst sigmoid = z => 1 / (1 + Math.exp(-z));
function matvec(W, x) {
return W.map(row => row.reduce((s, w, j) => s + w * x[j], 0));
}
function lstmCell(xt, hPrev, cPrev,
Wf, bf, Wi, bi, Wg, bg, Wo, bo) {
const concat = [...hPrev, ...xt];
const apply = (W, b, fn) => matvec(W, concat).map((z, k) => fn(z + b[k]));
const f = apply(Wf, bf, sigmoid);
const i = apply(Wi, bi, sigmoid);
const g = apply(Wg, bg, Math.tanh);
const o = apply(Wo, bo, sigmoid);
const cT = cPrev.map((c, k) => f[k] * c + i[k] * g[k]);
const hT = cT.map((c, k) => o[k] * Math.tanh(c));
return { hT, cT };
}

#include <math.h>
#include <string.h>
static float sig(float z) { return 1.0f / (1.0f + expf(-z)); }
/* H = hidden dim, I = input dim.
Wf/Wi/Wg/Wo: [H * (H+I)] each. bf/bi/bg/bo: [H] each.
h_out[H], c_out[H]. */
void lstm_cell(const float *x, const float *h_prev, const float *c_prev,
const float *Wf, const float *bf,
const float *Wi, const float *bi,
const float *Wg, const float *bg,
const float *Wo, const float *bo,
int H, int I,
float *h_out, float *c_out) {
int HI = H + I;
float cat[HI];
memcpy(cat, h_prev, sizeof(float) * H);
memcpy(cat + H, x, sizeof(float) * I);
float f[H], ig[H], g[H], o[H], c[H];
for (int k = 0; k < H; k++) {
float zf = bf[k], zi = bi[k], zg = bg[k], zo = bo[k];
for (int j = 0; j < HI; j++) {
zf += Wf[k*HI + j] * cat[j];
zi += Wi[k*HI + j] * cat[j];
zg += Wg[k*HI + j] * cat[j];
zo += Wo[k*HI + j] * cat[j];
}
f[k] = sig(zf);
ig[k] = sig(zi);
g[k] = tanhf(zg);
o[k] = sig(zo);
c[k] = f[k] * c_prev[k] + ig[k] * g[k];
h_out[k] = o[k] * tanhf(c[k]);
c_out[k] = c[k];
}
}

#include <vector>
#include <cmath>
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;
static float sig(float z) { return 1.0f / (1.0f + std::exp(-z)); }
static Vec gateOutput(const Mat& W, const Vec& b,
const Vec& cat, float (*fn)(float)) {
Vec out(W.size());
for (size_t k = 0; k < W.size(); k++) {
float z = b[k];
for (size_t j = 0; j < cat.size(); j++) z += W[k][j] * cat[j];
out[k] = fn(z);
}
return out;
}
std::pair<Vec, Vec> lstmCell(const Vec& xt, const Vec& hPrev, const Vec& cPrev,
const Mat& Wf, const Vec& bf,
const Mat& Wi, const Vec& bi,
const Mat& Wg, const Vec& bg,
const Mat& Wo, const Vec& bo) {
Vec cat;
cat.insert(cat.end(), hPrev.begin(), hPrev.end());
cat.insert(cat.end(), xt.begin(), xt.end());
auto f = gateOutput(Wf, bf, cat, sig);
auto i = gateOutput(Wi, bi, cat, sig);
auto g = gateOutput(Wg, bg, cat, std::tanh);
auto o = gateOutput(Wo, bo, cat, sig);
Vec cT(cPrev.size()), hT(cPrev.size());
for (size_t k = 0; k < cT.size(); k++) {
cT[k] = f[k] * cPrev[k] + i[k] * g[k];
hT[k] = o[k] * std::tanh(cT[k]);
}
return {hT, cT};
}

public class LSTMCell {
static double sig(double z) { return 1.0 / (1.0 + Math.exp(-z)); }
static double[] matvec(double[][] W, double[] x) {
double[] out = new double[W.length];
for (int i = 0; i < W.length; i++)
for (int j = 0; j < x.length; j++)
out[i] += W[i][j] * x[j];
return out;
}
/** Returns [h_t, c_t] where each is a double[H]. */
public static double[][] cell(double[] xt, double[] hPrev, double[] cPrev,
double[][] Wf, double[] bf,
double[][] Wi, double[] bi,
double[][] Wg, double[] bg,
double[][] Wo, double[] bo) {
int H = hPrev.length, I = xt.length;
double[] cat = new double[H + I];
System.arraycopy(hPrev, 0, cat, 0, H);
System.arraycopy(xt, 0, cat, H, I);
double[] zf = matvec(Wf, cat), zi = matvec(Wi, cat),
zg = matvec(Wg, cat), zo = matvec(Wo, cat);
double[] cT = new double[H], hT = new double[H];
for (int k = 0; k < H; k++) {
double f = sig(zf[k] + bf[k]);
double ig = sig(zi[k] + bi[k]);
double g = Math.tanh(zg[k] + bg[k]);
double o = sig(zo[k] + bo[k]);
cT[k] = f * cPrev[k] + ig * g;
hT[k] = o * Math.tanh(cT[k]);
}
return new double[][] {hT, cT};
}
}

import "math"
func sigm(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }
func gateOut(W [][]float64, b, cat []float64, fn func(float64) float64) []float64 {
out := make([]float64, len(W))
for k, row := range W {
z := b[k]
for j, w := range row { z += w * cat[j] }
out[k] = fn(z)
}
return out
}
// LSTMCell computes one LSTM step.
// Returns (h_t, c_t).
func LSTMCell(xt, hPrev, cPrev []float64,
Wf [][]float64, bf []float64,
Wi [][]float64, bi []float64,
Wg [][]float64, bg []float64,
Wo [][]float64, bo []float64) ([]float64, []float64) {
cat := append(append([]float64{}, hPrev...), xt...)
f := gateOut(Wf, bf, cat, sigm)
ig := gateOut(Wi, bi, cat, sigm)
g := gateOut(Wg, bg, cat, math.Tanh)
o := gateOut(Wo, bo, cat, sigm)
H := len(cPrev)
cT := make([]float64, H)
hT := make([]float64, H)
for k := 0; k < H; k++ {
cT[k] = f[k]*cPrev[k] + ig[k]*g[k]
hT[k] = o[k] * math.Tanh(cT[k])
}
return hT, cT
}

Worked example: watching an LSTM carry information forward
Consider a 4-slot LSTM cell state fed a toy input sequence, visualized as a heatmap: each row is one of the four memory slots, each column is a time step, and each square shows the cell-state value for one slot at one step (brighter cyan = larger positive value). The gates are fixed so that slot 1 has a forget gate near 1.0, meaning anything written there sticks around, while slot 2 has a forget gate near 0.2 and decays to zero within a step or two. Slots 3 and 4 sit in between.
At each step, the input gate writes the new token's contribution to all four slots in equal amounts, but the forget gate decides how much of the old state survives. Over 10 steps, slot 1 builds up a long-term summary while slot 2 only remembers "right now."
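This behavior can be reproduced in a few lines. The per-slot forget-gate values and the fixed write amount below are illustrative choices of mine, not the demo's exact numbers:

```python
forget = [0.99, 0.2, 0.6, 0.8]   # per-slot forget gates: slot 1 persistent, slot 2 short-term
write = 0.5                      # the i_t * g_t contribution, identical for every slot

c = [0.0, 0.0, 0.0, 0.0]         # initial cell state
for step in range(10):
    c = [f * ck + write for f, ck in zip(forget, c)]
# Slot 1 keeps accumulating toward write / (1 - f) = 50,
# while slot 2 settles at 0.5 / 0.8 = 0.625 almost immediately.
```

After ten steps slot 1 holds roughly 4.8 and is still growing, while slot 2 reached its fixed point of 0.625 within the first couple of steps: the same long-term versus short-term split the heatmap shows.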
6. GRU: a leaner variant
The LSTM has four gates' worth of parameters and two state vectors per cell. That's a lot of moving parts. In 2014, Cho et al. proposed the Gated Recurrent Unit (GRU): merge the forget and input gates into a single "update gate," drop the separate cell state, and cut the parameter count by roughly a quarter.
GRU glossary
- $z_t$
- Update gate. Replaces the LSTM's forget+input pair with a single "how much of the old state to replace." The form $(1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ is a linear interpolation — $z_t = 0$ keeps everything, $z_t = 1$ replaces everything.
- $r_t$
- Reset gate. Decides how much of the previous hidden state contributes to the candidate. Lets the GRU "start fresh" on phrase boundaries without waiting for the update gate to overwrite.
- $\tilde{h}_t$
- Candidate new hidden state — the values the cell would move toward.
Empirically, GRUs tend to match LSTMs on most tasks while being lighter and faster. The Transformer era has largely made both of them historical, but if you're running an RNN on a microcontroller, the GRU is usually the right pick.
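The parameter savings are easy to count under the concatenated-input formulation used throughout this section, where each gate is an $H \times (H+I)$ weight matrix plus an $H$-dimensional bias:

```python
def lstm_param_count(H, I):
    # four weight matrices: forget, input, candidate, output
    return 4 * (H * (H + I) + H)

def gru_param_count(H, I):
    # three weight matrices: update, reset, candidate
    return 3 * (H * (H + I) + H)
```

The ratio is exactly 3/4 regardless of dimensions: for H = 256, I = 128 that is 394,240 LSTM parameters versus 295,680 for the GRU, a 25% reduction.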
Source — GRU Cell
// GRU cell (Cho et al., 2014)
// concat(a, b) = concatenate two vectors
z_t = sigmoid(W_z * [h_{t-1}, x_t]) // update gate
r_t = sigmoid(W_r * [h_{t-1}, x_t]) // reset gate
// candidate: note r_t ⊙ h_{t-1} before concatenating with x_t
h_cand = tanh(W_h * [r_t ⊙ h_{t-1}, x_t]) // candidate hidden state
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h_cand // interpolate old and new

import math
def sigmoid(z): return 1.0 / (1.0 + math.exp(-z))
def gru_cell(x_t, h_prev, Wz, Wr, Wh):
    """Single GRU step.
    x_t: input vector (list)
    h_prev: previous hidden state (list)
    Wz, Wr, Wh: weight matrices (list of rows); biases absorbed into last column.
    Returns h_t.
    """
    cat = h_prev + x_t + [1.0]  # [h_prev | x_t | 1]; the trailing 1 picks up the bias column
    def gate(W, fn):
        return [fn(sum(w * v for w, v in zip(row, cat))) for row in W]
    z = gate(Wz, sigmoid)
    r = gate(Wr, sigmoid)
    # reset-gated concat
    r_h = [ri * hi for ri, hi in zip(r, h_prev)]
    cat2 = r_h + x_t + [1.0]
    h_cand = [math.tanh(sum(w * v for w, v in zip(row, cat2))) for row in Wh]
    h_t = [(1 - zi) * hi + zi * hc
           for zi, hi, hc in zip(z, h_prev, h_cand)]
    return h_t

const sigmoid = z => 1 / (1 + Math.exp(-z));
function matvec(W, x) {
return W.map(row => row.reduce((s, w, j) => s + w * x[j], 0));
}
function gruCell(xt, hPrev, Wz, Wr, Wh) {
const cat = [...hPrev, ...xt];
const z = matvec(Wz, cat).map(sigmoid);
const r = matvec(Wr, cat).map(sigmoid);
const rh = hPrev.map((h, k) => r[k] * h);
const cat2 = [...rh, ...xt];
const hCand = matvec(Wh, cat2).map(Math.tanh);
return hPrev.map((h, k) => (1 - z[k]) * h + z[k] * hCand[k]);
}

#include <math.h>
#include <string.h>
static float sig(float x) { return 1.0f / (1.0f + expf(-x)); }
/* H = hidden dim, I = input dim.
Wz, Wr, Wh: [H * (H+I)] each. h_out[H]. */
void gru_cell(const float *x, const float *h_prev,
const float *Wz, const float *Wr, const float *Wh,
int H, int I, float *h_out) {
int HI = H + I;
float cat[HI], z[H], r[H], rh[H], cat2[HI], hcand[H];
memcpy(cat, h_prev, sizeof(float) * H);
memcpy(cat + H, x, sizeof(float) * I);
for (int k = 0; k < H; k++) {
float zz = 0, zr = 0;
for (int j = 0; j < HI; j++) { zz += Wz[k*HI+j]*cat[j]; zr += Wr[k*HI+j]*cat[j]; }
z[k] = sig(zz);
r[k] = sig(zr);
rh[k] = r[k] * h_prev[k];
}
memcpy(cat2, rh, sizeof(float) * H);
memcpy(cat2 + H, x, sizeof(float) * I);
for (int k = 0; k < H; k++) {
float zh = 0;
for (int j = 0; j < HI; j++) zh += Wh[k*HI+j] * cat2[j];
hcand[k] = tanhf(zh);
h_out[k] = (1.0f - z[k]) * h_prev[k] + z[k] * hcand[k];
}
}

#include <vector>
#include <cmath>
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;
static float sig(float z) { return 1.0f / (1.0f + std::exp(-z)); }
static Vec matvec(const Mat& W, const Vec& x) {
Vec o(W.size(), 0);
for (size_t i = 0; i < W.size(); i++)
for (size_t j = 0; j < x.size(); j++) o[i] += W[i][j] * x[j];
return o;
}
Vec gruCell(const Vec& xt, const Vec& hPrev,
const Mat& Wz, const Mat& Wr, const Mat& Wh) {
Vec cat;
cat.insert(cat.end(), hPrev.begin(), hPrev.end());
cat.insert(cat.end(), xt.begin(), xt.end());
auto applyGate = [&](const Mat& W, float (*fn)(float)) {
auto z = matvec(W, cat);
for (auto& v : z) v = fn(v);
return z;
};
auto z = applyGate(Wz, sig);
auto r = applyGate(Wr, sig);
Vec rh(hPrev.size());
for (size_t k = 0; k < hPrev.size(); k++) rh[k] = r[k] * hPrev[k];
Vec cat2;
cat2.insert(cat2.end(), rh.begin(), rh.end());
cat2.insert(cat2.end(), xt.begin(), xt.end());
auto hCandRaw = matvec(Wh, cat2);
Vec hT(hPrev.size());
for (size_t k = 0; k < hT.size(); k++) {
float hc = std::tanh(hCandRaw[k]);
hT[k] = (1.0f - z[k]) * hPrev[k] + z[k] * hc;
}
return hT;
}

public class GRUCell {
static double sig(double z) { return 1.0 / (1.0 + Math.exp(-z)); }
static double[] matvec(double[][] W, double[] x) {
double[] out = new double[W.length];
for (int i = 0; i < W.length; i++)
for (int j = 0; j < x.length; j++)
out[i] += W[i][j] * x[j];
return out;
}
public static double[] cell(double[] xt, double[] hPrev,
double[][] Wz, double[][] Wr, double[][] Wh) {
int H = hPrev.length, I = xt.length;
double[] cat = new double[H + I];
System.arraycopy(hPrev, 0, cat, 0, H);
System.arraycopy(xt, 0, cat, H, I);
double[] zRaw = matvec(Wz, cat), rRaw = matvec(Wr, cat);
double[] z = new double[H], r = new double[H];
for (int k = 0; k < H; k++) { z[k] = sig(zRaw[k]); r[k] = sig(rRaw[k]); }
double[] cat2 = new double[H + I];
for (int k = 0; k < H; k++) cat2[k] = r[k] * hPrev[k];
System.arraycopy(xt, 0, cat2, H, I);
double[] hcRaw = matvec(Wh, cat2);
double[] hT = new double[H];
for (int k = 0; k < H; k++) {
double hc = Math.tanh(hcRaw[k]);
hT[k] = (1 - z[k]) * hPrev[k] + z[k] * hc;
}
return hT;
}
}

import "math"
func sigm64(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }
func mv(W [][]float64, x []float64) []float64 {
out := make([]float64, len(W))
for i, row := range W {
for j, w := range row { out[i] += w * x[j] }
}
return out
}
// GRUCell computes one GRU step and returns h_t.
func GRUCell(xt, hPrev []float64,
Wz, Wr, Wh [][]float64) []float64 {
cat := append(append([]float64{}, hPrev...), xt...)
applyGate := func(W [][]float64, fn func(float64) float64) []float64 {
out := mv(W, cat)
for i := range out { out[i] = fn(out[i]) }
return out
}
z := applyGate(Wz, sigm64)
r := applyGate(Wr, sigm64)
rh := make([]float64, len(hPrev))
for k := range rh { rh[k] = r[k] * hPrev[k] }
cat2 := append(append([]float64{}, rh...), xt...)
hcRaw := mv(Wh, cat2)
hT := make([]float64, len(hPrev))
for k := range hT {
hc := math.Tanh(hcRaw[k])
hT[k] = (1-z[k])*hPrev[k] + z[k]*hc
}
return hT
}

7. Source code — build one yourself
A vanilla RNN fits in a dozen lines in almost any language. Here is the same minimal forward pass written four ways. The inputs are the same in each: a list of token vectors xs, learned matrices Wxh, Whh, Why, and bias vectors.
Source — Character-Level RNN
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
    # xs: list of input vectors, one per time step
    # h0: initial hidden state (zeros, usually)
    h = h0
    ys = []
    for x in xs:
        h = np.tanh(Whh @ h + Wxh @ x + bh)
        y = Why @ h + by
        ys.append(y)
    return ys, h
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.Wxh = nn.Linear(input_size, hidden_size, bias=False)
        self.Whh = nn.Linear(hidden_size, hidden_size)
        self.Why = nn.Linear(hidden_size, output_size)

    def forward(self, xs, h=None):
        # xs: (T, B, input_size)
        T, B, _ = xs.shape
        if h is None:
            h = torch.zeros(B, self.Whh.out_features, device=xs.device)
        ys = []
        for t in range(T):
            h = torch.tanh(self.Whh(h) + self.Wxh(xs[t]))
            ys.append(self.Why(h))
        return torch.stack(ys), h
import jax, jax.numpy as jnp
from jax import lax

def rnn_forward(params, xs, h0):
    Wxh, Whh, Why, bh, by = params

    def step(h, x):
        h_new = jnp.tanh(Whh @ h + Wxh @ x + bh)
        y = Why @ h_new + by
        return h_new, y

    # lax.scan is the loop-free, XLA-friendly idiom
    h_final, ys = lax.scan(step, h0, xs)
    return ys, h_final
/* Pedagogical C — no BLAS, one matrix-vector at a time. */
void rnn_forward(
const float *xs, // [T][I]
const float *Wxh, // [H][I]
const float *Whh, // [H][H]
const float *Why, // [O][H]
const float *bh, // [H]
const float *by, // [O]
int T, int I, int H, int O,
float *ys // [T][O] output
) {
float h[H], h_new[H];
for (int i = 0; i < H; i++) h[i] = 0.f;
for (int t = 0; t < T; t++) {
// h_new = tanh(Whh h + Wxh x + bh)
for (int k = 0; k < H; k++) {
float z = bh[k];
for (int j = 0; j < H; j++) z += Whh[k*H + j] * h[j];
for (int j = 0; j < I; j++) z += Wxh[k*I + j] * xs[t*I + j];
h_new[k] = tanhf(z);
}
for (int i = 0; i < H; i++) h[i] = h_new[i];
// y = Why h + by
for (int k = 0; k < O; k++) {
float z = by[k];
for (int j = 0; j < H; j++) z += Why[k*H + j] * h[j];
ys[t*O + k] = z;
}
}
}
Note the shared structure: three matrix-vector multiplies, a $\tanh$, and a loop over time. Everything else — GPU tiling, batching, mixed-precision — is performance dressing around this three-line inner loop.
8. Legacy and successors
LSTMs and GRUs dominated sequence modelling for about fifteen years. They powered the first wave of neural machine translation (seq2seq, 2014), Siri's speech recognition, early language models like Mikolov's RNNLM, and countless time-series forecasters. Even today, when the model has to run on an embedded chip or inference has to happen a byte at a time, a GRU is often still the right answer.
The thing they could not do is parallelize over the time axis. An RNN has to process step $t$ before step $t+1$ because $h_t$ is an input to the computation of $h_{t+1}$. On a GPU, that's a brutal limitation: you can't feed all 2048 tokens of a context window through at once; you have to march. The Transformer (2017) replaced the recurrence with self-attention and got rid of the sequential bottleneck — every token in the sequence could be processed in parallel on the forward pass, and the "how do distant tokens communicate?" question was answered by direct content-based lookups rather than by threading information through a chain of hidden states. That is why the Transformer won, and why LSTMs are no longer the default for language. But every idea in the Transformer — attention, residual connections, layer norm — was first tried inside LSTMs by people trying to squeeze out more range. The LSTM taught the field what mattered.
9. Summary
- RNNs process a sequence one element at a time, carrying a fixed-size hidden state forward as a summary of everything seen so far. The same weight matrices are used at every step.
- Training uses backprop through time, which unrolls the recurrence and sums gradient contributions across steps.
- Vanilla RNNs suffer from the vanishing-gradient problem: the product of Jacobians flowing back in time either shrinks or blows up exponentially, making long-range dependencies unlearnable.
- The LSTM introduces a separate cell state updated additively (not multiplicatively), gated by three learnable sigmoids — forget, input, output. Information can ride the cell state for hundreds of steps without degradation.
- The GRU merges gates to cut parameter count; usually matches LSTM performance.
- Transformers replaced recurrence with self-attention in 2017, solving both vanishing gradients (no unbounded Jacobian product) and the parallelization bottleneck in one move. RNNs survive in edge/on-device use cases.
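The exponential shrinkage in the vanishing-gradient bullet can be verified directly for a scalar tanh RNN $h_t = \tanh(w\,h_{t-1})$, where each step of backprop through time contributes a factor $\partial h_t / \partial h_{t-1} = w(1 - h_t^2)$. A minimal sketch, with the weight chosen for illustration:

```python
import math

def bptt_gradient(T, w=0.9):
    """Product of per-step Jacobians for a scalar tanh RNN over T steps."""
    h, grad = 1.0, 1.0
    for _ in range(T):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)   # d h_t / d h_{t-1}
    return grad
```

With $|w| < 1$ the product decays toward zero exponentially in $T$; an LSTM cell-state slot replaces this per-step factor with the forget gate $f_t$, which the network can learn to hold near 1.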
Further reading
- Rumelhart, Hinton & Williams (1986) — Learning representations by back-propagating errors.
- Elman (1990) — Finding structure in time.
- Bengio, Simard & Frasconi (1994) — Learning Long-Term Dependencies with Gradient Descent Is Difficult.
- Hochreiter & Schmidhuber (1997) — Long Short-Term Memory.
- Cho et al. (2014) — Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.
- Olah (2015) — Understanding LSTM Networks (colah's blog).
- Karpathy (2015) — The Unreasonable Effectiveness of Recurrent Neural Networks.