Gradient Descent
The optimization algorithm sitting underneath every neural network on Earth. Three lines of math, and an enormous amount of subtlety about how to make them work on real problems.
1. The engine of learning
Every learning algorithm in deep learning boils down to the same loop:
1. Pick a loss function $L(\boldsymbol{\theta})$ that measures how wrong the model is, parameterized by weights $\boldsymbol{\theta}$.
2. Compute the gradient $\nabla_{\boldsymbol{\theta}} L$.
3. Take a small step against the gradient.
4. Repeat.
Step 2 is where backpropagation lives. Step 3 — the "take a small step" part — is gradient descent. It has been around since Cauchy (1847), and it is the single reason deep learning is an engineering discipline rather than a branch of philosophy.
The gradient is the direction of steepest increase of a function. The negative gradient is the direction of steepest decrease. If your loss surface is reasonably smooth, your step size is reasonably small, and you are not already at a stationary point, moving in the negative-gradient direction is guaranteed to decrease the loss. Repeat enough times and you land in a (local) minimum.
2. What the gradient is
If $L : \mathbb{R}^n \to \mathbb{R}$ is a scalar-valued function of $n$ variables, its gradient at a point $\boldsymbol{\theta}$ is the vector of partial derivatives:

$$\nabla_{\boldsymbol{\theta}} L = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \ldots, \frac{\partial L}{\partial \theta_n} \right)$$
Two properties matter:
- Direction. $\nabla L$ points in the direction in which $L$ increases the fastest.
- Magnitude. $\|\nabla L\|$ measures how steeply $L$ is increasing in that direction. A small gradient means "it's almost flat here"; a huge gradient means "we're on a cliff."
A quick derivation of the "direction of steepest ascent" claim: for a small step $\epsilon \boldsymbol{v}$ with unit direction $\|\boldsymbol{v}\| = 1$ and small length $\epsilon > 0$, a first-order Taylor expansion gives

$$L(\boldsymbol{\theta} + \epsilon \boldsymbol{v}) \approx L(\boldsymbol{\theta}) + \epsilon \, \boldsymbol{v} \cdot \nabla L(\boldsymbol{\theta})$$
The change in $L$ is maximized when $\boldsymbol{v}$ is aligned with $\nabla L$ (i.e. $\boldsymbol{v} = \nabla L / \|\nabla L\|$), and minimized — most negative — when $\boldsymbol{v}$ is anti-aligned. That's why we move in the direction $-\nabla L$ to minimize $L$.
Let $L(\theta_1, \theta_2) = (\theta_1 - 2)^2 + 3(\theta_2 + 1)^2$.
$\partial L / \partial \theta_1 = 2(\theta_1 - 2)$
$\partial L / \partial \theta_2 = 6(\theta_2 + 1)$
At $\boldsymbol{\theta} = (0, 0)$: $\nabla L = [-4, 6]$. The steepest-descent direction is $[4, -6]$. The gradient vanishes (is zero) exactly at the minimum $\boldsymbol{\theta}^* = (2, -1)$.
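Both claims in the worked example are easy to sanity-check numerically: a central finite difference approximates each partial derivative and should agree with the analytic gradient. A quick sketch (the helper names here are ours, not from the text):

```python
import numpy as np

def L(theta):
    # L(theta1, theta2) = (theta1 - 2)^2 + 3*(theta2 + 1)^2
    return (theta[0] - 2) ** 2 + 3 * (theta[1] + 1) ** 2

def grad_L(theta):
    # Analytic gradient from the partial derivatives above
    return np.array([2 * (theta[0] - 2), 6 * (theta[1] + 1)])

def numerical_grad(f, theta, h=1e-6):
    # Central differences: (f(x + h*e_i) - f(x - h*e_i)) / (2h)
    g = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        e = np.zeros_like(theta, dtype=float)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([0.0, 0.0])
print(grad_L(theta))               # [-4.  6.]
print(numerical_grad(L, theta))    # ~[-4.  6.]
```

The same check works on any loss you can evaluate, which makes it a handy debugging tool for hand-derived gradients.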
3. The update rule
Gradient descent is one line:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta \, \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_t)$$
The scalar $\eta > 0$ is the learning rate (sometimes called the step size). It is a hyperparameter: you set it by hand and tune it. Everything interesting about gradient descent — convergence speed, stability, whether it works at all — depends on this number.
A function is convex if any line segment between two points on its graph lies on or above the graph. Formally, $L$ is convex if for all $\boldsymbol{\theta}, \boldsymbol{\theta}'$ and $\lambda \in [0, 1]$: $L(\lambda \boldsymbol{\theta} + (1 - \lambda) \boldsymbol{\theta}') \le \lambda L(\boldsymbol{\theta}) + (1 - \lambda) L(\boldsymbol{\theta}')$. A convex function has exactly one "valley" — a single global minimum (or a connected set of minima).
For convex, sufficiently smooth $L$ and a learning rate below a certain critical value, gradient descent is guaranteed to converge to the global minimum. Neural network losses are not convex — they have thousands of local minima and saddle points — but empirically the minima SGD finds are "good enough" for deep networks. We'll come back to this.
Source — Gradient Descent
function gradient_descent(f, grad_f, x0, lr, n_iter):
x = x0
for t in 1..n_iter:
g = grad_f(x) # compute gradient at current point
x = x - lr * g # take step against the gradient
print(t, x, f(x))
return x
# Example: minimize L(theta) = theta^2 - 2*theta + 3
# grad_L(theta) = 2*theta - 2
# minimum at theta* = 1, L(1) = 2

def gradient_descent(grad_f, x0, lr=0.1, n_iter=100, tol=1e-7):
"""
Generic gradient descent.
grad_f: callable, takes x and returns gradient (same shape as x)
x0: starting point (float or np.ndarray)
lr: learning rate (step size)
Returns the final x and the history of iterates.
"""
import numpy as np
x = np.array(x0, dtype=float)
history = []
for t in range(n_iter):
g = grad_f(x)
x = x - lr * g
history.append(x.copy())
if np.linalg.norm(g) < tol:
print(f"Converged at iteration {t+1}")
break
return x, history
# --- 1D example: L(theta) = theta^2 - 2*theta + 3 ---
def grad_L1d(theta): return 2*theta - 2
x_star, hist = gradient_descent(grad_L1d, x0=5.0, lr=0.3, n_iter=50)
print(f"Minimum found at theta = {float(x_star):.6f} (true: 1.0)")
# --- Multi-dimensional example: L(x) = x1^2 + 10*x2^2 ---
import numpy as np
def grad_bowl(x): return np.array([2*x[0], 20*x[1]])
x_star2, hist2 = gradient_descent(grad_bowl, x0=np.array([3.0, 2.0]), lr=0.05, n_iter=200)
print(f"2D minimum: {x_star2}")

/**
* Generic gradient descent for scalar or array-valued theta.
* gradF: (theta) => gradient (same type as theta)
* x0: starting value (number or number[])
* Returns final theta.
*/
function gradientDescent(gradF, x0, lr = 0.1, nIter = 100, tol = 1e-7) {
let x = Array.isArray(x0) ? [...x0] : x0;
const isArray = Array.isArray(x0);
for (let t = 0; t < nIter; t++) {
const g = gradF(x);
if (isArray) {
const norm = Math.sqrt(g.reduce((s, gi) => s + gi * gi, 0));
if (norm < tol) { console.log(`Converged at iter ${t+1}`); break; }
x = x.map((xi, i) => xi - lr * g[i]);
} else {
if (Math.abs(g) < tol) { console.log(`Converged at iter ${t+1}`); break; }
x = x - lr * g;
}
}
return x;
}
// 1D: L(theta) = theta^2 - 2*theta + 3
const theta = gradientDescent(t => 2*t - 2, 5.0, 0.3, 50);
console.log("1D minimum:", theta.toFixed(6)); // expect ~1.0
// 2D: L(x) = x1^2 + 10*x2^2
const x2d = gradientDescent(
([x1, x2]) => [2*x1, 20*x2],
[3.0, 2.0], 0.05, 200
);
console.log("2D minimum:", x2d.map(v => v.toFixed(6)));

#include <stdio.h>
#include <math.h>
/* Generic 1-D gradient descent */
double gradient_descent_1d(
double (*f)(double),
double (*grad_f)(double),
double x0, double lr, int n_iter, double tol)
{
double x = x0;
for (int t = 0; t < n_iter; t++) {
double g = grad_f(x);
x -= lr * g;
if (fabs(g) < tol) {
printf("Converged at iter %d\n", t + 1);
break;
}
}
return x;
}
/* Multi-dimensional: operates on arrays */
void gradient_descent_nd(
void (*grad_f)(const double *, double *, int),
double *x, int n, double lr, int n_iter, double tol)
{
double g[n];
for (int t = 0; t < n_iter; t++) {
grad_f(x, g, n);
double norm = 0;
for (int i = 0; i < n; i++) { x[i] -= lr * g[i]; norm += g[i]*g[i]; }
if (sqrt(norm) < tol) { printf("Converged at iter %d\n", t+1); break; }
}
}
/* L(theta) = theta^2 - 2*theta + 3 */
double L1d(double t) { return t*t - 2*t + 3; }
double grad_L1d(double t) { return 2*t - 2; }
/* L(x) = x[0]^2 + 10*x[1]^2 */
void grad_bowl(const double *x, double *g, int n) {
g[0] = 2*x[0]; g[1] = 20*x[1];
}
int main(void) {
double theta = gradient_descent_1d(L1d, grad_L1d, 5.0, 0.3, 100, 1e-7);
printf("1D minimum: theta = %.6f (true: 1.0)\n", theta);
double x[2] = {3.0, 2.0};
gradient_descent_nd(grad_bowl, x, 2, 0.05, 200, 1e-7);
printf("2D minimum: [%.6f, %.6f]\n", x[0], x[1]);
return 0;
}

#include <cmath>
#include <vector>
#include <functional>
#include <iostream>
#include <numeric>
using Vec = std::vector<double>;
Vec gradientDescent(
std::function<Vec(const Vec&)> gradF,
Vec x, double lr = 0.1, int nIter = 200, double tol = 1e-7)
{
for (int t = 0; t < nIter; t++) {
Vec g = gradF(x);
double norm = 0;
for (size_t i = 0; i < x.size(); i++) {
x[i] -= lr * g[i];
norm += g[i] * g[i];
}
if (std::sqrt(norm) < tol) {
std::cout << "Converged at iter " << t + 1 << "\n";
break;
}
}
return x;
}
int main() {
// 1D: L(theta) = theta^2 - 2*theta + 3
auto theta = gradientDescent(
[](const Vec& x) { return Vec{2*x[0] - 2}; },
{5.0}, 0.3, 100);
std::cout << "1D minimum: " << theta[0] << " (true: 1.0)\n";
// 2D: L(x) = x1^2 + 10*x2^2
auto x2d = gradientDescent(
[](const Vec& x) { return Vec{2*x[0], 20*x[1]}; },
{3.0, 2.0}, 0.05, 300);
std::cout << "2D minimum: [" << x2d[0] << ", " << x2d[1] << "]\n";
}

import java.util.function.Function;
public class GradientDescent {
@FunctionalInterface
interface VecFunc { double[] apply(double[] x); }
static double[] gradientDescent(
VecFunc gradF, double[] x0,
double lr, int nIter, double tol) {
double[] x = x0.clone();
for (int t = 0; t < nIter; t++) {
double[] g = gradF.apply(x);
double norm = 0;
for (int i = 0; i < x.length; i++) {
x[i] -= lr * g[i];
norm += g[i] * g[i];
}
if (Math.sqrt(norm) < tol) {
System.out.println("Converged at iter " + (t + 1));
break;
}
}
return x;
}
public static void main(String[] args) {
// 1D: L(theta) = theta^2 - 2*theta + 3
double[] theta = gradientDescent(
x -> new double[]{2 * x[0] - 2},
new double[]{5.0}, 0.3, 100, 1e-7);
System.out.printf("1D minimum: %.6f (true: 1.0)%n", theta[0]);
// 2D: L(x) = x1^2 + 10*x2^2
double[] x2d = gradientDescent(
x -> new double[]{2 * x[0], 20 * x[1]},
new double[]{3.0, 2.0}, 0.05, 300, 1e-7);
System.out.printf("2D minimum: [%.6f, %.6f]%n", x2d[0], x2d[1]);
}
}

package main
import (
"fmt"
"math"
)
func gradientDescent(
gradF func([]float64) []float64,
x0 []float64, lr float64, nIter int, tol float64) []float64 {
x := make([]float64, len(x0))
copy(x, x0)
g := make([]float64, len(x0))
for t := 0; t < nIter; t++ {
g = gradF(x)
norm := 0.0
for i := range x {
x[i] -= lr * g[i]
norm += g[i] * g[i]
}
if math.Sqrt(norm) < tol {
fmt.Printf("Converged at iter %d\n", t+1)
break
}
}
return x
}
func main() {
// 1D: L(theta) = theta^2 - 2*theta + 3
theta := gradientDescent(
func(x []float64) []float64 { return []float64{2*x[0] - 2} },
[]float64{5.0}, 0.3, 100, 1e-7)
fmt.Printf("1D minimum: %.6f (true: 1.0)\n", theta[0])
// 2D: L(x) = x1^2 + 10*x2^2
x2d := gradientDescent(
func(x []float64) []float64 { return []float64{2 * x[0], 20 * x[1]} },
[]float64{3.0, 2.0}, 0.05, 300, 1e-7)
fmt.Printf("2D minimum: %v\n", x2d)
}

4. Descent on a 1D curve
Let's start with the simplest possible case: minimizing a function of one variable. We'll use $L(\theta) = \theta^2 - 2\theta + 3$, which is a parabola with minimum at $\theta^* = 1$, $L(1) = 2$. The gradient is $L'(\theta) = 2\theta - 2$.
Starting from $\theta_0 = 5$ with $\eta = 0.3$, each update is $\theta_{t+1} = \theta_t - 0.3\,(2\theta_t - 2) = 0.4\,\theta_t + 0.6$:

$\theta_0 = 5,\quad \theta_1 = 2.6,\quad \theta_2 = 1.64,\quad \theta_3 = 1.256,\quad \theta_4 = 1.1024,\ \ldots$
You can already see the iterates approaching 1 but slowing as they get closer: each step shrinks the remaining distance to the minimum by the same constant factor, the classic geometric-convergence pattern around a quadratic minimum. Try it yourself below.
Interactive: 1D gradient descent
Step through gradient descent on $L(\theta) = \theta^2 - 2\theta + 3$. Change the starting point and learning rate to watch convergence, slow dragging, bouncing around the minimum, or even divergence when $\eta$ is too large.
The pink dot is the current iterate $\theta_t$. The yellow line is the tangent at that point — its slope is $L'(\theta_t)$. Each step moves the dot by $-\eta L'(\theta_t)$.
Press Step → to advance one iteration.
5. The learning rate is everything
Play with the demo above. You will observe four regimes depending on $\eta$:
η too small
Updates are tiny. Progress is slow. The iterate crawls toward the minimum. You will run out of patience (and compute budget) before converging.
η just right
Updates are proportional to the distance from the minimum. Convergence is roughly exponential — each step shrinks the remaining distance by a constant factor. This is what you want.
η too large
The iterate overshoots the minimum, lands on the other side, overshoots again, and oscillates. For moderate overshoot, it still converges but with a characteristic zigzag. For large overshoot, it diverges — the iterates move further from the minimum each step.
η genuinely too large
The loss increases after every step and eventually blows up to infinity (NaN in floating point). You will see this as training loss exploding in practice, usually with a spike on the first few steps.
For a quadratic $L(\theta) = \frac{1}{2} \lambda \theta^2$ with $\lambda > 0$, gradient descent converges if and only if $\eta < 2/\lambda$, and converges fastest at $\eta = 1/\lambda$. That factor of $\lambda$ — the second derivative, the "curvature" — is what sets the right scale for $\eta$.
Take $L(\theta) = \theta^2$ (so $\lambda = 2$), starting at $\theta_0 = 1$ with $\eta = 1.5$ (which violates the bound $\eta < 2/\lambda = 1$).
$\theta_1 = 1 - 1.5 \cdot 2 = -2$
$\theta_2 = -2 - 1.5 \cdot (-4) = 4$
$\theta_3 = 4 - 1.5 \cdot 8 = -8$
$\theta_4 = -8 - 1.5 \cdot (-16) = 16$
The iterates oscillate in sign and double in magnitude each step. $L$ goes 1 → 4 → 16 → 64 → 256 → ∞. Classic divergence.
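The divergence above takes only a few lines to reproduce, if you want to watch the blow-up yourself (a throwaway sketch):

```python
theta, lr = 1.0, 1.5        # eta = 1.5 violates eta < 2/lambda = 1
for t in range(5):
    loss = theta ** 2       # L(theta) = theta^2
    print(t, theta, loss)   # loss: 1, 4, 16, 64, 256
    theta -= lr * 2 * theta # gradient of theta^2 is 2*theta
```

Each update multiplies $\theta$ by $(1 - \eta \lambda) = -2$, which is exactly the sign-flip-and-double pattern in the table above.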
6. Descent on a 2D surface
Neural networks have millions of parameters, not one. The picture that carries over from 1D to higher dimensions is a loss landscape: a hypersurface in $\mathbb{R}^{n+1}$ whose height at each point is $L(\boldsymbol{\theta})$. We can visualize this directly only for $n \le 2$, but the intuition holds.
The classic 2D test function is the stretched bowl (elliptical paraboloid):

$$L(\theta_1, \theta_2) = a \theta_1^2 + b \theta_2^2, \qquad a, b > 0$$
The stretched-bowl loss, unpacked
- $\theta_1, \theta_2$
- The two parameters we're optimizing over. In a real network these would be two among millions of weights.
- $a, b$
- Curvature coefficients along each axis. Both positive (so the bowl opens upward). When $a = b$, the bowl is rotationally symmetric; when $a \ne b$, one axis is steeper than the other.
- $\kappa = \max(a,b)/\min(a,b)$
- The condition number — how lopsided the bowl is. $\kappa = 1$ is a circular bowl, easy; $\kappa = 100$ is a long thin canyon, painful. In real neural nets $\kappa$ can exceed $10^6$.
- $\nabla L = (2a\theta_1,\ 2b\theta_2)$
- The gradient. Each component depends only on its own coordinate and the matching curvature. If $b \gg a$, a small step in $\theta_2$ is a large step in loss — you have to move gingerly on that axis, but that forces the entire step size to be small on the $\theta_1$ axis too.
Analogy Imagine a steep, narrow canyon with a gentle slope running along its length toward the exit. Rolling a ball down it: gravity along the canyon (the $\theta_1$ direction) barely moves the ball; gravity across it (the $\theta_2$ direction) sends it ricocheting off the canyon walls. That ricochet is the zigzag. Momentum damps it by averaging the oscillations out.
When $a = b$, level sets are circles and gradient descent takes a straight line to the minimum. When $a \ne b$ the level sets are elongated ellipses, and gradient descent develops the infamous zigzag: its direction alternates between bouncing across the narrow axis (steep gradient) and crawling along the long axis (shallow gradient). The condition number $\kappa = \max(a,b) / \min(a,b)$ measures how bad this is.
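The zigzag penalty can be measured directly. A standard result for quadratics is that the best fixed step size is $\eta^* = 2/(\lambda_{\min} + \lambda_{\max})$, where the Hessian eigenvalues here are $2a$ and $2b$, giving $\eta^* = 1/(a+b)$. This small experiment (our own sketch, not from the demo) counts iterations to convergence as the bowl gets more lopsided:

```python
import numpy as np

def steps_to_converge(a, b, tol=1e-6, max_iter=10_000):
    # Best fixed step for L = a*x1^2 + b*x2^2:
    # Hessian eigenvalues are 2a and 2b, so eta* = 2/(2a + 2b).
    eta = 1.0 / (a + b)
    theta = np.array([1.0, 1.0])
    for t in range(1, max_iter + 1):
        g = np.array([2 * a * theta[0], 2 * b * theta[1]])
        theta = theta - eta * g
        if np.linalg.norm(theta) < tol:
            return t
    return max_iter

print(steps_to_converge(1, 1))    # kappa = 1: circular bowl, fast
print(steps_to_converge(1, 10))   # kappa = 10: noticeably slower
print(steps_to_converge(1, 100))  # kappa = 100: painful
```

The iteration count grows roughly linearly with the condition number $\kappa$, which is exactly why ill-conditioned landscapes are painful for plain gradient descent.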
3D view: what the "zigzag" actually looks like
Below is an isometric render of the same loss surface, with a live gradient-descent trajectory traced across it. The contour lines on the floor are the level sets you'd see in a 2D top-down view — the trajectory above hovers on the surface itself. Press Step → to descend one iteration at a time and watch the path cling to the bowl.
Isometric projection of the loss bowl. The mesh shows the surface; the lines on the base plane are contour rings. The pink trajectory is a live gradient-descent path.
Interactive: 2D gradient descent with condition number
Explore descent on $L(\theta_1, \theta_2) = a \theta_1^2 + b \theta_2^2$. The contour plot shows level sets; arrows trace the trajectory. Crank up the ratio $b/a$ to see zigzagging; try optionally enabling momentum to see it help.
Blue ellipses are level sets of $L$. The pink dot starts at $(\theta_1^{(0)}, \theta_2^{(0)})$. Each step moves in the $-\nabla L$ direction.
With $a=1, b=5$: the cost in the $\theta_2$ direction is much steeper than in $\theta_1$, so gradient descent zigzags across the narrow axis.
7. Batch, stochastic, and mini-batch
So far we have pretended that $L(\boldsymbol{\theta})$ is a single function. In practice it is a sum over training examples:

$$L(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \ell_i(\boldsymbol{\theta})$$
where $\ell_i$ is the loss on the $i$-th training example and $m$ is the dataset size. The gradient splits the same way: $\nabla L = \frac{1}{m} \sum_i \nabla \ell_i$.
Batch gradient descent
Compute the full gradient over all $m$ examples, then take a step. High quality gradients; every step is deterministic. But a single update requires looking at the whole dataset. Too expensive for big data — you may wait an hour for one step.
Stochastic gradient descent (SGD)
Pick one example at random, compute $\nabla \ell_i$, take a step. $m$ updates per epoch. Very cheap per step and surprisingly effective, but the gradients are noisy — the trajectory wanders instead of going in a clean straight line.
Mini-batch SGD
The compromise everyone actually uses. Pick a small batch of $B$ examples (32, 64, 256, 1024…), average their gradients, take a step. Vectorizes beautifully on a GPU; gradients are smoother than SGD but still cheap per step.
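None of the code samples in this article actually subsamples data, so here is a minimal mini-batch SGD sketch on least-squares linear regression. The dataset, batch size, and hyperparameters are illustrative, not canonical:

```python
import numpy as np

rng = np.random.default_rng(0)
m, B = 1000, 32                        # dataset size, batch size
X = rng.normal(size=(m, 2))
w_true = np.array([3.0, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=m)

w = np.zeros(2)
lr = 0.05
for epoch in range(20):
    idx = rng.permutation(m)           # reshuffle each epoch
    for start in range(0, m, B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean-squared error on the mini-batch
        g = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * g

print(w)   # close to [3, -2]
```

Each epoch makes $m/B \approx 31$ noisy updates instead of one exact one, which is the whole trade the section describes.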
SGD's noisy gradients are not just a tolerated compromise — they are part of why deep learning works. The noise acts as a regularizer that prevents the optimizer from fitting sharp, fragile minima of the training loss. Flat minima generalize better, and SGD's stochasticity nudges it toward them.
8. Momentum
Pure gradient descent has no memory: each step depends only on the gradient at the current point. Momentum adds a second term — a running average of past gradients — that accelerates steady directions and damps oscillations:

$$\boldsymbol{v}_t = \beta \boldsymbol{v}_{t-1} + \nabla L(\boldsymbol{\theta}_{t-1}), \qquad \boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta \boldsymbol{v}_t$$
The scalar $\beta \in [0, 1)$ is the momentum coefficient. Typical values: $\beta = 0.9$ or $0.99$.
Physical intuition: imagine a ball rolling down the loss landscape. The velocity $\boldsymbol{v}$ accumulates over time — if the gradient keeps pointing in the same direction, the ball speeds up. In directions where the gradient oscillates (like the narrow axis of an elongated bowl), the oscillations partially cancel in the running average, so the ball doesn't bounce. The effect is dramatic: momentum can turn a 1000-step zigzag into a 100-step curving glide.
Unrolling the recursion: $\boldsymbol{v}_t = \sum_{k=0}^{t-1} \beta^k \nabla L(\boldsymbol{\theta}_{t-1-k})$.
$\boldsymbol{v}_t$ is an exponentially-weighted sum of all past gradients. The effective memory length is roughly $1/(1-\beta)$: at $\beta = 0.9$, the last ~10 gradients contribute meaningfully; at $\beta = 0.99$, the last ~100.
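The "$\sim 1/(1-\beta)$ gradients" rule of thumb follows from the geometric weights $\beta^k$; a two-line check (our own, using the closed form of the geometric series) shows how much of the total weight the most recent $1/(1-\beta)$ terms carry:

```python
beta = 0.9
n = round(1 / (1 - beta))                  # ~10 for beta = 0.9
recent = sum(beta ** k for k in range(n))  # weight on last n gradients
total = 1 / (1 - beta)                     # full geometric-series limit
print(recent / total)                      # ~0.65: most of the mass
```

So "the last ~10 gradients contribute meaningfully" means roughly two thirds of the total weight, with the remaining third spread thinly over older history.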
Try setting momentum to 0.9 in the 2D demo above. The trail should straighten out dramatically.
Source — Momentum (SGD with Momentum)
function sgd_momentum(grad_f, x0, lr, beta, n_iter):
x = x0
v = zeros_like(x) # velocity initialized to zero
for t in 1..n_iter:
g = grad_f(x)
v = beta * v + g # accumulate velocity
x = x - lr * v # update parameters
return x
# beta = 0: vanilla SGD
# beta = 0.9: typical momentum (last ~10 steps contribute)
# beta = 0.99: strong momentum (last ~100 steps contribute)

import numpy as np
def sgd_momentum(grad_f, x0, lr=0.01, beta=0.9, n_iter=200, tol=1e-7):
"""
SGD with momentum.
grad_f: callable(x) -> gradient
beta: momentum coefficient (0 = plain SGD, typical 0.9)
"""
x = np.array(x0, dtype=float)
v = np.zeros_like(x)
for t in range(n_iter):
g = grad_f(x)
v = beta * v + g # update velocity
x = x - lr * v # update parameters
if np.linalg.norm(g) < tol:
print(f"Converged at iter {t+1}")
break
return x
# 2D stretched bowl: L(x) = x1^2 + 10*x2^2
def grad_bowl(x): return np.array([2*x[0], 20*x[1]])
print("No momentum:")
x1 = sgd_momentum(grad_bowl, [3.0, 2.0], lr=0.05, beta=0.0, n_iter=500)
print(f" Result: {x1}")
print("Momentum=0.9:")
x2 = sgd_momentum(grad_bowl, [3.0, 2.0], lr=0.05, beta=0.9, n_iter=500)
print(f" Result: {x2}")

function sgdMomentum(gradF, x0, lr = 0.01, beta = 0.9, nIter = 200, tol = 1e-7) {
let x = [...x0];
let v = new Array(x0.length).fill(0);
for (let t = 0; t < nIter; t++) {
const g = gradF(x);
const norm = Math.sqrt(g.reduce((s, gi) => s + gi * gi, 0));
if (norm < tol) { console.log(`Converged at iter ${t+1}`); break; }
v = v.map((vi, i) => beta * vi + g[i]); // update velocity
x = x.map((xi, i) => xi - lr * v[i]); // update params
}
return x;
}
// 2D stretched bowl: L(x) = x1^2 + 10*x2^2
const gradBowl = ([x1, x2]) => [2*x1, 20*x2];
console.log("No momentum:",
sgdMomentum(gradBowl, [3, 2], 0.05, 0.0, 500).map(v => v.toFixed(5)));
console.log("Momentum=0.9:",
sgdMomentum(gradBowl, [3, 2], 0.05, 0.9, 500).map(v => v.toFixed(5)));

#include <stdio.h>
#include <math.h>
#include <string.h>
#define NDIM 2
void sgd_momentum(
void (*grad_f)(const double *, double *, int),
double *x, int n, double lr, double beta,
int n_iter, double tol)
{
double v[NDIM], g[NDIM];
memset(v, 0, sizeof(double) * n);
for (int t = 0; t < n_iter; t++) {
grad_f(x, g, n);
double norm = 0;
for (int i = 0; i < n; i++) {
v[i] = beta * v[i] + g[i];
x[i] -= lr * v[i];
norm += g[i] * g[i];
}
if (sqrt(norm) < tol) {
printf("Converged at iter %d\n", t + 1);
break;
}
}
}
void grad_bowl(const double *x, double *g, int n) {
g[0] = 2 * x[0];
g[1] = 20 * x[1];
}
int main(void) {
double x[NDIM] = {3.0, 2.0};
sgd_momentum(grad_bowl, x, NDIM, 0.05, 0.9, 500, 1e-7);
printf("Minimum: [%.6f, %.6f]\n", x[0], x[1]);
return 0;
}

#include <cmath>
#include <vector>
#include <functional>
#include <iostream>
using Vec = std::vector<double>;
Vec sgdMomentum(
std::function<Vec(const Vec&)> gradF,
Vec x, double lr = 0.01, double beta = 0.9,
int nIter = 200, double tol = 1e-7)
{
Vec v(x.size(), 0.0);
for (int t = 0; t < nIter; t++) {
Vec g = gradF(x);
double norm = 0;
for (size_t i = 0; i < x.size(); i++) {
v[i] = beta * v[i] + g[i];
x[i] -= lr * v[i];
norm += g[i] * g[i];
}
if (std::sqrt(norm) < tol) {
std::cout << "Converged at iter " << t + 1 << "\n";
break;
}
}
return x;
}
int main() {
auto gradBowl = [](const Vec& x) { return Vec{2*x[0], 20*x[1]}; };
auto x1 = sgdMomentum(gradBowl, {3.0, 2.0}, 0.05, 0.0, 500);
std::cout << "No momentum: [" << x1[0] << ", " << x1[1] << "]\n";
auto x2 = sgdMomentum(gradBowl, {3.0, 2.0}, 0.05, 0.9, 500);
std::cout << "Momentum=0.9: [" << x2[0] << ", " << x2[1] << "]\n";
}

import java.util.Arrays;
public class SGDMomentum {
@FunctionalInterface
interface VecFunc { double[] apply(double[] x); }
static double[] sgdMomentum(
VecFunc gradF, double[] x0,
double lr, double beta, int nIter, double tol) {
double[] x = x0.clone();
double[] v = new double[x0.length]; // initialized to 0
for (int t = 0; t < nIter; t++) {
double[] g = gradF.apply(x);
double norm = 0;
for (int i = 0; i < x.length; i++) {
v[i] = beta * v[i] + g[i];
x[i] -= lr * v[i];
norm += g[i] * g[i];
}
if (Math.sqrt(norm) < tol) {
System.out.println("Converged at iter " + (t + 1));
break;
}
}
return x;
}
public static void main(String[] args) {
double[] x1 = sgdMomentum(
x -> new double[]{2*x[0], 20*x[1]},
new double[]{3.0, 2.0}, 0.05, 0.9, 500, 1e-7);
System.out.printf("Minimum: [%.6f, %.6f]%n", x1[0], x1[1]);
}
}

package main
import (
"fmt"
"math"
)
func sgdMomentum(
gradF func([]float64) []float64,
x0 []float64, lr, beta float64,
nIter int, tol float64) []float64 {
x := make([]float64, len(x0))
copy(x, x0)
v := make([]float64, len(x0)) // velocity = 0
for t := 0; t < nIter; t++ {
g := gradF(x)
norm := 0.0
for i := range x {
v[i] = beta*v[i] + g[i]
x[i] -= lr * v[i]
norm += g[i] * g[i]
}
if math.Sqrt(norm) < tol {
fmt.Printf("Converged at iter %d\n", t+1)
break
}
}
return x
}
func main() {
gradBowl := func(x []float64) []float64 {
return []float64{2 * x[0], 20 * x[1]}
}
x := sgdMomentum(gradBowl, []float64{3.0, 2.0}, 0.05, 0.9, 500, 1e-7)
fmt.Printf("Minimum: %v\n", x)
}

9. Adaptive methods: AdaGrad, RMSProp, Adam
Momentum fixes zigzagging but keeps a single global learning rate. Adaptive methods go further: they maintain a per-parameter learning rate, so rarely-updated parameters get big updates and frequently-updated ones get small ones.
AdaGrad (Duchi et al., 2011)
Accumulate squared gradients per parameter; divide the update by the square root of the accumulation (all operations elementwise, with $g_t = \nabla L(\boldsymbol{\theta}_t)$):

$$G_t = G_{t-1} + g_t^2, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \, g_t$$
Great for sparse features (NLP with bag-of-words) because parameters that are rarely touched keep large learning rates. Weakness: $G_t$ grows without bound, so the effective learning rate decays to zero over time.
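AdaGrad is the one optimizer in this section without a code sample. The difference from RMSProp below is a single line: the accumulator is a lifetime sum rather than a moving average. A sketch in the same style as the other implementations:

```python
import numpy as np

def adagrad(grad_f, x0, lr=0.5, eps=1e-8, n_iter=300, tol=1e-7):
    """AdaGrad: per-parameter rates from lifetime squared gradients."""
    x = np.array(x0, dtype=float)
    G = np.zeros_like(x)                 # grows without bound
    for t in range(n_iter):
        g = grad_f(x)
        G += g ** 2                      # sum, not a moving average
        x = x - lr / np.sqrt(G + eps) * g
        if np.linalg.norm(g) < tol:
            break
    return x

# 2D stretched bowl: L(x) = x1^2 + 10*x2^2
def grad_bowl(x): return np.array([2 * x[0], 20 * x[1]])
print(adagrad(adagrad.__defaults__ and grad_bowl or grad_bowl, [3.0, 2.0]))  # heads toward [0, 0]
```

Note the larger base learning rate (0.5 here, an illustrative choice): because the update is divided by $\sqrt{G_t}$, AdaGrad tolerates, and usually needs, a bigger $\eta$ than plain gradient descent.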
RMSProp (Hinton, unpublished 2012)
Fixes AdaGrad's decay problem by replacing the sum with an exponential moving average:

$$G_t = \rho G_{t-1} + (1 - \rho) g_t^2$$
Now $G_t$ represents recent squared gradients, not lifetime totals. The learning rate no longer monotonically shrinks.
Adam (Kingma & Ba, 2014)
Combines momentum (first moment) and RMSProp (second moment), plus a bias correction to account for initialization at zero:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \, \hat{m}_t$$
Typical defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. Adam is the default optimizer for almost everything in deep learning; it "just works" on a wide range of problems without much tuning.
SGD: simplest, noisiest, often best generalization with enough tuning.
SGD + Momentum: SGD but with memory. The go-to for vision and most "big" training runs.
AdaGrad: per-parameter rates, but decays too aggressively.
RMSProp: fixes AdaGrad's decay. Common in RL and RNN training.
Adam: momentum + RMSProp + bias correction. The safe default.
AdamW: Adam with decoupled weight decay. The actual modern default for transformers.
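AdamW appears in the list above but is not implemented anywhere in this section. The only change from Adam is that weight decay is applied to the parameters directly instead of being folded into the gradient (Loshchilov & Hutter, 2017). A hedged sketch, with illustrative hyperparameters:

```python
import numpy as np

def adamw(grad_f, x0, lr=0.1, beta1=0.9, beta2=0.999,
          eps=1e-8, weight_decay=1e-2, n_iter=1000):
    """Adam with decoupled weight decay (AdamW)."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)   # first moment
    v = np.zeros_like(x)   # second moment
    for t in range(1, n_iter + 1):
        g = grad_f(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Decoupled decay: shrink weights directly, not via the gradient
        x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x

# 2D stretched bowl: L(x) = x1^2 + 10*x2^2
def grad_bowl(x): return np.array([2 * x[0], 20 * x[1]])
print(adamw(grad_bowl, [3.0, 2.0]))  # near [0, 0]
```

Folding the decay into the gradient (plain Adam with L2 regularization) would divide the decay by $\sqrt{\hat{v}_t}$ too, weakening it exactly where gradients are large; decoupling keeps the decay uniform, which is why AdamW generalizes better on transformers.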
Source — RMSProp
function rmsprop(grad_f, x0, lr, rho, eps, n_iter):
x = x0
G = zeros_like(x) # accumulated squared gradients
for t in 1..n_iter:
g = grad_f(x)
G = rho * G + (1 - rho) * g^2 # exponential moving average
x = x - lr / sqrt(G + eps) * g # adaptive step
return x
# rho: decay rate, typically 0.9
# eps: small constant for numerical stability, typically 1e-8
# Adaptive learning rate per parameter:
# large G_i => small effective lr for param i
# small G_i => large effective lr for param i

import numpy as np
def rmsprop(grad_f, x0, lr=0.01, rho=0.9, eps=1e-8, n_iter=200, tol=1e-7):
"""
RMSProp optimizer.
grad_f: callable(x) -> gradient
rho: decay rate for squared-gradient accumulation (default 0.9)
eps: numerical stability constant (default 1e-8)
"""
x = np.array(x0, dtype=float)
G = np.zeros_like(x) # accumulated squared gradients
for t in range(n_iter):
g = grad_f(x)
G = rho * G + (1 - rho) * g ** 2
x = x - lr / np.sqrt(G + eps) * g
if np.linalg.norm(g) < tol:
print(f"Converged at iter {t+1}")
break
return x
# 2D stretched bowl: L(x) = x1^2 + 10*x2^2
def grad_bowl(x): return np.array([2*x[0], 20*x[1]])
x_star = rmsprop(grad_bowl, [3.0, 2.0], lr=0.1, rho=0.9, n_iter=300)
print(f"Minimum: {x_star}") # should be near [0, 0]

function rmsprop(gradF, x0, lr = 0.01, rho = 0.9, eps = 1e-8, nIter = 200, tol = 1e-7) {
let x = [...x0];
let G = new Array(x0.length).fill(0); // accumulated squared gradients
for (let t = 0; t < nIter; t++) {
const g = gradF(x);
const norm = Math.sqrt(g.reduce((s, gi) => s + gi * gi, 0));
if (norm < tol) { console.log(`Converged at iter ${t+1}`); break; }
G = G.map((gi, i) => rho * gi + (1 - rho) * g[i] * g[i]);
x = x.map((xi, i) => xi - (lr / Math.sqrt(G[i] + eps)) * g[i]);
}
return x;
}
// 2D stretched bowl
const gradBowl = ([x1, x2]) => [2*x1, 20*x2];
const result = rmsprop(gradBowl, [3, 2], 0.1, 0.9, 1e-8, 300);
console.log("Minimum:", result.map(v => v.toFixed(6)));

#include <stdio.h>
#include <math.h>
#include <string.h>
#define NDIM 2
void rmsprop(
void (*grad_f)(const double *, double *, int),
double *x, int n,
double lr, double rho, double eps,
int n_iter, double tol)
{
double G[NDIM], g[NDIM];
memset(G, 0, sizeof(double) * n);
for (int t = 0; t < n_iter; t++) {
grad_f(x, g, n);
double norm = 0;
for (int i = 0; i < n; i++) {
G[i] = rho * G[i] + (1.0 - rho) * g[i] * g[i];
x[i] -= (lr / sqrt(G[i] + eps)) * g[i];
norm += g[i] * g[i];
}
if (sqrt(norm) < tol) {
printf("Converged at iter %d\n", t + 1);
break;
}
}
}
void grad_bowl(const double *x, double *g, int n) {
g[0] = 2 * x[0];
g[1] = 20 * x[1];
}
int main(void) {
double x[NDIM] = {3.0, 2.0};
rmsprop(grad_bowl, x, NDIM, 0.1, 0.9, 1e-8, 300, 1e-7);
printf("Minimum: [%.6f, %.6f]\n", x[0], x[1]);
return 0;
}

#include <cmath>
#include <vector>
#include <functional>
#include <iostream>
using Vec = std::vector<double>;
Vec rmsprop(
std::function<Vec(const Vec&)> gradF,
Vec x, double lr = 0.01, double rho = 0.9, double eps = 1e-8,
int nIter = 200, double tol = 1e-7)
{
Vec G(x.size(), 0.0);
for (int t = 0; t < nIter; t++) {
Vec g = gradF(x);
double norm = 0;
for (size_t i = 0; i < x.size(); i++) {
G[i] = rho * G[i] + (1 - rho) * g[i] * g[i];
x[i] -= (lr / std::sqrt(G[i] + eps)) * g[i];
norm += g[i] * g[i];
}
if (std::sqrt(norm) < tol) {
std::cout << "Converged at iter " << t + 1 << "\n";
break;
}
}
return x;
}
int main() {
auto gradBowl = [](const Vec& x) { return Vec{2*x[0], 20*x[1]}; };
auto x = rmsprop(gradBowl, {3.0, 2.0}, 0.1, 0.9, 1e-8, 300);
std::cout << "Minimum: [" << x[0] << ", " << x[1] << "]\n";
}

import java.util.Arrays;
public class RMSProp {
@FunctionalInterface
interface VecFunc { double[] apply(double[] x); }
static double[] rmsprop(
VecFunc gradF, double[] x0,
double lr, double rho, double eps, int nIter, double tol) {
double[] x = x0.clone();
double[] G = new double[x0.length]; // all zeros
for (int t = 0; t < nIter; t++) {
double[] g = gradF.apply(x);
double norm = 0;
for (int i = 0; i < x.length; i++) {
G[i] = rho * G[i] + (1 - rho) * g[i] * g[i];
x[i] -= (lr / Math.sqrt(G[i] + eps)) * g[i];
norm += g[i] * g[i];
}
if (Math.sqrt(norm) < tol) {
System.out.println("Converged at iter " + (t + 1));
break;
}
}
return x;
}
public static void main(String[] args) {
double[] x = rmsprop(
x -> new double[]{2*x[0], 20*x[1]},
new double[]{3.0, 2.0},
0.1, 0.9, 1e-8, 300, 1e-7);
System.out.printf("Minimum: [%.6f, %.6f]%n", x[0], x[1]);
}
}

package main
import (
"fmt"
"math"
)
func rmsprop(
gradF func([]float64) []float64,
x0 []float64, lr, rho, eps float64,
nIter int, tol float64) []float64 {
x := make([]float64, len(x0))
copy(x, x0)
G := make([]float64, len(x0))
for t := 0; t < nIter; t++ {
g := gradF(x)
norm := 0.0
for i := range x {
G[i] = rho*G[i] + (1-rho)*g[i]*g[i]
x[i] -= (lr / math.Sqrt(G[i]+eps)) * g[i]
norm += g[i] * g[i]
}
if math.Sqrt(norm) < tol {
fmt.Printf("Converged at iter %d\n", t+1)
break
}
}
return x
}
func main() {
gradBowl := func(x []float64) []float64 {
return []float64{2 * x[0], 20 * x[1]}
}
x := rmsprop(gradBowl, []float64{3.0, 2.0}, 0.1, 0.9, 1e-8, 300, 1e-7)
fmt.Printf("Minimum: %v\n", x)
}

Source — Adam Optimizer
function adam(grad_f, x0, lr, beta1, beta2, eps, n_iter):
x = x0
m = zeros_like(x) # first moment (momentum)
v = zeros_like(x) # second moment (RMSProp-like)
for t in 1..n_iter:
g = grad_f(x)
m = beta1 * m + (1 - beta1) * g # update biased 1st moment
v = beta2 * v + (1 - beta2) * g^2 # update biased 2nd moment
m_hat = m / (1 - beta1^t) # bias correction
v_hat = v / (1 - beta2^t) # bias correction
x = x - lr / (sqrt(v_hat) + eps) * m_hat
return x
# Typical defaults: beta1=0.9, beta2=0.999, eps=1e-8
# The bias corrections matter most in the first few steps when
# m and v are still close to zero (initialized at zero).

import numpy as np
def adam(grad_f, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
n_iter=1000, tol=1e-7):
"""
Adam optimizer (Kingma & Ba, 2014).
grad_f: callable(x) -> gradient
beta1: first-moment decay (default 0.9)
beta2: second-moment decay (default 0.999)
eps: numerical stability (default 1e-8)
"""
x = np.array(x0, dtype=float)
m = np.zeros_like(x) # first moment
v = np.zeros_like(x) # second moment
for t in range(1, n_iter + 1):
g = grad_f(x)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** t) # bias correction
v_hat = v / (1 - beta2 ** t) # bias correction
x = x - lr / (np.sqrt(v_hat) + eps) * m_hat
if np.linalg.norm(g) < tol:
print(f"Converged at iter {t}")
break
return x
# 2D stretched bowl: L(x) = x1^2 + 10*x2^2
def grad_bowl(x): return np.array([2*x[0], 20*x[1]])
x_star = adam(grad_bowl, [3.0, 2.0], lr=0.1, n_iter=500)
print(f"Minimum: {x_star}") # should be near [0, 0]

function adam(gradF, x0,
lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8,
nIter = 1000, tol = 1e-7)
{
let x = [...x0];
let m = new Array(x0.length).fill(0); // first moment
let v = new Array(x0.length).fill(0); // second moment
for (let t = 1; t <= nIter; t++) {
const g = gradF(x);
const norm = Math.sqrt(g.reduce((s, gi) => s + gi * gi, 0));
if (norm < tol) { console.log(`Converged at iter ${t}`); break; }
m = m.map((mi, i) => beta1 * mi + (1 - beta1) * g[i]);
v = v.map((vi, i) => beta2 * vi + (1 - beta2) * g[i] * g[i]);
const mHat = m.map(mi => mi / (1 - beta1 ** t));
const vHat = v.map(vi => vi / (1 - beta2 ** t));
x = x.map((xi, i) => xi - (lr / (Math.sqrt(vHat[i]) + eps)) * mHat[i]);
}
return x;
}
const gradBowl = ([x1, x2]) => [2*x1, 20*x2];
const result = adam(gradBowl, [3, 2], 0.1, 0.9, 0.999, 1e-8, 500);
console.log("Minimum:", result.map(v => v.toFixed(6)));
C
#include <stdio.h>
#include <math.h>
#include <string.h>
#define NDIM 2
void adam(
void (*grad_f)(const double *, double *, int),
double *x, int n,
double lr, double beta1, double beta2, double eps,
int n_iter, double tol)
{
double m[NDIM], v[NDIM], g[NDIM];
memset(m, 0, sizeof(double) * n);
memset(v, 0, sizeof(double) * n);
for (int t = 1; t <= n_iter; t++) {
grad_f(x, g, n);
double norm = 0, b1t = pow(beta1, t), b2t = pow(beta2, t);
for (int i = 0; i < n; i++) {
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i];
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i];
double m_hat = m[i] / (1.0 - b1t);
double v_hat = v[i] / (1.0 - b2t);
x[i] -= lr / (sqrt(v_hat) + eps) * m_hat;
norm += g[i] * g[i];
}
if (sqrt(norm) < tol) {
printf("Converged at iter %d\n", t);
break;
}
}
}
void grad_bowl(const double *x, double *g, int n) {
g[0] = 2 * x[0];
g[1] = 20 * x[1];
}
int main(void) {
double x[NDIM] = {3.0, 2.0};
adam(grad_bowl, x, NDIM, 0.1, 0.9, 0.999, 1e-8, 500, 1e-7);
printf("Minimum: [%.6f, %.6f]\n", x[0], x[1]);
return 0;
}
C++
#include <cmath>
#include <vector>
#include <functional>
#include <iostream>
using Vec = std::vector<double>;
Vec adam(
std::function<Vec(const Vec&)> gradF,
Vec x, double lr = 1e-3,
double beta1 = 0.9, double beta2 = 0.999, double eps = 1e-8,
int nIter = 1000, double tol = 1e-7)
{
Vec m(x.size(), 0.0), v(x.size(), 0.0);
for (int t = 1; t <= nIter; t++) {
Vec g = gradF(x);
double norm = 0;
double b1t = std::pow(beta1, t), b2t = std::pow(beta2, t);
for (size_t i = 0; i < x.size(); i++) {
m[i] = beta1 * m[i] + (1 - beta1) * g[i];
v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
double mHat = m[i] / (1 - b1t);
double vHat = v[i] / (1 - b2t);
x[i] -= lr / (std::sqrt(vHat) + eps) * mHat;
norm += g[i] * g[i];
}
if (std::sqrt(norm) < tol) {
std::cout << "Converged at iter " << t << "\n";
break;
}
}
return x;
}
int main() {
auto gradBowl = [](const Vec& x) { return Vec{2*x[0], 20*x[1]}; };
auto x = adam(gradBowl, {3.0, 2.0}, 0.1, 0.9, 0.999, 1e-8, 500);
std::cout << "Minimum: [" << x[0] << ", " << x[1] << "]\n";
}
Java
public class Adam {
@FunctionalInterface
interface VecFunc { double[] apply(double[] x); }
static double[] adam(
VecFunc gradF, double[] x0,
double lr, double beta1, double beta2, double eps,
int nIter, double tol) {
double[] x = x0.clone();
double[] m = new double[x0.length]; // first moment
double[] v = new double[x0.length]; // second moment
for (int t = 1; t <= nIter; t++) {
double[] g = gradF.apply(x);
double norm = 0;
double b1t = Math.pow(beta1, t), b2t = Math.pow(beta2, t);
for (int i = 0; i < x.length; i++) {
m[i] = beta1 * m[i] + (1 - beta1) * g[i];
v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
double mHat = m[i] / (1 - b1t);
double vHat = v[i] / (1 - b2t);
x[i] -= lr / (Math.sqrt(vHat) + eps) * mHat;
norm += g[i] * g[i];
}
if (Math.sqrt(norm) < tol) {
System.out.println("Converged at iter " + t);
break;
}
}
return x;
}
public static void main(String[] args) {
double[] x = adam(
a -> new double[]{2*a[0], 20*a[1]},
new double[]{3.0, 2.0},
0.1, 0.9, 0.999, 1e-8, 500, 1e-7);
System.out.printf("Minimum: [%.6f, %.6f]%n", x[0], x[1]);
}
}
Go
package main
import (
"fmt"
"math"
)
func adamOptimizer(
gradF func([]float64) []float64,
x0 []float64,
lr, beta1, beta2, eps float64,
nIter int, tol float64) []float64 {
x := make([]float64, len(x0))
copy(x, x0)
m := make([]float64, len(x0))
v := make([]float64, len(x0))
for t := 1; t <= nIter; t++ {
g := gradF(x)
norm := 0.0
b1t := math.Pow(beta1, float64(t))
b2t := math.Pow(beta2, float64(t))
for i := range x {
m[i] = beta1*m[i] + (1-beta1)*g[i]
v[i] = beta2*v[i] + (1-beta2)*g[i]*g[i]
mHat := m[i] / (1 - b1t)
vHat := v[i] / (1 - b2t)
x[i] -= lr / (math.Sqrt(vHat) + eps) * mHat
norm += g[i] * g[i]
}
if math.Sqrt(norm) < tol {
fmt.Printf("Converged at iter %d\n", t)
break
}
}
return x
}
func main() {
gradBowl := func(x []float64) []float64 {
return []float64{2 * x[0], 20 * x[1]}
}
x := adamOptimizer(gradBowl, []float64{3.0, 2.0},
0.1, 0.9, 0.999, 1e-8, 500, 1e-7)
fmt.Printf("Minimum: %v\n", x)
}
10. Non-convex landscapes (the actual setting)
Neural network losses are highly non-convex. For any network with $H$ hidden units in a layer, there are at least $H!$ equivalent minima just from permuting those units (together with their incoming and outgoing weights) — the loss surface is riddled with symmetries. There are also saddle points, ridges, plateaus, and occasional narrow valleys.
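The permutation-symmetry claim is easy to check directly. A quick sketch (the toy one-hidden-layer network and weights below are made up for illustration): permuting the rows of the first weight matrix together with the matching columns of the second leaves the network's function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer network: y = W2 @ tanh(W1 @ x)
W1 = rng.normal(size=(4, 3))   # 4 hidden units, 3 inputs
W2 = rng.normal(size=(1, 4))   # 1 output

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

# Permute the hidden units: reorder rows of W1 and matching columns of W2.
perm = [2, 0, 3, 1]
W1p = W1[perm, :]
W2p = W2[:, perm]

x = rng.normal(size=3)
print(np.allclose(forward(W1, W2, x), forward(W1p, W2p, x)))  # True
```

Different weights, identical function: that is one distinct "minimum" per permutation, at no change in loss.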
Three empirical findings about deep networks that deserve to be surprising:
- Most "local minima" are nearly as good as the global minimum. In high dimensions, a random critical point is overwhelmingly likely to be a saddle, not a local minimum. And the local minima that exist tend to all have similar loss values.
- Overparameterization helps. Larger networks have smoother loss landscapes (fewer bad minima, more connected level sets). This is part of why "just make the model bigger" keeps working.
- Generalization does not require finding the global minimum. Training to an early stopping point, or a flat minimum, often generalizes better than grinding all the way down.
The mathematical picture of gradient descent as "roll the ball down the bowl until it reaches the bottom" is a caricature, but it's a useful one. Just remember that the bowl is $10^9$-dimensional, has $10^{400}$ local minima, and you are sampling the gradient from a noisy mini-batch. Despite all that, it works.
11. Summary
Gradient descent updates parameters by a scaled step against the gradient of the loss: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla L$. The learning rate $\eta$ controls everything — too small and nothing moves, too large and you diverge. In practice we use mini-batch stochastic gradient descent, often augmented with momentum (a running average of past gradients) or fully adaptive per-parameter learning rates (Adam, AdamW). The loss landscape of a neural network is extravagantly non-convex, but the optimizers find good-enough solutions anyway — a fact that remains one of the more miraculous features of modern deep learning.
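For completeness, the momentum variant mentioned above can be sketched in a few lines (heavy-ball form; hyperparameters here are illustrative, not tuned), on the same stretched bowl used in the Adam examples:

```python
import numpy as np

def sgd_momentum(grad_f, x0, lr=0.01, beta=0.9, n_iter=500):
    """Gradient descent with momentum: v accumulates a decaying
    running sum of past gradients (heavy-ball form)."""
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = beta * v + grad_f(x)   # momentum accumulation
        x = x - lr * v             # step along the accumulated direction
    return x

# Stretched bowl: L(x) = x1^2 + 10*x2^2
grad_bowl = lambda x: np.array([2 * x[0], 20 * x[1]])
print(sgd_momentum(grad_bowl, [3.0, 2.0], lr=0.02))  # near [0, 0]
```

Compared with Adam, there is no per-parameter scaling here: momentum only smooths the direction, which is why it still zigzags on badly conditioned bowls unless the learning rate is kept small.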
Where to go next
- → Backpropagation — where the $\nabla L$ in the update rule actually comes from.
- → The Perceptron — historical precursor, but the learning rule is genuinely different from gradient descent.
- Deep Learning Book Chapter 8 — rigorous coverage of optimization for deep nets.
- Adam: A Method for Stochastic Optimization (Kingma & Ba, 2014)
- Distill: Why Momentum Really Works — gorgeous interactive explanation.