Mathematical Optimization

Pick variables to make a number as small as possible, subject to rules. That's the entire field — and it's the engine under every neural network, every airline schedule, every portfolio, and every routing system. By the end of this page you'll know the language, the main algorithms, and where to look when your problem doesn't fit a textbook.

Prereq: multivariable calculus, linear algebra · Read time: ~30 min · Interactive figures: 1 · Code: NumPy, SciPy

1. Why optimization is the most important applied math topic

If you only had time to learn one branch of applied math in 2026, this is the one. Training a neural network, scheduling an airline, balancing a portfolio, routing a delivery fleet: each is the same problem wearing different clothes.

The whole field has one shape: here are some knobs, here is a number you want small, here are some rules you can't break, find the best knob settings. The trick is that "best" can mean a billion different things and the rules can be linear, quadratic, smooth, discrete, stochastic, or downright nasty. Different shapes get different algorithms. This page is the map.

If you've used gradient descent before but never thought of it as a special case of a much bigger field, this page is the bridge. If you've never seen it, that's fine too — we'll build it from scratch.

2. Anatomy of an optimization problem

Every optimization problem has three pieces. Once you can name them in any problem you see, you've already won half the battle.

  1. Decision variables — the knobs you control. Usually written $\mathbf{x} \in \mathbb{R}^n$. They could be neural network weights, money allocated to assets, on/off switches for power plants, or angles of a robot arm.
  2. Objective function — the number you want to make small (or large). Written $f(\mathbf{x})$. Think loss, cost, negative profit, energy, residual error.
  3. Constraints — the rules. Things like "weights must sum to 1," "power output can't be negative," "this beam can't bend more than 2 cm." Written as equalities $g_i(\mathbf{x}) = 0$ or inequalities $h_j(\mathbf{x}) \leq 0$.

The general form looks like this:

$$\begin{aligned} \min_{\mathbf{x} \in \mathbb{R}^n} \quad & f(\mathbf{x}) \\ \text{subject to} \quad & g_i(\mathbf{x}) = 0, \quad i = 1, \dots, m \\ & h_j(\mathbf{x}) \leq 0, \quad j = 1, \dots, p \end{aligned}$$

The general nonlinear program

$\min$
"Minimize." Maximization is the same problem with $-f$, so by convention we always write minimization.
$\mathbf{x}$
The decision vector — the unknowns you're solving for. Bold because it's a vector, not a single number.
$\mathbb{R}^n$
$n$-dimensional real space — the domain you're searching over. $n$ might be 2 (two knobs) or $10^{12}$ (a large language model's weights).
$f(\mathbf{x})$
The objective function. Takes a candidate vector, returns a single number you want small.
$g_i(\mathbf{x}) = 0$
An equality constraint. Forces $\mathbf{x}$ onto a surface where $g_i$ vanishes — for example "the budget is exactly $1 million."
$h_j(\mathbf{x}) \leq 0$
An inequality constraint. Forces $\mathbf{x}$ into a half-space — for example "you can't buy a negative number of shares."
$m, p$
How many equality and inequality constraints you have, respectively. Either can be zero.

Analogy. Imagine a hiker on a hilly landscape who wants to reach the lowest point. The hills are $f$. A rope tied to a tree limits where they can walk — that's an equality constraint (they have to stay on the rope). A fence on one side they can't cross — that's an inequality constraint. The lowest point on the rope, behind the fence, is the optimum.

The set of all $\mathbf{x}$ that satisfy every constraint is called the feasible region, written $\mathcal{F}$. If $\mathcal{F}$ is empty, the problem is infeasible. If $f$ goes to $-\infty$ on $\mathcal{F}$, the problem is unbounded. Both are real things that happen, and a good solver will tell you which.
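To make the anatomy concrete, here is a minimal sketch of handing the general form to an off-the-shelf solver, SciPy's `minimize`. The objective and constraints below are illustrative choices, not from the text. One wrinkle: SciPy writes inequality constraints as $g(\mathbf{x}) \geq 0$, the opposite sign convention from the $h_j(\mathbf{x}) \leq 0$ above, so you flip signs when translating.

```python
import numpy as np
from scipy.optimize import minimize

# Objective: squared distance to the point (1, 2.5)
def f(x):
    return (x[0] - 1)**2 + (x[1] - 2.5)**2

# SciPy expects inequality constraints as g(x) >= 0, so an
# h(x) <= 0 constraint from the text becomes -h(x) >= 0.
cons = [
    {"type": "ineq", "fun": lambda x:  x[0] - 2*x[1] + 2},
    {"type": "ineq", "fun": lambda x: -x[0] - 2*x[1] + 6},
    {"type": "ineq", "fun": lambda x: -x[0] + 2*x[1] + 2},
]
bounds = [(0, None), (0, None)]   # x >= 0, y >= 0

res = minimize(f, x0=[2.0, 0.0], method="SLSQP",
               bounds=bounds, constraints=cons)
print(res.success, res.x)         # constrained optimum, near (1.4, 1.7)
```

Here every constraint can be satisfied, so the solver returns a feasible optimum (analytically at $(1.4, 1.7)$, where the first constraint is active). With contradictory constraints it would instead come back with `res.success == False`.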

Vocabulary cheat-sheet

| Symbol | Name | What it is |
| --- | --- | --- |
| $\mathbf{x}$ | decision vector | The variables you optimize over |
| $f$ | objective | The function you minimize |
| $\nabla f$ | gradient | Vector of first partial derivatives — points uphill |
| $\nabla^2 f$ or $H$ | Hessian | Matrix of second partial derivatives — measures curvature |
| $\mathbf{x}^\star$ | optimum | An $\mathbf{x}$ that minimizes $f$ |
| $\mathcal{F}$ | feasible region | Set of $\mathbf{x}$ satisfying all constraints |
| $\eta$ | step size / learning rate | How far you move per iteration |
| $\boldsymbol{\lambda}, \boldsymbol{\mu}$ | multipliers | Scalar penalties on constraints (Lagrange / KKT) |

3. Unconstrained optimization — the calculus crash course

Start with the simplest case: no constraints. The problem is just

$$\min_{\mathbf{x} \in \mathbb{R}^n} \, f(\mathbf{x})$$

Unconstrained minimization

$\min$
Find the smallest value.
$\mathbf{x} \in \mathbb{R}^n$
$\mathbf{x}$ is any vector of $n$ real numbers — no rules.
$f(\mathbf{x})$
A scalar-valued function of $\mathbf{x}$ that you want small.

Why this matters. Most ML training is unconstrained — you let the weights wander wherever they want. So this case is the workhorse, even though it sounds restrictive.

From multivariable calculus, the necessary condition for a minimum is that the gradient vanishes. Intuitively, if any partial derivative were nonzero you could nudge $\mathbf{x}$ in that direction and make $f$ smaller. So a minimum must be a flat spot — a critical point.

$$\nabla f(\mathbf{x}^\star) = \mathbf{0}$$

First-order necessary condition

$\nabla f$
The gradient — a vector whose $i$-th entry is $\partial f / \partial x_i$, the partial derivative of $f$ with respect to coordinate $i$.
$\mathbf{x}^\star$
A candidate optimum. The star is just convention.
$\mathbf{0}$
The zero vector — every component is zero.

Analogy. On a smooth landscape, the bottom of a valley, the top of a hill, and the saddle of a mountain pass all have zero slope in every direction. The slope being zero is necessary for "lowest point," but not sufficient — it could be any of those three things.

To rule out hilltops and saddles you check the second derivative. The second-order condition says the Hessian — the matrix of second partials — must be positive semi-definite at a minimum:

$$\nabla^2 f(\mathbf{x}^\star) \succeq 0$$

Second-order necessary condition

$\nabla^2 f$
The Hessian — the $n \times n$ matrix whose $(i,j)$ entry is $\partial^2 f / \partial x_i \partial x_j$. It encodes how the gradient changes as you move.
$\succeq 0$
"Positive semi-definite." A symmetric matrix $H$ is PSD if $\mathbf{v}^\top H \mathbf{v} \geq 0$ for every vector $\mathbf{v}$. Equivalently, all eigenvalues are non-negative.

Why PSD. If you Taylor-expand $f$ around $\mathbf{x}^\star$, the leading correction is $\tfrac{1}{2} \mathbf{v}^\top H \mathbf{v}$ for a small step $\mathbf{v}$. For $\mathbf{x}^\star$ to be a minimum, every direction has to make $f$ go up, so that quadratic term must be $\geq 0$ in every direction. That's exactly what PSD says.

If the Hessian is strictly positive definite (all eigenvalues $> 0$), $\mathbf{x}^\star$ is a strict local minimum. If it has a negative eigenvalue, $\mathbf{x}^\star$ is a saddle point — there's some direction in which you can decrease $f$ further. We'll come back to saddles when we discuss neural networks.
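The second-order test is mechanical enough to code up. A small sketch (the function name `classify` and the example Hessians are mine) that inspects the eigenvalues of a symmetric Hessian:

```python
import numpy as np

def classify(H, tol=1e-9):
    """Classify a critical point from its symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > tol):
        return "strict local minimum"
    if np.all(eig < -tol):
        return "strict local maximum"
    if eig.min() < -tol and eig.max() > tol:
        return "saddle point"
    return "inconclusive (semi-definite)"

# f(x,y) = x^2 + y^2 at (0,0): Hessian [[2,0],[0,2]]
print(classify(np.array([[2.0, 0.0], [0.0, 2.0]])))   # strict local minimum
# f(x,y) = x^2 - y^2 at (0,0): Hessian [[2,0],[0,-2]]
print(classify(np.array([[2.0, 0.0], [0.0, -2.0]])))  # saddle point
# f(x,y) = x^4 + y^4 at (0,0): Hessian is all zeros
print(classify(np.zeros((2, 2))))                     # inconclusive
```

The all-zeros case shows why these conditions are only necessary: $x^4 + y^4$ does have a minimum at the origin, but the second-order test alone can't certify it.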

Local vs global

A point $\mathbf{x}^\star$ is a local minimum if $f(\mathbf{x}^\star) \leq f(\mathbf{x})$ for all $\mathbf{x}$ in some neighborhood. It's a global minimum if the inequality holds for every $\mathbf{x}$ in the domain. Local minima are easy — just walk downhill until the ground is flat. Global minima can be wildly hard — you might have to check infinitely many local valleys. The single most important question in optimization is: does my problem only have one valley? When the answer is yes, life is good. The technical name for that condition is convexity.
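When you can't prove convexity, the crude but common workaround is multistart: run a local solver from several starting points and keep the best result. A sketch with SciPy (the two-valley quartic below is an illustrative choice, not from the text):

```python
import numpy as np
from scipy.optimize import minimize

# An illustrative two-valley function: the left valley is deeper.
f = lambda x: x[0]**4 - 4*x[0]**2 + x[0]

# Multistart: run a local solver from several points, keep the best.
best = None
for x0 in np.linspace(-3, 3, 7):
    res = minimize(f, x0=[x0])
    if best is None or res.fun < best.fun:
        best = res

print(best.x, best.fun)   # deeper valley, near x = -1.47, f = -5.44
```

A single start near the shallow right valley would return the local minimum around $f \approx -2.6$ and never know the better one existed; spreading the starts finds the deeper valley, though in general no finite number of restarts comes with a guarantee.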

Source — Unconstrained Optimization (pseudocode, then Python, JavaScript, C, C++, Java, and Go)
// f(x) = x^2 + 2x + 1 = (x+1)^2, minimum at x = -1
// f'(x) = 2x + 2

function gradient_descent(x0, lr, n):
    x = x0
    for i = 1 to n:
        grad = 2*x + 2
        x = x - lr * grad
    return x

function newton_1d(x0, n):
    x = x0
    for i = 1 to n:
        grad   = 2*x + 2    // f'(x)
        hess   = 2           // f''(x)
        x = x - grad / hess  // one Newton step
    return x

// gradient_descent(-5, 0.1, 50) => ~-1.0
// newton_1d(-5, 3)              => -1.0 (exact in 1 step for quadratic)
def f(x):
    return x**2 + 2*x + 1   # (x+1)^2, minimum at x = -1

def f_grad(x):
    return 2*x + 2

def f_hess(x):
    return 2.0               # constant second derivative

def gradient_descent(f_grad, x0, lr=0.1, n=50):
    x = x0
    history = [x]
    for _ in range(n):
        x = x - lr * f_grad(x)
        history.append(x)
    return x, history

def newton_1d(f_grad, f_hess, x0, n=10):
    x = x0
    for _ in range(n):
        x = x - f_grad(x) / f_hess(x)
    return x

x_gd, hist = gradient_descent(f_grad, x0=-5.0, lr=0.1, n=50)
print(f"GD result: x = {x_gd:.6f}, f(x) = {f(x_gd):.6f}")
# => x = -1.000000

x_nwt = newton_1d(f_grad, f_hess, x0=-5.0)
print(f"Newton result: x = {x_nwt:.6f}")
# => x = -1.000000  (exact for quadratic)
function f(x)      { return x*x + 2*x + 1; }
function fGrad(x)  { return 2*x + 2; }
function fHess()   { return 2.0; }

function gradientDescent(fGrad, x0, lr = 0.1, n = 50) {
  let x = x0;
  const history = [x];
  for (let i = 0; i < n; i++) {
    x = x - lr * fGrad(x);
    history.push(x);
  }
  return { x, history };
}

function newton1D(fGrad, fHess, x0, n = 10) {
  let x = x0;
  for (let i = 0; i < n; i++) {
    x = x - fGrad(x) / fHess(x);
  }
  return x;
}

const { x: xGD } = gradientDescent(fGrad, -5, 0.1, 50);
console.log("GD:", xGD.toFixed(6));   // => -1.000000

const xNwt = newton1D(fGrad, fHess, -5, 5);
console.log("Newton:", xNwt.toFixed(6)); // => -1.000000
#include <stdio.h>
#include <math.h>

double f(double x)     { return x*x + 2*x + 1; }
double f_grad(double x){ return 2*x + 2; }
double f_hess()        { return 2.0; }

double gradient_descent(double x0, double lr, int n) {
    double x = x0;
    for (int i = 0; i < n; i++) {
        x = x - lr * f_grad(x);
    }
    return x;
}

double newton_1d(double x0, int n) {
    double x = x0;
    for (int i = 0; i < n; i++) {
        x = x - f_grad(x) / f_hess();
    }
    return x;
}

int main(void) {
    double xGD  = gradient_descent(-5.0, 0.1, 50);
    double xNwt = newton_1d(-5.0, 10);
    printf("GD:     x = %.6f, f = %.6f\n", xGD,  f(xGD));
    printf("Newton: x = %.6f, f = %.6f\n", xNwt, f(xNwt));
    return 0;
}
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>

double f(double x)     { return x*x + 2*x + 1; }
double fGrad(double x) { return 2*x + 2; }
double fHess(double)   { return 2.0; }

double gradientDescent(std::function<double(double)> grad,
                       double x0, double lr, int n) {
    double x = x0;
    for (int i = 0; i < n; i++) x -= lr * grad(x);
    return x;
}

double newton1D(std::function<double(double)> grad,
                std::function<double(double)> hess,
                double x0, int n) {
    double x = x0;
    for (int i = 0; i < n; i++) x -= grad(x) / hess(x);
    return x;
}

int main() {
    double xGD  = gradientDescent(fGrad, -5.0, 0.1, 50);
    double xNwt = newton1D(fGrad, fHess, -5.0, 10);
    std::cout << "GD:     x = " << xGD  << ", f = " << f(xGD)  << "\n";
    std::cout << "Newton: x = " << xNwt << ", f = " << f(xNwt) << "\n";
}
public class UnconstrainedOpt {
    static double f(double x)     { return x*x + 2*x + 1; }
    static double fGrad(double x) { return 2*x + 2; }
    static double fHess()         { return 2.0; }

    static double gradientDescent(double x0, double lr, int n) {
        double x = x0;
        for (int i = 0; i < n; i++) x -= lr * fGrad(x);
        return x;
    }

    static double newton1D(double x0, int n) {
        double x = x0;
        for (int i = 0; i < n; i++) x -= fGrad(x) / fHess();
        return x;
    }

    public static void main(String[] args) {
        double xGD  = gradientDescent(-5.0, 0.1, 50);
        double xNwt = newton1D(-5.0, 10);
        System.out.printf("GD:     x = %.6f, f = %.6f%n", xGD,  f(xGD));
        System.out.printf("Newton: x = %.6f, f = %.6f%n", xNwt, f(xNwt));
    }
}
package main

import "fmt"

func f(x float64) float64     { return x*x + 2*x + 1 }
func fGrad(x float64) float64 { return 2*x + 2 }
func fHess() float64          { return 2.0 }

func gradientDescent(x0, lr float64, n int) float64 {
	x := x0
	for i := 0; i < n; i++ {
		x -= lr * fGrad(x)
	}
	return x
}

func newton1D(x0 float64, n int) float64 {
	x := x0
	for i := 0; i < n; i++ {
		x -= fGrad(x) / fHess()
	}
	return x
}

func main() {
	xGD  := gradientDescent(-5.0, 0.1, 50)
	xNwt := newton1D(-5.0, 10)
	fmt.Printf("GD:     x = %.6f, f = %.6f\n", xGD,  f(xGD))
	fmt.Printf("Newton: x = %.6f, f = %.6f\n", xNwt, f(xNwt))
}

4. Convexity — the dividing line between easy and hard

Convex problems are the ones we know how to solve. Non-convex problems are the ones we mostly pray about. The line between them is sharp, and it's the single most important taxonomy in the field.

Convex sets

A set $\mathcal{C} \subseteq \mathbb{R}^n$ is convex if the line segment between any two points in $\mathcal{C}$ stays inside $\mathcal{C}$. Formally:

$$\mathbf{x}, \mathbf{y} \in \mathcal{C} \;\Longrightarrow\; \theta \mathbf{x} + (1 - \theta) \mathbf{y} \in \mathcal{C}, \quad \forall \theta \in [0, 1]$$

Convex set

$\mathcal{C}$
A subset of $\mathbb{R}^n$ — a region in space.
$\mathbf{x}, \mathbf{y}$
Any two points in that region.
$\theta$
A weight between 0 and 1. As $\theta$ moves from 1 to 0, the combination $\theta \mathbf{x} + (1-\theta)\mathbf{y}$ slides from $\mathbf{x}$ to $\mathbf{y}$ along the straight line.
$\Longrightarrow$
"Implies." The condition has to hold for every choice of $\mathbf{x}, \mathbf{y}, \theta$.

Picture. A disc is convex; a donut is not (the line between two points on opposite rims passes through the hole). A triangle, a square, a half-plane, the positive orthant $\{\mathbf{x} : \mathbf{x} \geq 0\}$, and any intersection of half-spaces are all convex.
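The definition suggests a direct numerical test. A sketch (helper names are mine) that samples pairs of points from a set and checks whether their blends stay inside; a disc passes, while an annulus, the 2-D donut, fails:

```python
import numpy as np

rng = np.random.default_rng(0)

def seems_convex(contains, sampler, trials=2000):
    """Empirically test convexity: draw random member pairs and a
    random theta, and check the blend is still a member."""
    for _ in range(trials):
        x, y = sampler(), sampler()
        theta = rng.uniform()
        if not contains(theta * x + (1 - theta) * y):
            return False
    return True

# Unit disc: convex.
disc_in = lambda p: p @ p <= 1.0
def disc_sample():
    while True:                       # rejection sampling from the square
        p = rng.uniform(-1, 1, size=2)
        if disc_in(p):
            return p

# Annulus ("donut") with inner radius 0.5: not convex.
ann_in = lambda p: 0.25 <= p @ p <= 1.0
def ann_sample():
    while True:
        p = rng.uniform(-1, 1, size=2)
        if ann_in(p):
            return p

print(seems_convex(disc_in, disc_sample))  # True
print(seems_convex(ann_in,  ann_sample))   # False
```

A sampling test like this can only ever refute convexity, never prove it, but it catches the donut immediately: any near-antipodal pair blends through the hole.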

Convex functions

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its graph lies below every chord. Formally:

$$f(\theta \mathbf{x} + (1 - \theta) \mathbf{y}) \;\leq\; \theta f(\mathbf{x}) + (1 - \theta) f(\mathbf{y}), \quad \forall \theta \in [0, 1]$$

Convex function

$f$
A function that takes a vector and returns a scalar.
$\mathbf{x}, \mathbf{y}$
Any two points in the domain.
$\theta$
Mixing coefficient between 0 and 1.
LHS
The function evaluated at a blended point on the segment from $\mathbf{y}$ to $\mathbf{x}$ (the midpoint when $\theta = 1/2$).
RHS
The same blend of the two function values: the height of the chord above that point.

Picture. Draw a parabola $f(x) = x^2$. Pick any two points on it; draw the straight line between them. The parabola is below that line. That's convexity in one dimension. In higher dimensions it means: a bowl, not a saddle, not a wavy carpet.

This inequality is called Jensen's inequality in its simplest form. It generalizes to expectations: $f(\mathbb{E}[X]) \leq \mathbb{E}[f(X)]$ for any random variable $X$ and convex $f$ — a fact you'll meet again in probability and information theory.
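The expectation form is easy to check numerically. A minimal sketch with the convex $f(x) = x^2$ and an exponential random variable (both arbitrary choices; any convex $f$ and any $X$ with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.exponential(scale=2.0, size=100_000)  # E[X] = 2, E[X^2] = 8

f = np.square                                  # a convex function

lhs = f(X.mean())          # f(E[X])
rhs = f(X).mean()          # E[f(X)]
print(lhs, rhs)            # lhs <= rhs, per Jensen
```

For $f(x) = x^2$ the gap is exactly the variance, $\mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \mathrm{Var}(X)$, so here Jensen's inequality is just the statement that variance is non-negative.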

If $f$ is twice differentiable, an equivalent condition is that the Hessian is PSD everywhere:

$$\nabla^2 f(\mathbf{x}) \succeq 0, \quad \forall \mathbf{x} \in \mathbb{R}^n$$

Twice-differentiable convexity

$\nabla^2 f(\mathbf{x})$
The Hessian matrix of $f$ at $\mathbf{x}$.
$\succeq 0$
Positive semi-definite at every point in the domain — the function is curving upward (or flat) in every direction at every point.

Why convexity is the dividing line. For a convex problem, every local minimum is a global minimum. There are no other valleys to check. So a local solver that just walks downhill is guaranteed to find the right answer. For non-convex problems, no such guarantee exists, and we have to live with heuristics, restarts, and luck.

The fundamental theorem you should walk away with:

For a convex objective on a convex feasible set, every local minimum is a global minimum. Walk downhill from anywhere and you've solved it.

This is why people work so hard to convexify problems — replacing a nasty $\ell_0$ sparsity penalty with $\ell_1$ (LASSO), or relaxing integer constraints to continuous ones, or fitting an SVM by minimizing a hinge loss. Convexity is the prize, and it's worth bending the problem to get it.
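That convexification is concrete enough to show in one dimension. For $\min_x \tfrac{1}{2}(x - b)^2 + \lambda |x|$, the convex $\ell_1$ problem has a closed-form solution, the soft-threshold operator (a standard result; the sketch below uses names of my choosing):

```python
import numpy as np

def soft_threshold(b, lam):
    """argmin_x 0.5*(x - b)^2 + lam*|x|  (the 1-D LASSO subproblem).
    Convex, so this closed form is the global optimum."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b = np.array([3.0, 0.4, -1.2, 0.05])
print(soft_threshold(b, lam=0.5))   # [2.5, 0.0, -0.7, 0.0]
```

Entries of $b$ smaller than $\lambda$ in magnitude are set exactly to zero, which is the sparsity the $\ell_0$ penalty was after, now obtained from a problem any convex solver can handle.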

Source — Convexity Check (pseudocode, then Python, JavaScript, C, C++, Java, and Go)
// 1D convexity: check d²f/dx² >= 0 numerically at sample points
function is_convex_1d(f, a, b, n_samples):
    h = (b - a) / n_samples
    for i = 1 to n_samples - 1:
        x = a + i * h
        d2f = (f(x+h) - 2*f(x) + f(x-h)) / h^2
        if d2f < -1e-9:
            return false
    return true

// 2D convexity: check all eigenvalues of Hessian >= 0
// Hessian of f(x,y) = [f_xx, f_xy; f_xy, f_yy]
// A 2x2 matrix is PSD iff trace >= 0 AND determinant >= 0
function is_psd_2x2(fxx, fxy, fyy):
    trace = fxx + fyy
    det   = fxx * fyy - fxy * fxy
    return trace >= 0 and det >= 0
import numpy as np

# 1D convexity: check second derivative numerically
def is_convex_1d(f, a, b, n_samples=200):
    xs = np.linspace(a, b, n_samples)
    h  = xs[1] - xs[0]
    d2f = (f(xs[2:]) - 2*f(xs[1:-1]) + f(xs[:-2])) / h**2
    return bool(np.all(d2f >= -1e-9))

# 2D convexity: check Hessian PSD via eigenvalues
def is_convex_2d(f, center, eps=1e-5):
    """Numerically estimate Hessian at center and check PSD."""
    x, y = center
    fxx = (f(x+eps,y) - 2*f(x,y) + f(x-eps,y)) / eps**2
    fyy = (f(x,y+eps) - 2*f(x,y) + f(x,y-eps)) / eps**2
    fxy = (f(x+eps,y+eps) - f(x+eps,y-eps)
           - f(x-eps,y+eps) + f(x-eps,y-eps)) / (4*eps**2)
    H = np.array([[fxx, fxy], [fxy, fyy]])
    eigvals = np.linalg.eigvalsh(H)
    return bool(np.all(eigvals >= -1e-9)), eigvals

# Examples
print(is_convex_1d(lambda x: x**2,    -3, 3))  # True  (parabola)
print(is_convex_1d(lambda x: -x**2,   -3, 3))  # False (concave)
print(is_convex_1d(lambda x: abs(x),  -3, 3))  # True  (convex, non-smooth)

f2d = lambda x, y: x**2 + y**2       # bowl — convex
print(is_convex_2d(f2d, (0.0, 0.0)))  # (True, [2., 2.])

g2d = lambda x, y: x**2 - y**2       # saddle — not convex
print(is_convex_2d(g2d, (0.0, 0.0)))  # (False, [-2., 2.])
// 1D convexity: check d²f/dx² >= 0 numerically
function isConvex1D(f, a, b, nSamples = 200) {
  const h = (b - a) / nSamples;
  for (let i = 1; i < nSamples - 1; i++) {
    const x  = a + i * h;
    const d2 = (f(x + h) - 2 * f(x) + f(x - h)) / (h * h);
    if (d2 < -1e-9) return false;
  }
  return true;
}

// 2D PSD check for a 2x2 Hessian (trace >= 0 and det >= 0)
function isPSD2x2(fxx, fxy, fyy) {
  return (fxx + fyy) >= 0 && (fxx * fyy - fxy * fxy) >= 0;
}

function isConvex2D(f, cx, cy, eps = 1e-5) {
  const fxx = (f(cx+eps,cy) - 2*f(cx,cy) + f(cx-eps,cy)) / (eps*eps);
  const fyy = (f(cx,cy+eps) - 2*f(cx,cy) + f(cx,cy-eps)) / (eps*eps);
  const fxy = (f(cx+eps,cy+eps) - f(cx+eps,cy-eps)
             - f(cx-eps,cy+eps) + f(cx-eps,cy-eps)) / (4*eps*eps);
  return isPSD2x2(fxx, fxy, fyy);
}

console.log(isConvex1D(x => x*x,  -3, 3)); // true
console.log(isConvex1D(x => -x*x, -3, 3)); // false
console.log(isConvex2D((x,y) => x*x+y*y, 0, 0)); // true
#include <stdio.h>
#include <math.h>

typedef double (*Fn1)(double);

int is_convex_1d(Fn1 f, double a, double b, int n) {
    double h = (b - a) / n;
    for (int i = 1; i < n - 1; i++) {
        double x  = a + i * h;
        double d2 = (f(x+h) - 2*f(x) + f(x-h)) / (h*h);
        if (d2 < -1e-9) return 0;
    }
    return 1;
}

double bowl(double x)    { return x*x; }
double concave(double x) { return -x*x; }

int main(void) {
    printf("bowl convex:    %d\n", is_convex_1d(bowl,    -3, 3, 200)); // 1
    printf("concave convex: %d\n", is_convex_1d(concave, -3, 3, 200)); // 0
    return 0;
}
#include <iostream>
#include <functional>
#include <cmath>

bool isConvex1D(std::function<double(double)> f, double a, double b, int n=200) {
    double h = (b - a) / n;
    for (int i = 1; i < n - 1; i++) {
        double x  = a + i * h;
        double d2 = (f(x+h) - 2*f(x) + f(x-h)) / (h*h);
        if (d2 < -1e-9) return false;
    }
    return true;
}

bool isConvex2D(std::function<double(double,double)> f, double cx, double cy, double eps=1e-5) {
    double fxx = (f(cx+eps,cy) - 2*f(cx,cy) + f(cx-eps,cy)) / (eps*eps);
    double fyy = (f(cx,cy+eps) - 2*f(cx,cy) + f(cx,cy-eps)) / (eps*eps);
    double fxy = (f(cx+eps,cy+eps) - f(cx+eps,cy-eps)
                - f(cx-eps,cy+eps) + f(cx-eps,cy-eps)) / (4*eps*eps);
    return (fxx + fyy) >= 0 && (fxx*fyy - fxy*fxy) >= 0;
}

int main() {
    std::cout << isConvex1D([](double x){ return x*x; },   -3, 3) << "\n"; // 1
    std::cout << isConvex1D([](double x){ return -x*x; },  -3, 3) << "\n"; // 0
    std::cout << isConvex2D([](double x, double y){ return x*x+y*y; }, 0, 0) << "\n"; // 1
}
import java.util.function.DoubleUnaryOperator;

public class ConvexityCheck {
    static boolean isConvex1D(DoubleUnaryOperator f, double a, double b, int n) {
        double h = (b - a) / n;
        for (int i = 1; i < n - 1; i++) {
            double x  = a + i * h;
            double d2 = (f.applyAsDouble(x+h) - 2*f.applyAsDouble(x)
                        + f.applyAsDouble(x-h)) / (h*h);
            if (d2 < -1e-9) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isConvex1D(x -> x*x,  -3, 3, 200)); // true
        System.out.println(isConvex1D(x -> -x*x, -3, 3, 200)); // false
    }
}
package main

import "fmt"

func isConvex1D(f func(float64) float64, a, b float64, n int) bool {
	h := (b - a) / float64(n)
	for i := 1; i < n-1; i++ {
		x  := a + float64(i)*h
		d2 := (f(x+h) - 2*f(x) + f(x-h)) / (h * h)
		if d2 < -1e-9 {
			return false
		}
	}
	return true
}

func main() {
	bowl    := func(x float64) float64 { return x * x }
	concave := func(x float64) float64 { return -x * x }
	fmt.Println(isConvex1D(bowl,    -3, 3, 200)) // true
	fmt.Println(isConvex1D(concave, -3, 3, 200)) // false
}

5. Interactive — gradient descent tracer

Click anywhere on the surface below. A ball drops at that point and starts rolling downhill, following the negative gradient. The contours show a 2D loss function that's a mix of two Gaussian wells (one shallow, one deep) — non-convex enough to be interesting. Slide the learning rate to see GD overshoot, oscillate, or diverge. Add momentum to roll past small bumps. Try starting from different points to see if you always end up at the same place.

(Demo controls: learning rate η, default 0.08; momentum β, default 0.00; steps, default 60.)

Click on the surface to drop a ball. Lower contours are darker; the two basins are not equally deep.


6. Gradient descent — the world's most-used algorithm

Gradient descent is what the ball is doing in the demo above. The idea is one sentence: at each step, move opposite the gradient. The gradient points uphill (steepest ascent), so the negative gradient points downhill (steepest descent). The update rule is:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \nabla f(\mathbf{x}_t)$$

Gradient descent update

$\mathbf{x}_t$
The current iterate — your guess at step $t$.
$\mathbf{x}_{t+1}$
The new guess after one step.
$\nabla f(\mathbf{x}_t)$
The gradient of the objective at the current point. Points uphill.
$\eta$
The learning rate or step size. A small positive scalar that controls how far you move per step. In ML literature it's $\eta$ or $\alpha$ or $\text{lr}$; same thing.
$-$
The minus sign turns "uphill" into "downhill." That's the entire algorithm.

Analogy. You're standing in fog on a hillside and you want to get to the bottom. You can feel the slope under your feet. Take a small step in the steepest downhill direction. Repeat. If the step is too small you'll be there forever; too large and you'll trip and fall up the next hill. The right step size is the entire art.

Convergence

For a convex, $L$-smooth function (meaning the gradient doesn't change too fast — formally, $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L \|\mathbf{x} - \mathbf{y}\|$), gradient descent with constant step $\eta = 1/L$ satisfies:

$$f(\mathbf{x}_t) - f(\mathbf{x}^\star) \leq \frac{L \, \|\mathbf{x}_0 - \mathbf{x}^\star\|^2}{2t}$$

GD convergence rate (convex, smooth)

$f(\mathbf{x}_t) - f(\mathbf{x}^\star)$
The "suboptimality gap" — how far off you are from the true minimum value at iteration $t$.
$L$
The Lipschitz constant of the gradient — a measure of how curved the function can be. Larger $L$ means more curvature, which means smaller safe steps.
$\|\mathbf{x}_0 - \mathbf{x}^\star\|^2$
Squared distance from your starting point to the optimum — the $\|\cdot\|$ is the standard Euclidean norm.
$1/t$
The crucial part — error shrinks like $1/t$. To halve the error, you need twice as many steps. This is "sublinear" convergence — slow but steady.

Translation. GD is slow but it works. For strongly convex functions (curvature bounded below as well as above), the rate improves to $(1 - \mu/L)^t$, which is exponentially fast. The ratio $L/\mu$ is called the condition number and it's the single number that says how hard your problem is.
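You can watch the condition number bite with a sketch (the diagonal quadratic and the counting helper are illustrative choices; $\eta = 1/L$ as in the rate above):

```python
import numpy as np

def gd_steps(kappa, tol=1e-6, max_iter=100_000):
    """Steps GD needs on f(x) = 0.5*(x1^2 + kappa*x2^2) with eta = 1/L.
    Here mu = 1 and L = kappa, so kappa is the condition number."""
    L = kappa
    eta = 1.0 / L
    x = np.array([1.0, 1.0])
    for t in range(max_iter):
        g = np.array([x[0], kappa * x[1]])   # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            return t
        x = x - eta * g
    return max_iter

for kappa in [1, 10, 100]:
    print(kappa, gd_steps(kappa))
```

The step counts grow roughly in proportion to $\kappa$, which is what the $(1 - \mu/L)^t$ rate predicts: about $\kappa \log(1/\epsilon)$ iterations to reach accuracy $\epsilon$.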

Learning rate sensitivity

Too small: GD crawls. Too large: GD overshoots and may diverge. The exact safe range depends on the curvature of $f$. For a quadratic $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top H \mathbf{x}$ with $H$ symmetric positive-definite, the largest stable step is $\eta < 2/\lambda_{\max}(H)$, where $\lambda_{\max}$ is the largest eigenvalue. In practice you try a few values spaced logarithmically (0.001, 0.01, 0.1, ...) and pick the largest one that doesn't blow up. Most of the variants in the next section exist to make this less painful.
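The stability threshold is easy to verify on a concrete quadratic. A sketch (the matrix, with eigenvalues 1 and 4, and the two step sizes are illustrative choices):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T H x with eigenvalues 1 and 4.
H = np.array([[2.5, 1.5],
              [1.5, 2.5]])
lam_max = np.linalg.eigvalsh(H).max()     # 4.0
print("stability threshold:", 2 / lam_max)  # 0.5

def run_gd(eta, steps=100):
    x = np.array([1.0, 0.0])
    for _ in range(steps):
        x = x - eta * (H @ x)             # gradient of the quadratic is Hx
    return np.linalg.norm(x)

print(run_gd(0.45))   # eta < 2/lam_max: shrinks toward 0
print(run_gd(0.55))   # eta > 2/lam_max: blows up
```

Just below the threshold the iterates shrink; just above it they grow without bound, even though the objective is a perfectly convex bowl.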

Source — Gradient Descent (pseudocode, then Python, JavaScript, C, C++, Java, and Go)
// Gradient descent with convergence check
// grad_fn: computes gradient at x
// x0:  starting point (vector)
// lr:  learning rate
// tol: stop when ||grad|| < tol

function gradient_descent(grad_fn, x0, lr, tol, max_iter):
    x = copy(x0)
    history = [x]
    for t = 1 to max_iter:
        g = grad_fn(x)
        if norm(g) < tol:
            break
        x = x - lr * g
        history.append(x)
    return x, history
import numpy as np

def gradient_descent(grad_fn, x0, lr=0.01, tol=1e-6, max_iter=10000):
    """
    Gradient descent with convergence check.

    Parameters
    ----------
    grad_fn  : callable, returns gradient vector at x
    x0       : ndarray, starting point
    lr       : float, learning rate (step size)
    tol      : float, stop when ||grad|| < tol
    max_iter : int,   maximum iterations

    Returns
    -------
    x        : final iterate
    history  : list of iterates (one per step)
    """
    x = np.array(x0, dtype=float)
    history = [x.copy()]
    for _ in range(max_iter):
        g = grad_fn(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - lr * g
        history.append(x.copy())
    return x, history

# Example: minimize f(x,y) = x^2 + 4y^2  (bowl with different curvatures)
def grad_fn(x):
    return np.array([2*x[0], 8*x[1]])

x_star, hist = gradient_descent(grad_fn, x0=[3.0, 2.0], lr=0.1)
print(f"Converged to {x_star} in {len(hist)} steps")
# => Converged to [~0, ~0]
function gradientDescent(gradFn, x0, lr = 0.01, tol = 1e-6, maxIter = 10000) {
  let x = [...x0];
  const history = [[...x]];

  for (let t = 0; t < maxIter; t++) {
    const g    = gradFn(x);
    const norm = Math.sqrt(g.reduce((s, gi) => s + gi*gi, 0));
    if (norm < tol) break;
    x = x.map((xi, i) => xi - lr * g[i]);
    history.push([...x]);
  }
  return { x, history };
}

// f(x,y) = x^2 + 4y^2
const gradFn = ([x, y]) => [2*x, 8*y];
const { x, history } = gradientDescent(gradFn, [3, 2], 0.1);
console.log("Steps:", history.length, "Final:", x.map(v => v.toFixed(6)));
#include <stdio.h>
#include <math.h>
#include <string.h>

#define N 2

void grad_fn(const double *x, double *g) {
    g[0] = 2.0 * x[0];
    g[1] = 8.0 * x[1];
}

int gradient_descent(double *x, double lr, double tol, int max_iter) {
    double g[N];
    for (int t = 0; t < max_iter; t++) {
        grad_fn(x, g);
        double norm2 = 0;
        for (int i = 0; i < N; i++) norm2 += g[i]*g[i];
        if (sqrt(norm2) < tol) return t;
        for (int i = 0; i < N; i++) x[i] -= lr * g[i];
    }
    return max_iter;
}

int main(void) {
    double x[N] = {3.0, 2.0};
    int steps = gradient_descent(x, 0.1, 1e-6, 10000);
    printf("Steps: %d, x = [%.6f, %.6f]\n", steps, x[0], x[1]);
    return 0;
}
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>
#include <numeric>

using Vec = std::vector<double>;

Vec gradFn(const Vec& x) { return {2*x[0], 8*x[1]}; }

std::pair<Vec, std::vector<Vec>>
gradientDescent(std::function<Vec(const Vec&)> gradFn,
                Vec x, double lr, double tol, int maxIter) {
    std::vector<Vec> history{x};
    for (int t = 0; t < maxIter; t++) {
        Vec g = gradFn(x);
        double norm = std::sqrt(std::inner_product(g.begin(), g.end(), g.begin(), 0.0));
        if (norm < tol) break;
        for (size_t i = 0; i < x.size(); i++) x[i] -= lr * g[i];
        history.push_back(x);
    }
    return {x, history};
}

int main() {
    auto [xStar, hist] = gradientDescent(gradFn, {3.0, 2.0}, 0.1, 1e-6, 10000);
    std::cout << "Steps: " << hist.size()
              << "  x = [" << xStar[0] << ", " << xStar[1] << "]\n";
}
import java.util.*;

public class GradientDescent {
    static double[] gradFn(double[] x) {
        return new double[]{ 2*x[0], 8*x[1] };
    }

    static double norm(double[] g) {
        double s = 0;
        for (double v : g) s += v*v;
        return Math.sqrt(s);
    }

    static double[] gradientDescent(double[] x0, double lr, double tol, int maxIter) {
        double[] x = x0.clone();
        for (int t = 0; t < maxIter; t++) {
            double[] g = gradFn(x);
            if (norm(g) < tol) { System.out.println("Converged at step " + t); break; }
            for (int i = 0; i < x.length; i++) x[i] -= lr * g[i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[] result = gradientDescent(new double[]{3.0, 2.0}, 0.1, 1e-6, 10000);
        System.out.printf("x = [%.6f, %.6f]%n", result[0], result[1]);
    }
}
package main

import (
	"fmt"
	"math"
)

func gradFn(x []float64) []float64 {
	return []float64{2 * x[0], 8 * x[1]}
}

func norm(g []float64) float64 {
	s := 0.0
	for _, v := range g { s += v * v }
	return math.Sqrt(s)
}

func gradientDescent(x []float64, lr, tol float64, maxIter int) ([]float64, int) {
	for t := 0; t < maxIter; t++ {
		g := gradFn(x)
		if norm(g) < tol { return x, t }
		for i := range x { x[i] -= lr * g[i] }
	}
	return x, maxIter
}

func main() {
	x, steps := gradientDescent([]float64{3.0, 2.0}, 0.1, 1e-6, 10000)
	fmt.Printf("Steps: %d, x = [%.6f, %.6f]\n", steps, x[0], x[1])
}

7. Variants — momentum, Adagrad, RMSProp, Adam

Vanilla GD has been augmented dozens of ways since the 1960s. The variants that matter most for ML today:

Momentum (Polyak, 1964)

Add a velocity vector that accumulates past gradients. Instead of moving in the direction of the current gradient alone, move in a smoothed direction.

$$\mathbf{v}_{t+1} = \beta \, \mathbf{v}_t + \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \mathbf{v}_{t+1}$$

Heavy-ball momentum

$\mathbf{v}_t$
The velocity — a moving average of past gradients. Initialized to zero.
$\beta$
Momentum coefficient, typically $0.9$. Larger $\beta$ means a longer memory of the past.
$\nabla f(\mathbf{x}_t)$
The current gradient.
$\eta$
Learning rate, same as before.

Picture. A heavy ball rolling in a long narrow valley. Vanilla GD zigzags across the walls. With momentum the ball builds up speed along the valley floor and barrels straight to the bottom. Helps with ill-conditioned problems and small spurious bumps.
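The valley story in code: heavy-ball momentum against vanilla GD on an elongated quadratic (the quadratic and the hyperparameters below are illustrative choices):

```python
import numpy as np

def run(beta, eta=0.02, steps=200):
    """Heavy-ball on f(x) = 0.5*(x1^2 + 25*x2^2), a narrow valley.
    beta = 0 recovers vanilla gradient descent."""
    x = np.array([10.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        g = np.array([x[0], 25.0 * x[1]])  # gradient
        v = beta * v + g                   # velocity accumulates past grads
        x = x - eta * v
    return np.linalg.norm(x)               # distance from the optimum (0,0)

print("vanilla  (beta=0.0):", run(beta=0.0))
print("momentum (beta=0.9):", run(beta=0.9))
```

After the same 200 steps the momentum run ends up far closer to the optimum: the velocity averages out the side-to-side components of the gradient while the along-valley components add up.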

Nesterov accelerated gradient (Nesterov, 1983)

Look ahead before you accumulate. Evaluate the gradient at the point where momentum would carry you, not where you are now.

$$\mathbf{v}_{t+1} = \beta \, \mathbf{v}_t + \nabla f(\mathbf{x}_t - \eta \beta \mathbf{v}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \mathbf{v}_{t+1}$$

Nesterov accelerated gradient (NAG)

$\mathbf{x}_t - \eta \beta \mathbf{v}_t$
A "lookahead" point: take a momentum step, then evaluate the gradient there. The gradient at the future point is more informative than at the current point.
$\beta, \eta$
Same momentum coefficient and learning rate as plain momentum.

Why bother. For convex smooth problems Nesterov's method achieves $O(1/t^2)$ convergence — the optimal rate for first-order methods, proven impossible to beat. It's not always faster in practice for non-convex ML, but the theory is gorgeous.
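The lookahead update can be sketched in a few lines of NumPy. This is a minimal illustration under the same conventions as the other code on this page, reusing the ill-conditioned bowl $f(x,y) = x^2 + 10y^2$ as a test function:

```python
import numpy as np

def nesterov(grad_fn, x0, lr=0.01, beta=0.9, n=500):
    # Evaluate the gradient at the lookahead point x - lr*beta*v,
    # then update velocity and position as in plain momentum.
    x = np.array(x0, float)
    v = np.zeros_like(x)
    for _ in range(n):
        g = grad_fn(x - lr * beta * v)   # gradient at the lookahead point
        v = beta * v + g
        x = x - lr * v
    return x

grad_fn = lambda z: np.array([2*z[0], 20*z[1]])   # f(x,y) = x^2 + 10y^2
print("NAG:", nesterov(grad_fn, [3.0, 2.0]))
```

Swapping the lookahead gradient for `grad_fn(x)` recovers plain momentum, which makes the two easy to compare side by side.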

Adagrad (Duchi, Hazan, Singer, 2011)

Different parameters need different learning rates. Adagrad gives each parameter its own step size, scaled inversely to the square root of accumulated past squared gradients. Parameters that have already moved a lot get smaller steps.

$$G_t = G_{t-1} + \nabla f(\mathbf{x}_t) \odot \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \odot \nabla f(\mathbf{x}_t)$$

Adagrad

$G_t$
An elementwise running sum of squared gradients. Accumulates forever.
$\odot$
Elementwise (Hadamard) product. Acts component by component, not as a matrix product.
$\sqrt{\,\cdot\,}$
Elementwise square root.
$\varepsilon$
A tiny constant like $10^{-8}$, just to prevent division by zero.

Strength. Great for sparse problems (NLP with bag-of-words) where some features fire rarely. Weakness. $G_t$ grows forever, so the effective learning rate eventually goes to zero. RMSProp and Adam fix that.

RMSProp (Hinton, 2012)

Replace Adagrad's running sum with an exponential moving average. The effective rate stops dying.

$$G_t = \rho \, G_{t-1} + (1-\rho) \, \nabla f(\mathbf{x}_t) \odot \nabla f(\mathbf{x}_t), \qquad \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\eta}{\sqrt{G_t + \varepsilon}} \odot \nabla f(\mathbf{x}_t)$$

RMSProp

$\rho$
Decay rate, typically $0.9$ or $0.99$. Controls how quickly the moving average forgets old gradients.
$G_t$
Now an EMA of squared gradients instead of an unbounded sum.
others
Same as Adagrad.

Story. Hinton introduced RMSProp in a Coursera lecture, never wrote a paper, and it ate the world anyway. Sometimes that's how research goes.
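Since RMSProp never got a paper, a sketch is the clearest spec. This follows the same conventions as the Adagrad code later in this section (denominator written as $\sqrt{G_t} + \varepsilon$), again on the ill-conditioned test bowl:

```python
import numpy as np

def rmsprop(grad_fn, x0, lr=0.01, rho=0.9, eps=1e-8, n=500):
    # Adagrad with an exponential moving average of squared gradients,
    # so the effective per-parameter step size no longer decays to zero.
    x = np.array(x0, float)
    G = np.zeros_like(x)
    for _ in range(n):
        g = grad_fn(x)
        G = rho * G + (1 - rho) * g * g
        x = x - lr * g / (np.sqrt(G) + eps)
    return x

grad_fn = lambda z: np.array([2*z[0], 20*z[1]])   # f(x,y) = x^2 + 10y^2
print("RMSProp:", rmsprop(grad_fn, [3.0, 2.0]))
```

Note the one-line diff from Adagrad: `G = rho * G + (1 - rho) * g * g` instead of `G = G + g * g`.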

Adam (Kingma & Ba, 2014)

Combine momentum (first-moment EMA) with RMSProp (second-moment EMA). Add bias correction so the early steps aren't biased toward zero. The result is the default optimizer for almost every neural network you've used.

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1) \, \nabla f(\mathbf{x}_t)$$

Adam — first moment

$\mathbf{m}_t$
EMA of past gradients — like momentum, but using a weighted average instead of a sum.
$\beta_1$
Decay rate for first moment, default $0.9$.
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2) \, \nabla f(\mathbf{x}_t) \odot \nabla f(\mathbf{x}_t)$$

Adam — second moment

$\mathbf{v}_t$
EMA of squared past gradients — like RMSProp.
$\beta_2$
Decay rate for second moment, default $0.999$.
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}, \quad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \varepsilon}$$

Adam — bias-corrected update

$\hat{\mathbf{m}}_t, \hat{\mathbf{v}}_t$
Bias-corrected moments. The $1 - \beta^t$ factor compensates for the fact that EMAs initialized at zero are biased toward zero in early steps.
$\beta_1^t$
$\beta_1$ raised to the $t$-th power. Tends to zero as $t$ grows, so the correction vanishes.
$\eta$
Learning rate, default $10^{-3}$ for most uses.
$\varepsilon$
$10^{-8}$ for numerical safety.

Why Adam took over. It works well out of the box on a huge range of problems with little tuning. The default $(\eta, \beta_1, \beta_2) = (10^{-3}, 0.9, 0.999)$ is the most reused hyperparameter triple in ML.

You'll meet Adam again in backpropagation and foundation models, where it (or its weight-decay variant AdamW) is the workhorse. There are dozens of newer competitors — Lion, Sophia, Shampoo, Muon — but Adam remains the baseline you should beat.
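AdamW's single change — moving weight decay out of the gradient and applying it directly to the parameters — is small enough to sketch. A minimal illustration, not a reference implementation; the decay constant `wd` here is a typical default we chose, not one prescribed above:

```python
import numpy as np

def adamw(grad_fn, x0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01, n=500):
    # Identical to Adam except the weight-decay term wd * x is applied
    # directly to the parameters, decoupled from the moment estimates.
    x = np.array(x0, float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, n + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)  # decoupled decay
    return x

grad_fn = lambda z: np.array([2*z[0], 20*z[1]])   # f(x,y) = x^2 + 10y^2
print("AdamW:", adamw(grad_fn, [3.0, 2.0]))
```

In plain Adam with L2 regularization, the decay term would pass through the adaptive scaling; decoupling it is the whole point of AdamW.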

Source — SGD Variants (Momentum, AdaGrad, Adam)
// MOMENTUM
v = 0
for t = 1 to T:
    g = grad(x)
    v = beta * v + g           // accumulate velocity
    x = x - lr * v

// ADAGRAD
G = 0
for t = 1 to T:
    g = grad(x)
    G = G + g*g                // accumulate squared gradients
    x = x - lr * g / sqrt(G + eps)

// ADAM (Kingma & Ba, 2014)
m = 0;  v = 0
for t = 1 to T:
    g      = grad(x)
    m      = b1*m + (1-b1)*g          // first moment EMA
    v      = b2*v + (1-b2)*(g*g)      // second moment EMA
    m_hat  = m / (1 - b1^t)           // bias correction
    v_hat  = v / (1 - b2^t)
    x      = x - lr * m_hat / (sqrt(v_hat) + eps)
import numpy as np

def momentum(grad_fn, x0, lr=0.01, beta=0.9, n=500):
    x, v = np.array(x0, float), np.zeros_like(x0, float)
    for _ in range(n):
        g = grad_fn(x)
        v = beta * v + g
        x = x - lr * v
    return x

def adagrad(grad_fn, x0, lr=0.1, eps=1e-8, n=500):
    x, G = np.array(x0, float), np.zeros_like(x0, float)
    for _ in range(n):
        g = grad_fn(x)
        G = G + g * g
        x = x - lr * g / (np.sqrt(G) + eps)
    return x

def adam(grad_fn, x0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, n=500):
    x = np.array(x0, float)
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, n + 1):
        g     = grad_fn(x)
        m     = b1 * m + (1 - b1) * g
        v     = b2 * v + (1 - b2) * (g * g)
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        x     = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Test on f(x,y) = x^2 + 10y^2  (ill-conditioned)
grad_fn = lambda x: np.array([2*x[0], 20*x[1]])
x0 = [3.0, 2.0]
print("Momentum:", momentum(grad_fn, x0))
print("Adagrad: ", adagrad(grad_fn, x0))
print("Adam:    ", adam(grad_fn, x0))
function momentum(gradFn, x0, lr=0.01, beta=0.9, n=500) {
  let x = [...x0], v = x0.map(() => 0);
  for (let t = 0; t < n; t++) {
    const g = gradFn(x);
    v = v.map((vi, i) => beta * vi + g[i]);
    x = x.map((xi, i) => xi - lr * v[i]);
  }
  return x;
}

function adagrad(gradFn, x0, lr=0.1, eps=1e-8, n=500) {
  let x = [...x0], G = x0.map(() => 0);
  for (let t = 0; t < n; t++) {
    const g = gradFn(x);
    G = G.map((Gi, i) => Gi + g[i]*g[i]);
    x = x.map((xi, i) => xi - lr * g[i] / (Math.sqrt(G[i]) + eps));
  }
  return x;
}

function adam(gradFn, x0, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, n=500) {
  let x = [...x0];
  let m = x0.map(() => 0), v = x0.map(() => 0);
  for (let t = 1; t <= n; t++) {
    const g    = gradFn(x);
    m = m.map((mi, i) => b1*mi + (1-b1)*g[i]);
    v = v.map((vi, i) => b2*vi + (1-b2)*g[i]*g[i]);
    const mh = m.map(mi => mi / (1 - Math.pow(b1, t)));
    const vh = v.map(vi => vi / (1 - Math.pow(b2, t)));
    x = x.map((xi, i) => xi - lr * mh[i] / (Math.sqrt(vh[i]) + eps));
  }
  return x;
}

const gradFn = ([x, y]) => [2*x, 20*y];
const x0 = [3, 2];
console.log("Momentum:", momentum(gradFn, x0));
console.log("Adam:    ", adam(gradFn, x0));
#include <stdio.h>
#include <math.h>
#include <string.h>

#define N 2

void grad_fn(const double *x, double *g) { g[0]=2*x[0]; g[1]=20*x[1]; }

void run_adam(double *x, double lr, double b1, double b2, double eps, int steps) {
    double m[N]={0}, v[N]={0}, g[N];
    for (int t=1; t<=steps; t++) {
        grad_fn(x, g);
        for (int i=0; i<N; i++) {
            m[i] = b1*m[i] + (1-b1)*g[i];
            v[i] = b2*v[i] + (1-b2)*g[i]*g[i];
            double mh = m[i] / (1 - pow(b1,t));
            double vh = v[i] / (1 - pow(b2,t));
            x[i] -= lr * mh / (sqrt(vh) + eps);
        }
    }
}

void run_momentum(double *x, double lr, double beta, int steps) {
    double v[N]={0}, g[N];
    for (int t=0; t<steps; t++) {
        grad_fn(x, g);
        for (int i=0; i<N; i++) { v[i]=beta*v[i]+g[i]; x[i]-=lr*v[i]; }
    }
}

int main(void) {
    double xa[N]={3,2}, xb[N]={3,2};
    run_momentum(xa, 0.01, 0.9, 500);
    run_adam(xb, 1e-3, 0.9, 0.999, 1e-8, 500);
    printf("Momentum: [%.4f, %.4f]\n", xa[0], xa[1]);
    printf("Adam:     [%.4f, %.4f]\n", xb[0], xb[1]);
    return 0;
}
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>

using Vec = std::vector<double>;
using GradFn = std::function<Vec(const Vec&)>;

Vec adam(GradFn gFn, Vec x, double lr=1e-3, double b1=0.9, double b2=0.999,
         double eps=1e-8, int n=500) {
    Vec m(x.size(), 0), v(x.size(), 0);
    for (int t = 1; t <= n; t++) {
        Vec g = gFn(x);
        for (size_t i = 0; i < x.size(); i++) {
            m[i] = b1*m[i] + (1-b1)*g[i];
            v[i] = b2*v[i] + (1-b2)*g[i]*g[i];
            double mh = m[i]/(1-std::pow(b1,t));
            double vh = v[i]/(1-std::pow(b2,t));
            x[i] -= lr * mh / (std::sqrt(vh) + eps);
        }
    }
    return x;
}

int main() {
    auto gFn = [](const Vec& x) -> Vec { return {2*x[0], 20*x[1]}; };
    Vec res = adam(gFn, {3.0, 2.0});
    std::cout << "Adam: [" << res[0] << ", " << res[1] << "]\n";
}
public class SGDVariants {
    static double[] gradFn(double[] x) { return new double[]{2*x[0], 20*x[1]}; }

    static double[] adam(double[] x0, double lr, double b1, double b2, double eps, int n) {
        double[] x = x0.clone(), m = new double[x.length], v = new double[x.length];
        for (int t = 1; t <= n; t++) {
            double[] g = gradFn(x);
            for (int i = 0; i < x.length; i++) {
                m[i] = b1*m[i] + (1-b1)*g[i];
                v[i] = b2*v[i] + (1-b2)*g[i]*g[i];
                double mh = m[i] / (1 - Math.pow(b1, t));
                double vh = v[i] / (1 - Math.pow(b2, t));
                x[i] -= lr * mh / (Math.sqrt(vh) + eps);
            }
        }
        return x;
    }

    static double[] momentum(double[] x0, double lr, double beta, int n) {
        double[] x = x0.clone(), vel = new double[x.length];
        for (int t = 0; t < n; t++) {
            double[] g = gradFn(x);
            for (int i = 0; i < x.length; i++) { vel[i]=beta*vel[i]+g[i]; x[i]-=lr*vel[i]; }
        }
        return x;
    }

    public static void main(String[] args) {
        double[] a = adam(new double[]{3,2}, 1e-3, 0.9, 0.999, 1e-8, 500);
        double[] b = momentum(new double[]{3,2}, 0.01, 0.9, 500);
        System.out.printf("Adam:     [%.4f, %.4f]%n", a[0], a[1]);
        System.out.printf("Momentum: [%.4f, %.4f]%n", b[0], b[1]);
    }
}
package main

import (
	"fmt"
	"math"
)

func gradFn(x []float64) []float64 { return []float64{2 * x[0], 20 * x[1]} }

func adamOpt(x []float64, lr, b1, b2, eps float64, n int) []float64 {
	m := make([]float64, len(x))
	v := make([]float64, len(x))
	for t := 1; t <= n; t++ {
		g := gradFn(x)
		for i := range x {
			m[i] = b1*m[i] + (1-b1)*g[i]
			v[i] = b2*v[i] + (1-b2)*g[i]*g[i]
			mh := m[i] / (1 - math.Pow(b1, float64(t)))
			vh := v[i] / (1 - math.Pow(b2, float64(t)))
			x[i] -= lr * mh / (math.Sqrt(vh) + eps)
		}
	}
	return x
}

func momentumOpt(x []float64, lr, beta float64, n int) []float64 {
	vel := make([]float64, len(x))
	for t := 0; t < n; t++ {
		g := gradFn(x)
		for i := range x { vel[i] = beta*vel[i] + g[i]; x[i] -= lr * vel[i] }
	}
	return x
}

func main() {
	fmt.Println("Adam:    ", adamOpt([]float64{3, 2}, 1e-3, 0.9, 0.999, 1e-8, 500))
	fmt.Println("Momentum:", momentumOpt([]float64{3, 2}, 0.01, 0.9, 500))
}

8. Stochastic gradient descent — when computing the gradient is too expensive

So far we've assumed you can compute the full gradient $\nabla f(\mathbf{x})$ at every step. In ML, $f$ is usually a sum over millions of training examples:

$$f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} f_i(\mathbf{x})$$

Empirical risk

$f$
The training loss, averaged over the dataset.
$N$
The number of training examples — could be $10^9$.
$f_i$
The loss on the $i$-th individual example. For classification this might be a cross-entropy term, for regression a squared error.

Computing $\nabla f$ requires going through all $N$ examples — too expensive for modern datasets. Instead, you sample a small minibatch $B \subset \{1, \dots, N\}$ at each step and use:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla f_i(\mathbf{x}_t)$$

Stochastic gradient descent (SGD)

$B$
A minibatch — a small random subset of training indices, usually 32 to 4096 in size.
$|B|$
The number of examples in the minibatch.
$\nabla f_i$
The gradient contribution from a single example. Cheap to compute.
$\frac{1}{|B|} \sum$
The average gradient on the minibatch — an unbiased noisy estimate of the true full-batch gradient.

Analogy. Instead of polling every voter to find the average opinion (full-batch GD), poll a random sample of 100 (SGD). Your estimate is noisy, but you get a useful direction in a tiny fraction of the time. Across many steps the noise averages out.

The shocking empirical fact: SGD doesn't just work, it often works better than full-batch GD on neural networks. Two reasons:

  1. Wall-clock speedup. A single SGD step is hundreds of times cheaper than a full-batch step, and the estimated gradient is good enough.
  2. Implicit regularization. The noise in SGD helps escape sharp minima and saddle points. Recent theory (Keskar 2017, Smith et al. 2020) shows SGD has a bias toward "flat" minima, which generalize better than sharp ones.

Convergence-wise, SGD is sublinear: with an appropriately decreasing step size it achieves about $O(1/\sqrt{t})$ on convex problems, and $O(1/t)$ on strongly convex ones; with a constant step size it makes fast progress at first but then hovers in a noise ball around the optimum. That's slower per iteration than full-batch GD, but each iteration costs vastly less. In ML you usually combine SGD with momentum and per-coordinate EMA scaling — which is essentially Adam.

Source — Stochastic Gradient Descent
// Mini-batch SGD on squared loss: f(w) = (1/N) * sum_i (w*x_i - y_i)^2
// grad of f_i w.r.t. w = 2 * (w*x_i - y_i) * x_i

function mini_batch_sgd(X, y, w0, lr, batch_size, epochs):
    w = w0
    N = len(X)
    for epoch = 1 to epochs:
        shuffle(X, y)
        for b = 0 to N step batch_size:
            X_b = X[b : b+batch_size]
            y_b = y[b : b+batch_size]
            grad = (2/batch_size) * X_b^T * (X_b * w - y_b)
            w = w - lr * grad
    return w
import numpy as np

def mini_batch_sgd(X, y, lr=0.01, batch_size=32, epochs=50):
    """
    Mini-batch SGD on squared loss  f(w) = ||Xw - y||^2 / N.
    X : (N, d) feature matrix
    y : (N,)   targets
    Returns weight vector w of shape (d,).
    """
    N, d = X.shape
    w = np.zeros(d)
    for epoch in range(epochs):
        perm = np.random.permutation(N)
        X_s, y_s = X[perm], y[perm]
        for b in range(0, N, batch_size):
            Xb = X_s[b:b + batch_size]
            yb = y_s[b:b + batch_size]
            grad = (2 / len(yb)) * Xb.T @ (Xb @ w - yb)
            w -= lr * grad
    return w

# Synthetic data: y = 3*x1 - 2*x2 + noise
rng = np.random.default_rng(42)
N, d = 500, 2
X = rng.standard_normal((N, d))
w_true = np.array([3.0, -2.0])
y = X @ w_true + 0.1 * rng.standard_normal(N)

w_hat = mini_batch_sgd(X, y, lr=0.05, batch_size=32, epochs=100)
print("True w:", w_true)
print("Est  w:", w_hat.round(4))
# => close to [3.0, -2.0]
function miniBatchSGD(X, y, lr=0.01, batchSize=32, epochs=50) {
  const N = X.length, d = X[0].length;
  let w = new Array(d).fill(0);

  for (let ep = 0; ep < epochs; ep++) {
    // Fisher-Yates shuffle (sorting by Math.random() - 0.5 is biased)
    const perm = Array.from({length: N}, (_, i) => i);
    for (let i = N - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [perm[i], perm[j]] = [perm[j], perm[i]];
    }

    for (let b = 0; b < N; b += batchSize) {
      const idx = perm.slice(b, b + batchSize);
      const bs  = idx.length;

      // grad = (2/bs) * X_b^T * (X_b @ w - y_b)
      const grad = new Array(d).fill(0);
      for (const i of idx) {
        let pred = 0;
        for (let j = 0; j < d; j++) pred += X[i][j] * w[j];
        const err = pred - y[i];
        for (let j = 0; j < d; j++) grad[j] += (2 / bs) * X[i][j] * err;
      }
      for (let j = 0; j < d; j++) w[j] -= lr * grad[j];
    }
  }
  return w;
}

// Quick test
const X = [[1,2],[3,4],[5,6],[7,8]];
const y = [5, 11, 17, 23]; // w = [1, 2] exactly
const w = miniBatchSGD(X, y, 0.01, 2, 200);
console.log("w:", w.map(v => v.toFixed(3)));
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define N_SAMPLES 100
#define N_FEATURES 2

// Mini-batch SGD: squared loss, batch_size=1 (stochastic)
void sgd(double X[][N_FEATURES], double *y, double *w,
         double lr, int epochs) {
    double grad[N_FEATURES];
    for (int ep = 0; ep < epochs; ep++) {
        for (int i = 0; i < N_SAMPLES; i++) {
            double pred = 0;
            for (int j = 0; j < N_FEATURES; j++) pred += X[i][j] * w[j];
            double err = pred - y[i];
            for (int j = 0; j < N_FEATURES; j++)
                w[j] -= lr * 2 * X[i][j] * err;
        }
    }
}

int main(void) {
    double X[N_SAMPLES][N_FEATURES], y[N_SAMPLES];
    double w_true[N_FEATURES] = {3.0, -2.0};
    srand(42);
    for (int i = 0; i < N_SAMPLES; i++) {
        X[i][0] = ((double)rand()/RAND_MAX) * 4 - 2;
        X[i][1] = ((double)rand()/RAND_MAX) * 4 - 2;
        y[i] = w_true[0]*X[i][0] + w_true[1]*X[i][1];
    }
    double w[N_FEATURES] = {0, 0};
    sgd(X, y, w, 0.01, 50);
    printf("w = [%.4f, %.4f]\n", w[0], w[1]);
    return 0;
}
#include <iostream>
#include <vector>
#include <random>
#include <numeric>
#include <algorithm>   // std::shuffle, std::min

using Vec = std::vector<double>;

Vec sgd(const std::vector<Vec>& X, const Vec& y,
        double lr=0.01, int batchSize=32, int epochs=50) {
    int N = X.size(), d = X[0].size();
    Vec w(d, 0.0);
    std::mt19937 rng(42);

    for (int ep = 0; ep < epochs; ep++) {
        std::vector<int> idx(N); std::iota(idx.begin(), idx.end(), 0);
        std::shuffle(idx.begin(), idx.end(), rng);

        for (int b = 0; b < N; b += batchSize) {
            int bs = std::min(batchSize, N - b);
            Vec grad(d, 0.0);
            for (int k = b; k < b + bs; k++) {
                int i  = idx[k];
                double pred = std::inner_product(X[i].begin(), X[i].end(), w.begin(), 0.0);
                double err  = pred - y[i];
                for (int j = 0; j < d; j++) grad[j] += (2.0/bs) * X[i][j] * err;
            }
            for (int j = 0; j < d; j++) w[j] -= lr * grad[j];
        }
    }
    return w;
}

int main() {
    // Tiny dataset: y = 1*x + 2
    std::vector<Vec> X = {{1},{2},{3},{4},{5}};
    Vec y = {3, 4, 5, 6, 7};
    Vec w = sgd(X, y, 0.01, 2, 200);
    std::cout << "w = " << w[0] << "\n"; // ~1.55: no intercept term, so the least-squares slope absorbs part of the +2 offset
}
import java.util.*;

public class MiniBatchSGD {
    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < a.length; i++) s += a[i]*b[i]; return s;
    }

    static double[] sgd(double[][] X, double[] y, double lr, int batchSize, int epochs) {
        int N = X.length, d = X[0].length;
        double[] w = new double[d];
        Random rng = new Random(42);
        Integer[] idx = new Integer[N];
        for (int i = 0; i < N; i++) idx[i] = i;

        for (int ep = 0; ep < epochs; ep++) {
            Collections.shuffle(Arrays.asList(idx), rng);
            for (int b = 0; b < N; b += batchSize) {
                int bs = Math.min(batchSize, N - b);
                double[] grad = new double[d];
                for (int k = b; k < b + bs; k++) {
                    int i = idx[k];
                    double err = dot(X[i], w) - y[i];
                    for (int j = 0; j < d; j++) grad[j] += (2.0/bs) * X[i][j] * err;
                }
                for (int j = 0; j < d; j++) w[j] -= lr * grad[j];
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] X = {{1,1},{2,1},{3,1},{4,1},{5,1}};
        double[] y   = {3, 5, 7, 9, 11}; // w = [2, 1]
        double[] w   = sgd(X, y, 0.01, 2, 300);
        System.out.printf("w = [%.4f, %.4f]%n", w[0], w[1]);
    }
}
package main

import (
	"fmt"
	"math/rand"
)

func sgd(X [][]float64, y []float64, lr float64, batchSize, epochs int) []float64 {
	N, d := len(X), len(X[0])
	w := make([]float64, d)
	rng := rand.New(rand.NewSource(42))

	for ep := 0; ep < epochs; ep++ {
		perm := rng.Perm(N)
		for b := 0; b < N; b += batchSize {
			end := b + batchSize
			if end > N { end = N }
			bs := float64(end - b)
			grad := make([]float64, d)
			for _, i := range perm[b:end] {
				pred := 0.0
				for j := 0; j < d; j++ { pred += X[i][j] * w[j] }
				err := pred - y[i]
				for j := 0; j < d; j++ { grad[j] += (2 / bs) * X[i][j] * err }
			}
			for j := 0; j < d; j++ { w[j] -= lr * grad[j] }
		}
	}
	return w
}

func main() {
	X := [][]float64{{1, 1}, {2, 1}, {3, 1}, {4, 1}, {5, 1}}
	y := []float64{3, 5, 7, 9, 11}
	w := sgd(X, y, 0.01, 2, 300)
	fmt.Printf("w = [%.4f, %.4f]\n", w[0], w[1])
}

9. Newton's method and quasi-Newton

Gradient descent uses only first-order information. If you also have the Hessian, you can do much better. Newton's method takes a Newton step:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \big[\nabla^2 f(\mathbf{x}_t)\big]^{-1} \, \nabla f(\mathbf{x}_t)$$

Newton's method

$\nabla^2 f(\mathbf{x}_t)$
The Hessian at the current iterate — an $n \times n$ symmetric matrix of second derivatives.
$[\,\cdot\,]^{-1}$
The inverse of that matrix. In practice you don't form the inverse — you solve the linear system $H \mathbf{p} = -\nabla f$ for the search direction $\mathbf{p}$.
$\nabla f(\mathbf{x}_t)$
The gradient at the current iterate.

Picture. GD uses the local slope to pick a direction. Newton uses the local curvature to also pick the right step length — pretending $f$ is a perfect quadratic and jumping straight to the minimum of that quadratic. If the function really is quadratic, you arrive in one step.

Convergence is dazzling: locally quadratic — the number of correct digits doubles every iteration. Globally (with safeguards like a line search), Newton beats first-order methods by orders of magnitude on smooth problems. The catch is twofold:

  1. Forming and factoring the Hessian costs $O(n^2)$ memory and $O(n^3)$ time. For $n = 10^9$ ML weights that's on the order of $10^{27}$ operations per step — infeasible by many orders of magnitude.
  2. The Hessian might not be positive definite. At a saddle point, Newton steps can move you uphill, so damping or trust regions are needed as safeguards.
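A minimal sketch of what those safeguards can look like: Levenberg-style damping when the Hessian solve fails or yields a non-descent direction, plus a crude backtracking line search on the gradient norm. This is a toy illustration we constructed, not a production trust-region method:

```python
import numpy as np

def damped_newton(grad_fn, hess_fn, x0, tol=1e-10, max_iter=100):
    x = np.array(x0, float)
    for _ in range(max_iter):
        g = grad_fn(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess_fn(x)
        lam = 0.0
        while True:
            try:
                # Damped Newton direction: solve (H + lam*I) p = -g
                p = np.linalg.solve(H + lam * np.eye(len(x)), -g)
            except np.linalg.LinAlgError:
                p = -g                      # singular system: fall back to GD
            if g @ p < 0:                   # descent direction found
                break
            lam = 10 * lam + 1e-3           # damp harder and retry
        t = 1.0                             # backtrack until the gradient shrinks
        while np.linalg.norm(grad_fn(x + t * p)) >= np.linalg.norm(g) and t > 1e-8:
            t *= 0.5
        x = x + t * p
    return x

# On the bowl f(x,y) = x^2 + y^2 + xy (positive definite Hessian), lam stays 0
# and this reduces to plain Newton: one exact step to the minimum.
grad = lambda z: np.array([2*z[0] + z[1], 2*z[1] + z[0]])
hess = lambda z: np.array([[2.0, 1.0], [1.0, 2.0]])
print(damped_newton(grad, hess, [3.0, -2.0]))
```

As `lam` grows, the direction bends from the Newton step toward plain gradient descent — the same interpolation a trust region performs implicitly.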

Quasi-Newton — BFGS and L-BFGS

Quasi-Newton methods build a cheap approximation $B_t \approx \nabla^2 f(\mathbf{x}_t)$ from observed gradient differences. The most famous is BFGS (Broyden, Fletcher, Goldfarb, Shanno, 1970), which updates an $n \times n$ approximation each step. For large $n$, the limited-memory version L-BFGS keeps only the last $\sim 10$ gradient differences and reconstructs the Hessian action on the fly. L-BFGS is the go-to optimizer for deterministic, smooth problems with thousands to a million variables — not as good as Adam for noisy ML, but a beast for classical statistics, image registration, and physics inverse problems.
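With SciPy, running L-BFGS is one call. A quick sketch on the Rosenbrock function — a standard smooth, non-convex test problem with a narrow curved valley (our choice of test function, not one from the text above):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # Rosenbrock: minimum at (1, 1), notoriously ill-conditioned valley.
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

res = minimize(rosen, x0=[-1.2, 1.0], jac=rosen_grad, method="L-BFGS-B")
print(res.x)   # converges to the minimizer [1, 1]
```

Passing the analytic gradient via `jac` avoids finite-difference estimates; with a million variables that difference is the whole ballgame.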

Source — Newton's Method
// 1D Newton: find zero of f'(x) (i.e., minimise f)
function newton_1d(f_grad, f_hess, x0, tol, max_iter):
    x = x0
    for i = 1 to max_iter:
        g = f_grad(x)
        h = f_hess(x)
        if |g| < tol: break
        x = x - g / h
    return x

// 2D/nD Newton: solve H * p = -grad, then x = x + p
function newton_nd(grad_fn, hess_fn, x0, tol, max_iter):
    x = x0
    for i = 1 to max_iter:
        g = grad_fn(x)
        if norm(g) < tol: break
        H = hess_fn(x)
        p = solve(H, -g)   // linear system H*p = -g
        x = x + p
    return x
import numpy as np

# ---- 1D Newton ----
def newton_1d(f_grad, f_hess, x0, tol=1e-10, max_iter=50):
    x = float(x0)
    for _ in range(max_iter):
        g = f_grad(x)
        h = f_hess(x)
        if abs(g) < tol:
            break
        x -= g / h
    return x

# Example: minimize f(x) = x^4 - 3x^3 + 2
def f_grad_1d(x): return 4*x**3 - 9*x**2
def f_hess_1d(x): return 12*x**2 - 18*x
print("1D Newton:", newton_1d(f_grad_1d, f_hess_1d, x0=4.0))
# => 2.25 (local minimum)

# ---- nD Newton ----
def newton_nd(grad_fn, hess_fn, x0, tol=1e-10, max_iter=50):
    x = np.array(x0, float)
    for _ in range(max_iter):
        g = grad_fn(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess_fn(x)
        p = np.linalg.solve(H, -g)   # H * p = -g
        x = x + p
    return x

# Minimize f(x,y) = x^2 + y^2 + xy  (bowl)
def grad_2d(x): return np.array([2*x[0]+x[1], 2*x[1]+x[0]])
def hess_2d(x): return np.array([[2, 1], [1, 2]])
print("2D Newton:", newton_nd(grad_2d, hess_2d, x0=[3.0, -2.0]))
# => [0., 0.]  (exact in one step — quadratic function)
// 1D Newton
function newton1D(fGrad, fHess, x0, tol=1e-10, maxIter=50) {
  let x = x0;
  for (let i = 0; i < maxIter; i++) {
    const g = fGrad(x), h = fHess(x);
    if (Math.abs(g) < tol) break;
    x -= g / h;
  }
  return x;
}

// 2x2 linear solve: A*p = b (Cramer's rule)
function solve2x2([a,b,c,d], [e,f]) {
  const det = a*d - b*c;
  return [(e*d - b*f)/det, (a*f - e*c)/det];
}

// 2D Newton for f(x,y)=x^2+y^2+xy
function newton2D(x0, y0, tol=1e-10, maxIter=50) {
  let [x, y] = [x0, y0];
  for (let i = 0; i < maxIter; i++) {
    const g = [2*x+y, 2*y+x];
    if (Math.hypot(...g) < tol) break;
    const [px, py] = solve2x2([2,1,1,2], [-g[0], -g[1]]);
    x += px; y += py;
  }
  return [x, y];
}

console.log("1D:", newton1D(x => 4*x**3-9*x**2, x => 12*x**2-18*x, 4));
console.log("2D:", newton2D(3, -2));
#include <stdio.h>
#include <math.h>

double f_grad_1d(double x) { return 4*x*x*x - 9*x*x; }
double f_hess_1d(double x) { return 12*x*x - 18*x; }

double newton_1d(double x0, double tol, int max_iter) {
    double x = x0;
    for (int i = 0; i < max_iter; i++) {
        double g = f_grad_1d(x);
        if (fabs(g) < tol) break;
        x -= g / f_hess_1d(x);
    }
    return x;
}

// 2D Newton for f(x,y)=x^2+y^2+xy
void newton_2d(double *px, double *py, double tol, int max_iter) {
    double x = *px, y = *py;
    for (int i = 0; i < max_iter; i++) {
        double gx = 2*x+y, gy = 2*y+x;
        if (sqrt(gx*gx + gy*gy) < tol) break;
        // H = [[2,1],[1,2]], det = 3
        double det = 3.0;
        double dx = (2*(-gx) - 1*(-gy)) / det;
        double dy = (2*(-gy) - 1*(-gx)) / det;
        x += dx; y += dy;
    }
    *px = x; *py = y;
}

int main(void) {
    printf("1D Newton: %.8f\n", newton_1d(4.0, 1e-10, 50));
    double x = 3.0, y = -2.0;
    newton_2d(&x, &y, 1e-10, 50);
    printf("2D Newton: [%.8f, %.8f]\n", x, y);
    return 0;
}
#include <iostream>
#include <vector>
#include <functional>
#include <cmath>

using Vec = std::vector<double>;

double newton1D(std::function<double(double)> g,
                std::function<double(double)> h,
                double x0, double tol=1e-10, int n=50) {
    double x = x0;
    for (int i = 0; i < n; i++) {
        double gv = g(x);
        if (std::abs(gv) < tol) break;
        x -= gv / h(x);
    }
    return x;
}

// 2D Newton (hardcoded H = [[2,1],[1,2]])
std::pair<double,double> newton2D(double x0, double y0, double tol=1e-10, int n=50) {
    double x = x0, y = y0;
    for (int i = 0; i < n; i++) {
        double gx = 2*x+y, gy = 2*y+x;
        if (std::sqrt(gx*gx+gy*gy) < tol) break;
        double det = 3.0;
        x += (2*(-gx) - 1*(-gy)) / det;
        y += (2*(-gy) - 1*(-gx)) / det;
    }
    return {x, y};
}

int main() {
    double x1 = newton1D([](double x){ return 4*x*x*x-9*x*x; },
                         [](double x){ return 12*x*x-18*x; }, 4.0);
    std::cout << "1D: " << x1 << "\n";
    auto [x2, y2] = newton2D(3.0, -2.0);
    std::cout << "2D: [" << x2 << ", " << y2 << "]\n";
}
public class NewtonMethod {
    static double fGrad1D(double x) { return 4*x*x*x - 9*x*x; }
    static double fHess1D(double x) { return 12*x*x - 18*x; }

    static double newton1D(double x0, double tol, int maxIter) {
        double x = x0;
        for (int i = 0; i < maxIter; i++) {
            double g = fGrad1D(x);
            if (Math.abs(g) < tol) break;
            x -= g / fHess1D(x);
        }
        return x;
    }

    // 2D Newton for f(x,y)=x^2+y^2+xy, H=[[2,1],[1,2]]
    static double[] newton2D(double x0, double y0, double tol, int maxIter) {
        double x = x0, y = y0;
        for (int i = 0; i < maxIter; i++) {
            double gx = 2*x+y, gy = 2*y+x;
            if (Math.sqrt(gx*gx+gy*gy) < tol) break;
            double det = 3.0;
            x += (2*(-gx) - 1*(-gy)) / det;
            y += (2*(-gy) - 1*(-gx)) / det;
        }
        return new double[]{x, y};
    }

    public static void main(String[] args) {
        System.out.printf("1D Newton: %.8f%n", newton1D(4.0, 1e-10, 50));
        double[] r = newton2D(3.0, -2.0, 1e-10, 50);
        System.out.printf("2D Newton: [%.8f, %.8f]%n", r[0], r[1]);
    }
}
package main

import (
	"fmt"
	"math"
)

func newton1D(x0, tol float64, maxIter int) float64 {
	x := x0
	for i := 0; i < maxIter; i++ {
		g := 4*x*x*x - 9*x*x
		if math.Abs(g) < tol { break }
		h := 12*x*x - 18*x
		x -= g / h
	}
	return x
}

// 2D Newton for f(x,y) = x^2 + y^2 + xy
func newton2D(x0, y0, tol float64, maxIter int) (float64, float64) {
	x, y := x0, y0
	for i := 0; i < maxIter; i++ {
		gx, gy := 2*x+y, 2*y+x
		if math.Sqrt(gx*gx+gy*gy) < tol { break }
		det := 3.0
		x += (2*(-gx) - 1*(-gy)) / det
		y += (2*(-gy) - 1*(-gx)) / det
	}
	return x, y
}

func main() {
	fmt.Printf("1D Newton: %.8f\n", newton1D(4.0, 1e-10, 50))
	x, y := newton2D(3.0, -2.0, 1e-10, 50)
	fmt.Printf("2D Newton: [%.8f, %.8f]\n", x, y)
}

10. Constrained optimization — Lagrange multipliers

What if you have constraints? Then the gradient of $f$ doesn't have to be zero at the optimum — the constraint can hold you back. Imagine pushing a ball into a wall: the wall pushes back, balancing your push. The optimum is where those two forces cancel.

Lagrange multipliers turn that idea into algebra. For a problem with equality constraints only:

$$\min_\mathbf{x} \; f(\mathbf{x}) \quad \text{subject to} \quad g_i(\mathbf{x}) = 0, \; i = 1, \dots, m$$

Equality-constrained problem

$f$
Objective.
$g_i(\mathbf{x}) = 0$
The $i$-th equality constraint — forces $\mathbf{x}$ onto a surface in $\mathbb{R}^n$.
$m$
Number of equality constraints. The feasible set is the intersection of $m$ surfaces, generically a $(n-m)$-dimensional manifold.

Define the Lagrangian:

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) - \sum_{i=1}^m \lambda_i \, g_i(\mathbf{x})$$

The Lagrangian

$\mathcal{L}$
The Lagrangian — a single function of both the original variables $\mathbf{x}$ and new multiplier variables $\boldsymbol{\lambda}$.
$\lambda_i$
The Lagrange multiplier for the $i$-th constraint — a scalar that acts as a "price" on violating that constraint.
$\boldsymbol{\lambda}$
The vector $(\lambda_1, \dots, \lambda_m)$ of all multipliers.
$- \sum_i \lambda_i g_i(\mathbf{x})$
A penalty term: zero on the feasible set, but pulling $\mathbf{x}$ around when constraints are violated. (Some authors use $+$; both conventions exist, only signs of $\lambda$ change.)

Story. Joseph-Louis Lagrange invented this in 1788 to do Newtonian mechanics under constraints. The multipliers turned out to be physical: for a ball on a string, $\lambda$ is literally the tension in the string. In economics, multipliers are "shadow prices" — the marginal value of relaxing a constraint by one unit.

The fundamental theorem of Lagrange multipliers: at a constrained optimum, there exists a $\boldsymbol{\lambda}^\star$ such that:

$$\nabla_\mathbf{x} \mathcal{L}(\mathbf{x}^\star, \boldsymbol{\lambda}^\star) = \mathbf{0}, \qquad g_i(\mathbf{x}^\star) = 0 \;\; \forall i$$

Lagrange stationarity

$\nabla_\mathbf{x} \mathcal{L}$
The gradient of the Lagrangian with respect to $\mathbf{x}$ only (treating $\boldsymbol{\lambda}$ as fixed).
$\nabla_\mathbf{x} \mathcal{L} = \mathbf{0}$
Equivalent to $\nabla f = \sum_i \lambda_i \nabla g_i$ — the gradient of $f$ is a linear combination of the constraint gradients. Geometrically: $\nabla f$ is perpendicular to the constraint surface, so there's no direction along the surface that decreases $f$.
$g_i(\mathbf{x}^\star) = 0$
Primal feasibility — the constraints actually hold.

Geometric picture. The level curves of $f$ touch the constraint surface tangentially at the optimum — they don't cross it. If they crossed, you could move along the surface and decrease $f$.

Worked example

Minimize $f(x, y) = x^2 + y^2$ subject to $x + y = 1$. The Lagrangian is

$$\mathcal{L}(x, y, \lambda) = x^2 + y^2 - \lambda (x + y - 1)$$

Worked-example Lagrangian

$x, y$
The two decision variables.
$\lambda$
One multiplier (one constraint).
$x + y - 1$
The constraint rewritten in the form $g(x,y) = 0$.

Setting partials to zero: $2x - \lambda = 0$, $2y - \lambda = 0$, $x + y = 1$. So $x = y = \tfrac{1}{2}$, and $\lambda = 1$. The minimum value of $f$ is $\tfrac{1}{2}$. The multiplier $\lambda = 1$ tells you something physical: if you relaxed the constraint to $x + y = 1 + \delta$, the optimal value would change by approximately $\lambda \, \delta = \delta$. (Check: minimizing over $x + y = t$ gives $p^\star(t) = t^2/2$, and $\tfrac{d}{dt}p^\star(1) = 1 = \lambda$.) Multipliers are the sensitivity of the optimum to the constraint.
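That sensitivity claim is easy to check numerically. A small sketch using SciPy's SLSQP that re-solves the problem with the right-hand side nudged from $1$ to $1 + \delta$ (the helper name `opt_value` is illustrative):

```python
from scipy.optimize import minimize

def opt_value(t):
    """Minimize x^2 + y^2 subject to x + y = t (via SLSQP)."""
    res = minimize(lambda v: v[0]**2 + v[1]**2, x0=[0.0, 0.0],
                   constraints={"type": "eq", "fun": lambda v: v[0] + v[1] - t},
                   method="SLSQP")
    return res.fun

delta = 0.01
print(opt_value(1.0))                                      # 0.5
print((opt_value(1.0 + delta) - opt_value(1.0)) / delta)   # ~1.0 = lambda
```

The finite-difference ratio lands near $1.005$ rather than exactly $1$ because $p^\star(t) = t^2/2$ is curved; as $\delta \to 0$ it converges to $\lambda = 1$.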

Source — Lagrange Multipliers
// Problem: minimize f(x1, x2) = x1^2 + x2^2
//          subject to  g(x1, x2) = x1 + x2 - 1 = 0
//
// Lagrangian:  L(x1, x2, lam) = x1^2 + x2^2 - lam*(x1+x2-1)
//
// KKT stationarity:
//   dL/dx1 = 2*x1 - lam = 0  =>  x1 = lam/2
//   dL/dx2 = 2*x2 - lam = 0  =>  x2 = lam/2
//   dL/dlam = -(x1+x2-1) = 0  =>  x1+x2 = 1
//
// Substituting: lam/2 + lam/2 = 1  =>  lam = 1
//               x1 = x2 = 1/2
//
// Analytic solve (matrix form):
//   [2  0  -1] [x1]   [0]
//   [0  2  -1] [x2] = [0]
//   [1  1   0] [lam]  [1]
// Solve the 3x3 system => x1=0.5, x2=0.5, lam=1
import numpy as np
from scipy.optimize import minimize, root

# ---- Analytic: solve KKT system as a linear system ----
# Variables z = [x1, x2, lam]
# System: [2,0,-1; 0,2,-1; 1,1,0] * z = [0,0,1]
A = np.array([[2, 0, -1],
              [0, 2, -1],
              [1, 1,  0]], dtype=float)
b = np.array([0.0, 0.0, 1.0])
z = np.linalg.solve(A, b)
print("KKT solution (x1, x2, lam) =", z)
# => [0.5, 0.5, 1.0]

# ---- Numerical: solve grad(L)=0 via scipy.root ----
def lagrangian_grad(z):
    x1, x2, lam = z
    return [2*x1 - lam,       # dL/dx1
            2*x2 - lam,       # dL/dx2
            -(x1 + x2 - 1)]   # dL/dlam = -g

sol = root(lagrangian_grad, x0=[0, 0, 0])
print("root result:", sol.x.round(6))
# => [0.5, 0.5, 1.0]

# ---- scipy.optimize.minimize with explicit constraint ----
res = minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[0.0, 0.0],
    constraints={"type": "eq", "fun": lambda v: v[0] + v[1] - 1},
    method="SLSQP",
)
print("SLSQP:", res.x.round(6), "  f =", round(res.fun, 6))
# => [0.5, 0.5]   f = 0.5
// Solve the 3x3 KKT system via Gaussian elimination
// [2,0,-1; 0,2,-1; 1,1,0] * [x1,x2,lam] = [0,0,1]
function solve3x3(M, b) {
  // augmented matrix [M | b]
  const A = M.map((row, i) => [...row, b[i]]);
  for (let col = 0; col < 3; col++) {
    let pivot = col;
    for (let r = col+1; r < 3; r++)
      if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
    [A[col], A[pivot]] = [A[pivot], A[col]];
    for (let r = col+1; r < 3; r++) {
      const f = A[r][col] / A[col][col];
      for (let k = col; k <= 3; k++) A[r][k] -= f * A[col][k];
    }
  }
  const x = [0,0,0];
  for (let i = 2; i >= 0; i--) {
    x[i] = A[i][3];
    for (let j = i+1; j < 3; j++) x[i] -= A[i][j]*x[j];
    x[i] /= A[i][i];
  }
  return x;
}

const M = [[2,0,-1],[0,2,-1],[1,1,0]];
const b = [0, 0, 1];
const [x1, x2, lam] = solve3x3(M, b);
console.log(`x1=${x1}, x2=${x2}, lam=${lam}`);
// => x1=0.5, x2=0.5, lam=1
#include <stdio.h>

// Gaussian elimination for 3x3 system
void solve3x3(double A[3][4], double *x) {
    for (int col = 0; col < 3; col++) {
        int pivot = col;
        for (int r = col+1; r < 3; r++)
            if (A[r][col]*A[r][col] > A[pivot][col]*A[pivot][col]) pivot = r;
        double tmp[4];
        for (int k=0;k<4;k++){tmp[k]=A[col][k];A[col][k]=A[pivot][k];A[pivot][k]=tmp[k];}
        for (int r = col+1; r < 3; r++) {
            double f = A[r][col] / A[col][col];
            for (int k = col; k < 4; k++) A[r][k] -= f * A[col][k];
        }
    }
    for (int i = 2; i >= 0; i--) {
        x[i] = A[i][3];
        for (int j = i+1; j < 3; j++) x[i] -= A[i][j]*x[j];
        x[i] /= A[i][i];
    }
}

int main(void) {
    // KKT: [2,0,-1; 0,2,-1; 1,1,0]*[x1;x2;lam]=[0;0;1]
    double A[3][4] = {{2,0,-1,0},{0,2,-1,0},{1,1,0,1}};
    double z[3];
    solve3x3(A, z);
    printf("x1=%.4f  x2=%.4f  lam=%.4f\n", z[0], z[1], z[2]);
    return 0;
}
#include <iostream>
#include <array>
#include <cmath>
#include <utility>   // std::swap

// Solve 3x3 via Gaussian elimination
std::array<double,3> solve3x3(std::array<std::array<double,4>,3> A) {
    for (int col = 0; col < 3; col++) {
        int pivot = col;
        for (int r = col+1; r < 3; r++)
            if (std::abs(A[r][col]) > std::abs(A[pivot][col])) pivot = r;
        std::swap(A[col], A[pivot]);
        for (int r = col+1; r < 3; r++) {
            double f = A[r][col] / A[col][col];
            for (int k = col; k < 4; k++) A[r][k] -= f * A[col][k];
        }
    }
    std::array<double,3> x{};
    for (int i = 2; i >= 0; i--) {
        x[i] = A[i][3];
        for (int j = i+1; j < 3; j++) x[i] -= A[i][j]*x[j];
        x[i] /= A[i][i];
    }
    return x;
}

int main() {
    auto z = solve3x3({{{2,0,-1,0},{0,2,-1,0},{1,1,0,1}}});
    std::cout << "x1=" << z[0] << "  x2=" << z[1] << "  lam=" << z[2] << "\n";
}
public class LagrangeKKT {
    // Gaussian elimination for Ax=b, A is n x n+1 augmented
    static double[] gaussElim(double[][] A) {
        int n = A.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col+1; r < n; r++)
                if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
            double[] tmp = A[col]; A[col] = A[pivot]; A[pivot] = tmp;
            for (int r = col+1; r < n; r++) {
                double f = A[r][col] / A[col][col];
                for (int k = col; k <= n; k++) A[r][k] -= f * A[col][k];
            }
        }
        double[] x = new double[n];
        for (int i = n-1; i >= 0; i--) {
            x[i] = A[i][n];
            for (int j = i+1; j < n; j++) x[i] -= A[i][j]*x[j];
            x[i] /= A[i][i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] A = {{2,0,-1,0},{0,2,-1,0},{1,1,0,1}};
        double[] z = gaussElim(A);
        System.out.printf("x1=%.4f  x2=%.4f  lam=%.4f%n", z[0], z[1], z[2]);
    }
}
package main

import (
	"fmt"
	"math"
)

// Gaussian elimination: A is [n][n+1] augmented matrix
func gaussElim(A [3][4]float64) [3]float64 {
	for col := 0; col < 3; col++ {
		pivot := col
		for r := col + 1; r < 3; r++ {
			if math.Abs(A[r][col]) > math.Abs(A[pivot][col]) { pivot = r }
		}
		A[col], A[pivot] = A[pivot], A[col]
		for r := col + 1; r < 3; r++ {
			f := A[r][col] / A[col][col]
			for k := col; k < 4; k++ { A[r][k] -= f * A[col][k] }
		}
	}
	var x [3]float64
	for i := 2; i >= 0; i-- {
		x[i] = A[i][3]
		for j := i + 1; j < 3; j++ { x[i] -= A[i][j] * x[j] }
		x[i] /= A[i][i]
	}
	return x
}

func main() {
	A := [3][4]float64{{2,0,-1,0},{0,2,-1,0},{1,1,0,1}}
	z := gaussElim(A)
	fmt.Printf("x1=%.4f  x2=%.4f  lam=%.4f\n", z[0], z[1], z[2])
}

11. KKT conditions — Lagrange for inequalities

Lagrange multipliers handle equality constraints. For inequalities, you need the generalization due to Karush (1939) and to Kuhn and Tucker (1951). For the problem

$$\min_\mathbf{x} f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) = 0, \; h_j(\mathbf{x}) \leq 0$$

General constrained problem

$g_i, h_j$
Equality and inequality constraint functions.
s.t.
"Subject to."

define a Lagrangian with two sets of multipliers:

$$\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) = f(\mathbf{x}) + \sum_i \lambda_i g_i(\mathbf{x}) + \sum_j \mu_j h_j(\mathbf{x})$$

KKT Lagrangian

$\lambda_i$
Multiplier for the $i$-th equality. Sign unrestricted.
$\mu_j$
Multiplier for the $j$-th inequality. Required to be non-negative.
$g_i, h_j$
The constraint functions.

The KKT conditions: at any local optimum (under mild regularity), there exist multipliers $\boldsymbol{\lambda}^\star, \boldsymbol{\mu}^\star$ such that

$$\begin{aligned} & \nabla f(\mathbf{x}^\star) + \sum_i \lambda_i^\star \nabla g_i(\mathbf{x}^\star) + \sum_j \mu_j^\star \nabla h_j(\mathbf{x}^\star) = \mathbf{0} && \text{(stationarity)} \\ & g_i(\mathbf{x}^\star) = 0, \quad h_j(\mathbf{x}^\star) \leq 0 && \text{(primal feasibility)} \\ & \mu_j^\star \geq 0 && \text{(dual feasibility)} \\ & \mu_j^\star \, h_j(\mathbf{x}^\star) = 0 && \text{(complementary slackness)} \end{aligned}$$

The KKT system

Stationarity
The gradient of the Lagrangian in $\mathbf{x}$ vanishes — same idea as Lagrange, generalized.
Primal feasibility
The constraints actually hold at $\mathbf{x}^\star$.
Dual feasibility
Inequality multipliers are non-negative. Otherwise the multiplier would be "paying you" to violate the constraint — nonsensical.
Complementary slackness
For each inequality, either the constraint is active ($h_j = 0$, the constraint is "tight") or its multiplier is zero ($\mu_j = 0$, the constraint is irrelevant). One of the two has to be true.

Picture. Imagine boxes pushed toward a wall. A box pressed against the wall has a non-zero contact force ($\mu > 0$, $h = 0$). A box nowhere near the wall has zero contact force ($\mu = 0$, $h < 0$). Complementary slackness says: contact force times distance to the wall is always zero. There's no third option.

For convex problems with smooth functions and a regularity condition (called Slater's condition), KKT is both necessary and sufficient — solving the KKT system gives you the global optimum. For non-convex problems KKT is only necessary, so you can use it to verify a candidate but not to find one from scratch.

KKT in disguise

You've already seen KKT, even if you didn't recognize it. The dual formulation of a Support Vector Machine drops out of the KKT conditions. The "active set" methods used in industrial QP solvers are KKT-driven. The "shadow prices" in linear programming are KKT multipliers. Once you spot the pattern, every constrained problem looks like a KKT problem.

Source — KKT Conditions
// Verify KKT conditions for a solved QP:
//   minimize   (1/2) x^T Q x + c^T x
//   subject to  A x <= b
// (fold bounds like x >= 0 into A as extra rows -x <= 0, so one
//  multiplier vector mu covers every inequality)
//
// KKT conditions:
//   1. Stationarity:    Q*x + c + A^T*mu = 0
//   2. Primal feas.:    A*x <= b
//   3. Dual feas.:      mu >= 0
//   4. Comp. slackness: mu_j * (A_j^T x - b_j) = 0  for each j

function verify_kkt(Q, c, A, b, x, mu, tol):
    stat  = norm(Q*x + c + A^T*mu) < tol
    prim  = all(A*x <= b + tol)
    dual  = all(mu >= -tol)
    slack = all(abs(mu * (A*x - b)) < tol)
    return stat and prim and dual and slack
import numpy as np

def verify_kkt(Q, c, A, b, x, mu, tol=1e-6):
    """
    Verify KKT conditions for:
        min  (1/2) x^T Q x + c^T x
        s.t. A x <= b

    Q : (n,n) PSD matrix
    c : (n,) cost vector
    A : (m,n) constraint matrix
    b : (m,) RHS vector
    x : (n,) primal solution
    mu: (m,) dual variables (KKT multipliers for inequalities)
    """
    # 1. Stationarity: grad_x L = Q x + c + A^T mu = 0
    stationarity = np.linalg.norm(Q @ x + c + A.T @ mu) < tol

    # 2. Primal feasibility: A x <= b
    prim_feas = np.all(A @ x <= b + tol)

    # 3. Dual feasibility: mu >= 0
    dual_feas = np.all(mu >= -tol)

    # 4. Complementary slackness: mu_j * (A_j x - b_j) = 0
    slack = A @ x - b              # should be <= 0
    comp_slack = np.all(np.abs(mu * slack) < tol)

    ok = stationarity and prim_feas and dual_feas and comp_slack
    print(f"Stationarity:        {stationarity}")
    print(f"Primal feasibility:  {prim_feas}")
    print(f"Dual feasibility:    {dual_feas}")
    print(f"Complementary slack: {comp_slack}")
    print(f"KKT satisfied:       {ok}")
    return ok

# Example: min x1^2 + x2^2  s.t.  x1 + x2 >= 1,  x1, x2 >= 0
# (Without the first constraint the minimum would be the origin, so
#  the constraint is active at the optimum.)
# Solution: x* = [0.5, 0.5], multiplier mu* = 1 on the active constraint.
Q  = np.eye(2) * 2          # objective (1/2) x^T (2I) x = x1^2 + x2^2
c  = np.zeros(2)
A  = np.array([[-1., -1.],  # -(x1 + x2) <= -1   (i.e. x1 + x2 >= 1)
               [-1.,  0.],  # -x1 <= 0           (i.e. x1 >= 0)
               [ 0., -1.]]) # -x2 <= 0
b  = np.array([-1., 0., 0.])
x  = np.array([0.5, 0.5])
mu = np.array([1.0, 0.0, 0.0])  # only the first constraint is active

verify_kkt(Q, c, A, b, x, mu)
// Verify KKT for min x1^2+x2^2 s.t. x1+x2 >= 1, x1,x2 >= 0
// x* = [0.5, 0.5], mu* = [1, 0, 0]

function dot(a, b)    { return a.reduce((s,ai,i) => s + ai*b[i], 0); }
function norm(v)      { return Math.sqrt(dot(v, v)); }
function matVec(M, v) { return M.map(row => dot(row, v)); }
function transpose(M) { return M[0].map((_, j) => M.map(row => row[j])); }
function vecAdd(a, b) { return a.map((ai, i) => ai + b[i]); }

function verifyKKT(Q, c, A, b, x, mu, tol=1e-6) {
  const n = x.length, m = mu.length;
  const Qx = matVec(Q, x);
  const At = transpose(A);
  const Atmu = matVec(At, mu);
  const stat_vec = vecAdd(vecAdd(Qx, c), Atmu);
  const stationarity = norm(stat_vec) < tol;

  const Ax = matVec(A, x);
  const primFeas = Ax.every((v, i) => v <= b[i] + tol);
  const dualFeas = mu.every(m => m >= -tol);
  const compSlack = Ax.every((v, j) => Math.abs(mu[j] * (v - b[j])) < tol);

  console.log({stationarity, primFeas, dualFeas, compSlack,
               "KKT satisfied": stationarity && primFeas && dualFeas && compSlack});
}

const Q  = [[2,0],[0,2]];
const c  = [0, 0];
const A  = [[-1,-1],[-1,0],[0,-1]];  // -(x1+x2) <= -1 encodes x1+x2 >= 1
const b  = [-1, 0, 0];
const x  = [0.5, 0.5];
const mu = [1.0, 0.0, 0.0];
verifyKKT(Q, c, A, b, x, mu);
#include <stdio.h>
#include <math.h>

#define N 2
#define M 3

// Q=2I, c=0, A=[[-1,-1],[-1,0],[0,-1]], b=[-1,0,0]
// (encodes x1+x2 >= 1, x1 >= 0, x2 >= 0); x=[0.5,0.5], mu=[1,0,0]

int main(void) {
    double Q[N][N] = {{2,0},{0,2}};
    double c[N]    = {0, 0};
    double A[M][N] = {{-1,-1},{-1,0},{0,-1}};
    double b[M]    = {-1, 0, 0};
    double x[N]    = {0.5, 0.5};
    double mu[M]   = {1.0, 0.0, 0.0};
    double tol     = 1e-6;

    // 1. Stationarity: Q*x + c + A^T*mu
    double stat[N] = {0};
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) stat[i] += Q[i][j]*x[j];
        stat[i] += c[i];
        for (int j = 0; j < M; j++) stat[i] += A[j][i]*mu[j];
    }
    double sn = sqrt(stat[0]*stat[0]+stat[1]*stat[1]);
    printf("Stationarity norm: %.2e  ok=%d\n", sn, sn < tol);

    // 2. Primal feas: A*x <= b
    int pf = 1;
    for (int j = 0; j < M; j++) {
        double v = 0; for (int i=0;i<N;i++) v+=A[j][i]*x[i];
        if (v > b[j]+tol) pf=0;
    }
    printf("Primal feasible: %d\n", pf);

    // 3. Dual feas + comp slack
    int df=1, cs=1;
    for (int j=0;j<M;j++) {
        double v=0; for(int i=0;i<N;i++) v+=A[j][i]*x[i];
        if (mu[j] < -tol) df=0;
        if (fabs(mu[j]*(v-b[j])) > tol) cs=0;
    }
    printf("Dual feasible: %d, Comp slack: %d\n", df, cs);
    return 0;
}
#include <iostream>
#include <vector>
#include <cmath>
#include <numeric>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

double dot(const Vec& a, const Vec& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

bool verifyKKT(const Mat& Q, const Vec& c,
               const Mat& A, const Vec& b,
               const Vec& x, const Vec& mu, double tol=1e-6) {
    int n=x.size(), m=mu.size();
    // Stationarity
    double sn=0;
    for (int i=0;i<n;i++) {
        double s=c[i];
        for (int j=0;j<n;j++) s+=Q[i][j]*x[j];
        for (int j=0;j<m;j++) s+=A[j][i]*mu[j];
        sn+=s*s;
    }
    bool stat = std::sqrt(sn) < tol;
    bool pf=true, df=true, cs=true;
    for (int j=0;j<m;j++){
        double v=dot(A[j],x);
        if (v > b[j]+tol) pf=false;
        if (mu[j] < -tol) df=false;
        if (std::abs(mu[j]*(v-b[j])) > tol) cs=false;
    }
    std::cout<<"Stationarity:"<<stat<<" Primal:"<<pf
              <<" Dual:"<<df<<" CompSlack:"<<cs<<"\n";
    return stat&&pf&&df&&cs;
}

int main() {
    Mat Q={{2,0},{0,2}};
    Vec c={0,0};
    Mat A={{-1,-1},{-1,0},{0,-1}};  // encodes x1+x2 >= 1, x >= 0
    Vec b={-1,0,0}, x={0.5,0.5}, mu={1,0,0};
    verifyKKT(Q,c,A,b,x,mu);
}
public class KKTVerify {
    static double dot(double[] a, double[] b) {
        double s=0; for(int i=0;i<a.length;i++) s+=a[i]*b[i]; return s;
    }
    static boolean verify(double[][] Q, double[] c, double[][] A, double[] b,
                          double[] x, double[] mu, double tol) {
        int n=x.length, m=mu.length;
        double sn=0;
        for(int i=0;i<n;i++){
            double s=c[i];
            for(int j=0;j<n;j++) s+=Q[i][j]*x[j];
            for(int j=0;j<m;j++) s+=A[j][i]*mu[j];
            sn+=s*s;
        }
        boolean stat = Math.sqrt(sn) < tol;
        boolean pf=true, df=true, cs=true;
        for(int j=0;j<m;j++){
            double v=dot(A[j],x);
            if(v>b[j]+tol) pf=false;
            if(mu[j]<-tol) df=false;
            if(Math.abs(mu[j]*(v-b[j]))>tol) cs=false;
        }
        System.out.printf("stat=%b prim=%b dual=%b compSlack=%b%n",stat,pf,df,cs);
        return stat&&pf&&df&&cs;
    }
    public static void main(String[] a){
        verify(new double[][]{{2,0},{0,2}}, new double[]{0,0},
               new double[][]{{-1,-1},{-1,0},{0,-1}}, new double[]{-1,0,0},
               new double[]{0.5,0.5}, new double[]{1,0,0}, 1e-6);
    }
}
package main

import (
	"fmt"
	"math"
)

func dot(a, b []float64) float64 {
	s := 0.0; for i := range a { s += a[i]*b[i] }; return s
}

func verifyKKT(Q, A [][]float64, c, b, x, mu []float64, tol float64) bool {
	n, m := len(x), len(mu)
	sn := 0.0
	for i := 0; i < n; i++ {
		s := c[i]
		for j := 0; j < n; j++ { s += Q[i][j] * x[j] }
		for j := 0; j < m; j++ { s += A[j][i] * mu[j] }
		sn += s * s
	}
	stat := math.Sqrt(sn) < tol
	pf, df, cs := true, true, true
	for j := 0; j < m; j++ {
		v := dot(A[j], x)
		if v > b[j]+tol { pf = false }
		if mu[j] < -tol  { df = false }
		if math.Abs(mu[j]*(v-b[j])) > tol { cs = false }
	}
	fmt.Printf("stat=%v prim=%v dual=%v compSlack=%v\n", stat, pf, df, cs)
	return stat && pf && df && cs
}

func main() {
	Q  := [][]float64{{2,0},{0,2}}
	c  := []float64{0, 0}
	A  := [][]float64{{-1,-1},{-1,0},{0,-1}} // encodes x1+x2 >= 1, x >= 0
	b  := []float64{-1, 0, 0}
	x  := []float64{0.5, 0.5}
	mu := []float64{1, 0, 0}
	verifyKKT(Q, A, c, b, x, mu, 1e-6)
}

12. Linear programming — the workhorse of operations research

A linear program (LP) is the special case where everything is linear. The standard form:

$$\min_\mathbf{x} \; \mathbf{c}^\top \mathbf{x} \quad \text{s.t.} \quad A \mathbf{x} = \mathbf{b}, \quad \mathbf{x} \geq 0$$

LP standard form

$\mathbf{c}$
Cost vector — coefficients of the linear objective.
$\mathbf{c}^\top \mathbf{x}$
The dot product $\sum_i c_i x_i$ — a linear scalar function of $\mathbf{x}$.
$A$
An $m \times n$ matrix encoding the linear equality constraints.
$\mathbf{b}$
An $m$-vector of right-hand sides.
$\mathbf{x} \geq 0$
Componentwise non-negativity — every $x_i$ must be $\geq 0$.

Why this form. Every LP can be massaged into this form by adding "slack variables" to turn inequalities into equalities and splitting unrestricted variables into a positive and negative part. Solvers expect this form, so it's the lingua franca.
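The massaging is mechanical enough to show in a few lines. A sketch (matrices are illustrative, borrowed from the small LP used later in this section) that appends one slack variable per inequality to turn $A\mathbf{x} \leq \mathbf{b}$ into equalities:

```python
import numpy as np

# Inequality form:  min c^T x  s.t.  A x <= b, x >= 0
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([-1.0, -1.0])

# Standard form: A x + s = b with s >= 0, one slack per row
m, n = A.shape
A_std = np.hstack([A, np.eye(m)])          # [A | I], now an equality system
c_std = np.concatenate([c, np.zeros(m)])   # slacks cost nothing

print(A_std)
# [[1. 2. 1. 0.]
#  [3. 1. 0. 1.]]
```

Any $\mathbf{x}$ with $A\mathbf{x} \leq \mathbf{b}$ corresponds to $(\mathbf{x}, \mathbf{s})$ with $\mathbf{s} = \mathbf{b} - A\mathbf{x} \geq 0$; a free variable is handled the same way by writing it as $x^+ - x^-$ with both parts non-negative.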

The feasible region of an LP is a polytope — the intersection of finitely many half-spaces. It looks like a faceted gemstone in $n$ dimensions. The objective $\mathbf{c}^\top \mathbf{x}$ is a linear function, which means its level sets are flat hyperplanes sliding across the polytope. The minimum, if it exists, is attained at a vertex of the polytope (or along an edge in degenerate cases). This is the geometric foundation of LP and the reason the entire field works.

The simplex method

Dantzig's simplex method (1947) walks from vertex to neighboring vertex along edges of the polytope, always picking an edge that improves the objective. When no improving edge exists, you're done. In the worst case it can visit exponentially many vertices, but in practice it's fast on real problems — one of the great empirical mysteries of computational math, partly explained by smoothed analysis (Spielman & Teng, 2001).
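The vertex walk fits in a few dozen lines. This is a bare-bones dense-tableau simplex for $\max \mathbf{c}^\top\mathbf{x}$ s.t. $A\mathbf{x} \leq \mathbf{b}$, $\mathbf{x} \geq 0$ with $\mathbf{b} \geq 0$ — a teaching sketch with no anti-cycling rule, not a production solver:

```python
import numpy as np

def simplex_max(c, A, b):
    """Tableau simplex: max c^T x  s.t.  A x <= b, x >= 0  (requires b >= 0)."""
    m, n = A.shape
    # Tableau [A | I | b] with the objective row [-c | 0 | 0] at the bottom.
    T = np.zeros((m + 1, n + m + 1))
    T[:m, :n], T[:m, n:n+m], T[:m, -1] = A, np.eye(m), b
    T[-1, :n] = -c
    basis = list(range(n, n + m))        # slacks start basic: the origin vertex
    while True:
        j = int(np.argmin(T[-1, :-1]))   # entering column: most negative cost
        if T[-1, j] >= -1e-12:
            break                        # no improving edge -> optimal vertex
        col = T[:m, j]
        pos = col > 1e-12
        if not pos.any():
            raise ValueError("LP is unbounded")
        ratios = np.full(m, np.inf)
        ratios[pos] = T[:m, -1][pos] / col[pos]
        i = int(np.argmin(ratios))       # leaving row: minimum ratio test
        T[i] /= T[i, j]                  # pivot: walk to the adjacent vertex
        for r in range(m + 1):
            if r != i:
                T[r] -= T[r, j] * T[i]
        basis[i] = j
    x = np.zeros(n + m)
    for row, var in enumerate(basis):
        x[var] = T[row, -1]
    return x[:n], T[-1, -1]

x, obj = simplex_max(np.array([1.0, 1.0]),
                     np.array([[1.0, 2.0], [3.0, 1.0]]),
                     np.array([4.0, 6.0]))
print("x =", x.round(4), " obj =", round(obj, 4))   # x = [1.6 1.2]  obj = 2.8
```

Each pivot moves the basic feasible solution from one vertex of the polytope to a neighbor that improves the objective — exactly the walk described above.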

Interior-point methods

Karmarkar's 1984 method (and the modern primal-dual interior-point methods that descended from it) take a completely different approach: instead of walking the boundary, drift through the interior of the polytope toward the optimum, following a smooth "central path." Provably polynomial-time. Modern solvers (HiGHS, Gurobi, Mosek) use both methods and pick whichever is faster on the given problem.
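A minimal sketch of the barrier idea (not production interior-point code): minimize $t\,\mathbf{c}^\top\mathbf{x} - \sum_i \log(b_i - \mathbf{a}_i^\top \mathbf{x})$ with damped Newton steps, then crank up $t$; the iterates trace the central path from the middle of the polytope into the optimal vertex:

```python
import numpy as np

# max x1 + x2  s.t.  x1 + 2*x2 <= 4,  3*x1 + x2 <= 6,  x >= 0
# written as:  min c^T x  s.t.  A x <= b
A = np.array([[1., 2.], [3., 1.], [-1., 0.], [0., -1.]])
b = np.array([4., 6., 0., 0.])
c = np.array([-1., -1.])

def barrier_val(x, t):
    s = b - A @ x
    return np.inf if np.any(s <= 0) else t * (c @ x) - np.log(s).sum()

def newton_center(x, t, iters=50):
    """Damped Newton on the log-barrier objective for fixed t."""
    for _ in range(iters):
        s = b - A @ x
        g = t * c + A.T @ (1.0 / s)         # gradient of the barrier objective
        H = A.T @ np.diag(1.0 / s**2) @ A   # its Hessian
        dx = np.linalg.solve(H, -g)
        step, f0 = 1.0, barrier_val(x, t)
        # halve the step until we stay strictly feasible and descend
        while barrier_val(x + step * dx, t) > f0 and step > 1e-12:
            step *= 0.5
        x = x + step * dx
    return x

x = np.array([0.5, 0.5])          # strictly interior starting point
for t in [1, 10, 100, 1000, 10000]:
    x = newton_center(x, t)
    print(f"t={t:>5}: x = {x.round(4)}")
# x drifts through the interior toward the optimal vertex (1.6, 1.2)
```

As $t$ grows, the barrier's pull away from the boundary weakens and the central-path point approaches the LP optimum; standard theory bounds the suboptimality by (number of constraints)$/t$.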

LP duality — a free second problem

Every LP has a dual LP. For the standard form above, the dual is:

$$\max_\mathbf{y} \; \mathbf{b}^\top \mathbf{y} \quad \text{s.t.} \quad A^\top \mathbf{y} \leq \mathbf{c}$$

LP dual

$\mathbf{y}$
Dual variables — one per primal constraint. They're exactly the KKT multipliers of the primal.
$\mathbf{b}^\top \mathbf{y}$
Dual objective — coefficients are the primal RHS.
$A^\top \mathbf{y} \leq \mathbf{c}$
Dual constraints — one per primal variable.

Why duals matter. Strong duality for LP says the optimal primal and dual values are equal. So the dual gives you a free certificate: any dual-feasible $\mathbf{y}$ gives a lower bound on the primal optimum, which means that at any point during a solve, the best dual bound tells you the most you could still gain by continuing. This bounding trick is the basis of branch-and-bound for integer programs.
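You can watch strong duality happen numerically. A sketch with `scipy.optimize.linprog`: the small two-variable LP from this section in standard form (slacks added), solved alongside its dual — the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Primal (standard form): min c^T x  s.t.  A x = b, x >= 0
# This is "max x1 + x2  s.t.  x1+2x2 <= 4, 3x1+x2 <= 6" with slacks s1, s2.
c = np.array([-1.0, -1.0, 0.0, 0.0])
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [3.0, 1.0, 0.0, 1.0]])
b = np.array([4.0, 6.0])
primal = linprog(c, A_eq=A, b_eq=b, method="highs")

# Dual: max b^T y  s.t.  A^T y <= c, y free  (linprog minimizes, so negate)
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2, method="highs")

print("primal optimum:", round(primal.fun, 4))   # -2.8
print("dual optimum:  ", round(-dual.fun, 4))    # -2.8  -- strong duality
print("dual y:", dual.x.round(4))                # the shadow prices
```

The dual variables returned here are exactly the shadow prices on the two resource constraints: relax $b_1$ or $b_2$ by one unit and the optimal objective moves by the corresponding component of $\mathbf{y}$.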

Source — Linear Program (Simplex)
// 2-variable LP (maximize c^T x  s.t. Ax <= b, x >= 0)
// Simplex: walk vertices of feasible polytope toward better objective.

// EXAMPLE:
//   max  x1 + x2
//   s.t. x1 + 2*x2 <= 4
//        3*x1 +  x2 <= 6
//        x1, x2 >= 0
//
// Vertices: (0,0), (2,0), (8/5, 6/5), (0,2)
// Objective values: 0, 2, 14/5, 2
// Optimum: x* = (8/5, 6/5), obj* = 14/5 = 2.8

// Generic simplex sketch (standard min form — negate c for the max example above):
//   Maintain basis B (indices of basic variables)
//   while any reduced cost c_j - c_B^T B^-1 A_j < 0:
//       pick entering variable j (most negative reduced cost)
//       compute direction, find leaving variable (minimum ratio test)
//       pivot: swap entering/leaving in B
//   return x (basic feasible solution)
from scipy.optimize import linprog
import numpy as np

# ---- The diet problem — a classic LP ----
# Minimize cost of diet meeting nutritional minimums.
# Foods: bread, milk, cheese, potato, fish, yogurt
c = np.array([2.0, 3.5, 8.0, 1.5, 11.0, 3.0])   # cost per serving

# Nutrition per serving: (calories, protein_g, calcium_mg)
nutrition = np.array([
    [90,  4,   15],   # bread
    [120, 8,   300],  # milk
    [110, 7,   200],  # cheese
    [160, 3,   10],   # potato
    [200, 22,  10],   # fish
    [150, 10,  450],  # yogurt
])

# Want at least: 2000 cal, 55g protein, 800mg calcium
# linprog expects A_ub @ x <= b_ub, so negate:
A_ub = -nutrition.T
b_ub = -np.array([2000.0, 55.0, 800.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
print("Servings:", res.x.round(3))
print("Daily cost: $", round(res.fun, 2))
# Shadow prices = KKT multipliers (how much cost changes per unit nutrient)
print("Shadow prices:", res.ineqlin.marginals.round(4))

# ---- Simple 2-variable LP ----
# max x1 + x2  s.t.  x1+2x2<=4,  3x1+x2<=6
# (linprog minimizes, so negate objective)
c2   = [-1.0, -1.0]
A_ub2 = [[1, 2], [3, 1]]
b_ub2 = [4.0, 6.0]
res2  = linprog(c2, A_ub=A_ub2, b_ub=b_ub2, bounds=[(0,None),(0,None)])
print("2-var LP: x =", res2.x.round(4), "  obj =", -round(res2.fun, 4))
// 2-variable LP solved by enumerating vertices (feasible for tiny problems)
// max x1 + x2  s.t. x1+2*x2 <= 4, 3*x1+x2 <= 6, x1,x2 >= 0

function lp2var(c, A, b) {
  // Enumerate intersections of constraint pairs (and axes)
  const n = A.length;
  let best = null, bestVal = -Infinity;

  const checkPt = (x1, x2) => {
    if (x1 < -1e-9 || x2 < -1e-9) return;
    for (let i = 0; i < n; i++)
      if (A[i][0]*x1 + A[i][1]*x2 > b[i] + 1e-9) return;
    const val = c[0]*x1 + c[1]*x2;
    if (val > bestVal) { bestVal = val; best = [x1, x2]; }
  };

  checkPt(0, 0);
  // Axes
  for (let i = 0; i < n; i++) {
    if (Math.abs(A[i][0]) > 1e-9) checkPt(b[i]/A[i][0], 0);
    if (Math.abs(A[i][1]) > 1e-9) checkPt(0, b[i]/A[i][1]);
  }
  // Constraint-constraint intersections
  for (let i = 0; i < n; i++)
    for (let j = i+1; j < n; j++) {
      const det = A[i][0]*A[j][1] - A[i][1]*A[j][0];
      if (Math.abs(det) < 1e-9) continue;
      const x1 = (b[i]*A[j][1] - b[j]*A[i][1]) / det;
      const x2 = (A[i][0]*b[j] - A[j][0]*b[i]) / det;
      checkPt(x1, x2);
    }
  return { x: best, obj: bestVal };
}

const c = [1, 1];
const A = [[1, 2], [3, 1]];
const b = [4, 6];
const { x, obj } = lp2var(c, A, b);
console.log("x =", x.map(v => v.toFixed(4)), "  obj =", obj.toFixed(4));
// => x = [1.6, 1.2]   obj = 2.8
#include <stdio.h>
#include <math.h>
#include <float.h>

#define NCON 2

// max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6, x1,x2>=0
// Enumerate vertices

static double A[NCON][2] = {{1,2},{3,1}};
static double b[NCON]    = {4, 6};

int feasible(double x1, double x2) {
    if (x1 < -1e-9 || x2 < -1e-9) return 0;
    for (int i = 0; i < NCON; i++)
        if (A[i][0]*x1 + A[i][1]*x2 > b[i] + 1e-9) return 0;
    return 1;
}

void check(double x1, double x2, double *best, double *bx, double *by) {
    if (!feasible(x1, x2)) return;
    double v = x1 + x2;
    if (v > *best) { *best = v; *bx = x1; *by = x2; }
}

int main(void) {
    double best = -DBL_MAX, bx = 0, by = 0;
    check(0, 0, &best, &bx, &by);
    for (int i = 0; i < NCON; i++) {
        if (fabs(A[i][0]) > 1e-9) check(b[i]/A[i][0], 0,            &best, &bx, &by);
        if (fabs(A[i][1]) > 1e-9) check(0,             b[i]/A[i][1], &best, &bx, &by);
    }
    for (int i = 0; i < NCON; i++)
        for (int j = i+1; j < NCON; j++) {
            double det = A[i][0]*A[j][1] - A[i][1]*A[j][0];
            if (fabs(det) < 1e-9) continue;
            double x1 = (b[i]*A[j][1] - b[j]*A[i][1]) / det;
            double x2 = (A[i][0]*b[j] - A[j][0]*b[i]) / det;
            check(x1, x2, &best, &bx, &by);
        }
    printf("x = [%.4f, %.4f]  obj = %.4f\n", bx, by, best);
    return 0;
}
#include <iostream>
#include <vector>
#include <array>
#include <limits>
#include <cmath>

// max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6, x1,x2>=0
// Enumerate vertices

struct LP2 {
    std::vector<std::array<double,2>> A;
    std::vector<double> b;

    bool feasible(double x1, double x2) const {
        if (x1 < -1e-9 || x2 < -1e-9) return false;
        for (size_t i = 0; i < A.size(); i++)
            if (A[i][0]*x1+A[i][1]*x2 > b[i]+1e-9) return false;
        return true;
    }

    std::pair<double,double> solve() const {
        double best=-std::numeric_limits<double>::infinity(), bx=0, by=0;
        auto chk = [&](double x1,double x2){
            if (!feasible(x1,x2)) return;
            if (x1+x2 > best) { best=x1+x2; bx=x1; by=x2; }
        };
        chk(0,0);
        int n=A.size();
        for(int i=0;i<n;i++){
            if(std::abs(A[i][0])>1e-9) chk(b[i]/A[i][0],0);
            if(std::abs(A[i][1])>1e-9) chk(0,b[i]/A[i][1]);
        }
        for(int i=0;i<n;i++) for(int j=i+1;j<n;j++){
            double det=A[i][0]*A[j][1]-A[i][1]*A[j][0];
            if(std::abs(det)<1e-9) continue;
            double x1=(b[i]*A[j][1]-b[j]*A[i][1])/det;
            double x2=(A[i][0]*b[j]-A[j][0]*b[i])/det;
            chk(x1,x2);
        }
        return {bx, by};
    }
};

int main() {
    LP2 lp{ {{1,2},{3,1}}, {4,6} };
    auto [x1,x2] = lp.solve();
    std::cout << "x=[" << x1 << "," << x2 << "]  obj=" << x1+x2 << "\n";
}
public class LP2Var {
    static double[][] A = {{1,2},{3,1}};
    static double[] b   = {4, 6};

    static boolean feasible(double x1, double x2) {
        if (x1 < -1e-9 || x2 < -1e-9) return false;
        for (int i = 0; i < A.length; i++)
            if (A[i][0]*x1 + A[i][1]*x2 > b[i]+1e-9) return false;
        return true;
    }

    static double[] bestSol = {0, 0};
    static double bestVal = Double.NEGATIVE_INFINITY;

    static void eval(double x1, double x2) {
        if (!feasible(x1, x2)) return;
        double v = x1 + x2;
        if (v > bestVal) { bestVal = v; bestSol[0] = x1; bestSol[1] = x2; }
    }

    static void solve() {
        eval(0, 0);
        // axis intercepts
        for (int i = 0; i < A.length; i++) {
            if (Math.abs(A[i][0]) > 1e-9) eval(b[i]/A[i][0], 0);
            if (Math.abs(A[i][1]) > 1e-9) eval(0, b[i]/A[i][1]);
        }
        // pairwise constraint intersections
        for (int i = 0; i < A.length; i++) for (int j = i+1; j < A.length; j++) {
            double det = A[i][0]*A[j][1] - A[i][1]*A[j][0];
            if (Math.abs(det) < 1e-9) continue;
            eval((b[i]*A[j][1] - b[j]*A[i][1]) / det,
                 (A[i][0]*b[j] - A[j][0]*b[i]) / det);
        }
    }

    public static void main(String[] args) {
        solve();
        System.out.printf("x=[%.4f,%.4f]  obj=%.4f%n",bestSol[0],bestSol[1],bestVal);
    }
}
package main

import (
	"fmt"
	"math"
)

var A = [2][2]float64{{1, 2}, {3, 1}}
var b = [2]float64{4, 6}

func feasible(x1, x2 float64) bool {
	if x1 < -1e-9 || x2 < -1e-9 { return false }
	for i := range A {
		if A[i][0]*x1+A[i][1]*x2 > b[i]+1e-9 { return false }
	}
	return true
}

func solveLp() (float64, float64) {
	best, bx, by := math.Inf(-1), 0.0, 0.0
	chk := func(x1, x2 float64) {
		if !feasible(x1, x2) { return }
		if v := x1 + x2; v > best { best = v; bx = x1; by = x2 }
	}
	chk(0, 0)
	for i := range A {
		if math.Abs(A[i][0]) > 1e-9 { chk(b[i]/A[i][0], 0) }
		if math.Abs(A[i][1]) > 1e-9 { chk(0, b[i]/A[i][1]) }
	}
	for i := range A {
		for j := i + 1; j < len(A); j++ {
			det := A[i][0]*A[j][1] - A[i][1]*A[j][0]
			if math.Abs(det) < 1e-9 { continue }
			x1 := (b[i]*A[j][1] - b[j]*A[i][1]) / det
			x2 := (A[i][0]*b[j] - A[j][0]*b[i]) / det
			chk(x1, x2)
		}
	}
	return bx, by
}

func main() {
	x1, x2 := solveLp()
	fmt.Printf("x=[%.4f,%.4f]  obj=%.4f\n", x1, x2, x1+x2)
}

13. Integer programming — LP gets hard

Most operations problems aren't continuous: you can't run half a flight, hire 0.7 nurses, or build 2.3 warehouses. Integer programming is LP with the extra constraint that some or all variables are integers:

$$\min_\mathbf{x} \; \mathbf{c}^\top \mathbf{x} \quad \text{s.t.} \quad A \mathbf{x} \leq \mathbf{b}, \quad \mathbf{x} \in \mathbb{Z}^n$$

Integer linear program

$\mathbb{Z}^n$
The set of $n$-vectors of integers.
otherwise
Same as LP.

Adding integrality is a small notational change with a big complexity blowup: integer programming is NP-hard, even for binary variables ($x_i \in \{0, 1\}$). The boundary between LP (poly-time) and IP (NP-hard) is the cleanest example of how a "small" change can move a problem from "easy" to "the universe might end before you solve it." For more on what NP-hard means and why it matters, see the upcoming page on theoretical computer science.
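A two-variable toy makes the gap concrete (a sketch using `scipy.optimize.milp`, which runs branch-and-bound internally): the LP relaxation happily returns a fractional point with a better objective than any integer point can achieve:

```python
import numpy as np
from scipy.optimize import linprog, milp, LinearConstraint, Bounds

# max x1 + x2  s.t.  2*x1 + 2*x2 <= 3,  x1, x2 in {0, 1}
c = np.array([-1.0, -1.0])                  # both solvers minimize, so negate

# LP relaxation: the constraint allows x1 + x2 = 1.5, so the optimum
# is fractional with objective 1.5 ...
relax = linprog(c, A_ub=[[2.0, 2.0]], b_ub=[3.0], bounds=[(0, 1)] * 2)
print("relaxation:", relax.x, "obj =", -relax.fun)   # obj = 1.5

# ... but no integer point reaches it: the true optimum takes one of the
# two variables, objective 1.
res = milp(c=c, constraints=LinearConstraint([[2.0, 2.0]], ub=3.0),
           integrality=np.ones(2), bounds=Bounds(0, 1))
print("integer:   ", res.x, "obj =", -res.fun)       # obj = 1.0
```

The difference between 1.5 and 1 is the "integrality gap," and shrinking it — with cuts, branching, or a cleverer formulation — is most of the craft of integer programming.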

Modern IP solvers (Gurobi, CPLEX) routinely handle problems with millions of variables anyway, using two key techniques:

1. Branch-and-bound — solve the LP relaxation (drop integrality). If the answer is fractional, branch on a fractional variable ($x_i \leq \lfloor v \rfloor$ in one subproblem, $x_i \geq \lceil v \rceil$ in the other) and recurse, using the relaxation's bound to prune entire subtrees that can't beat the best integer solution found so far.
2. Cutting planes — add linear inequalities that every integer point satisfies but the current fractional LP optimum violates, tightening the relaxation without cutting off any feasible integer solution.

The combination is called branch-and-cut, and it's why airline-scale scheduling problems with tens of millions of variables and constraints can be solved overnight.

Source — Integer Program (Branch & Bound)
// 0/1 Knapsack via branch-and-bound
// n items with weights w[i] and values v[i], capacity W
// Maximize sum of v[i]*x[i]  s.t. sum of w[i]*x[i] <= W, x[i] in {0,1}

global best_value = 0
global best_solution = []

function branch_and_bound(items, W, index, current_weight, current_value, taken):
    if index == n:
        if current_value > best_value:
            best_value = current_value
            best_solution = copy(taken)
        return

    // Pruning: upper bound via fractional relaxation of remaining items
    ub = current_value + fractional_bound(items[index:], W - current_weight)
    if ub <= best_value:
        return        // prune this branch

    // Branch: do NOT take item index
    branch_and_bound(items, W, index+1, current_weight, current_value, taken)

    // Branch: take item index (if it fits)
    if current_weight + w[index] <= W:
        taken[index] = 1
        branch_and_bound(items, W, index+1,
                         current_weight + w[index],
                         current_value  + v[index], taken)
        taken[index] = 0
def knapsack_bb(weights, values, W):
    """
    0/1 Knapsack via branch-and-bound.
    Returns (best_value, selected_indices).
    Sorts items internally by value/weight ratio (descending) so the fractional bound is valid and tight.
    """
    n = len(weights)
    # Sort by density for tighter upper bounds
    order = sorted(range(n), key=lambda i: values[i]/weights[i], reverse=True)
    w = [weights[i] for i in order]
    v = [values[i]  for i in order]

    best = [0]    # best value found
    taken_best = [[]]

    def upper_bound(idx, cap, val):
        """Fractional knapsack bound on remaining items."""
        for i in range(idx, n):
            if cap <= 0: break
            if w[i] <= cap:
                val += v[i]; cap -= w[i]
            else:
                val += v[i] * (cap / w[i]); break
        return val

    def bb(idx, cap, val, taken):
        if idx == n or cap == 0:
            if val > best[0]:
                best[0] = val
                taken_best[0] = taken[:]
            return
        if upper_bound(idx, cap, val) <= best[0]:
            return   # prune
        # skip item idx
        bb(idx + 1, cap, val, taken)
        # take item idx
        if w[idx] <= cap:
            taken.append(order[idx])
            bb(idx + 1, cap - w[idx], val + v[idx], taken)
            taken.pop()

    bb(0, W, 0, [])
    return best[0], sorted(taken_best[0])

weights = [2, 3, 4, 5, 1]
values  = [3, 4, 5, 8, 2]
W       = 7
val, sel = knapsack_bb(weights, values, W)
print(f"Best value: {val}, items: {sel}")
# => Best value: 11, items: [0, 3]  (0-indexed)
function knapsackBB(weights, values, W) {
  const n = weights.length;
  // Sort by density
  const order = Array.from({length:n},(_,i)=>i)
    .sort((a,b) => values[b]/weights[b] - values[a]/weights[a]);
  const w = order.map(i=>weights[i]);
  const v = order.map(i=>values[i]);

  let best = 0, bestTaken = [];

  function ub(idx, cap, val) {
    for (let i=idx; i<n && cap>0; i++) {
      if (w[i] <= cap) { val+=v[i]; cap-=w[i]; }
      else { val += v[i]*(cap/w[i]); break; }
    }
    return val;
  }

  function bb(idx, cap, val, taken) {
    if (idx===n || cap===0) {
      if (val > best) { best=val; bestTaken=[...taken]; } return;
    }
    if (ub(idx,cap,val) <= best) return;
    bb(idx+1, cap, val, taken);
    if (w[idx] <= cap) {
      taken.push(order[idx]);
      bb(idx+1, cap-w[idx], val+v[idx], taken);
      taken.pop();
    }
  }

  bb(0, W, 0, []);
  return { value: best, items: bestTaken.sort((a,b)=>a-b) };
}

const w=[2,3,4,5,1], v=[3,4,5,8,2], W=7;
console.log(knapsackBB(w, v, W));
// => { value: 11, items: [ 0, 3 ] }
#include <stdio.h>
#include <string.h>

#define MAXN 20
int n;
int W;
int wt[MAXN], val[MAXN];
int best_val;
int best_taken[MAXN], cur_taken[MAXN];

// Simple upper bound: current value plus the sum of all remaining values.
// Loose but always valid. (A fractional bound is only valid when items are
// pre-sorted by value/weight density, which this version skips.)
int upper_bound(int idx, int cap, int cur_val) {
    (void)cap;   // unused by this loose bound
    int v = cur_val;
    for (int i = idx; i < n; i++) v += val[i];
    return v;
}

void bb(int idx, int cap, int cur_val) {
    if (idx == n) {
        if (cur_val > best_val) {
            best_val = cur_val;
            memcpy(best_taken, cur_taken, n * sizeof(int));
        }
        return;
    }
    if (upper_bound(idx, cap, cur_val) <= best_val) return;
    // skip
    bb(idx + 1, cap, cur_val);
    // take
    if (wt[idx] <= cap) {
        cur_taken[idx] = 1;
        bb(idx + 1, cap - wt[idx], cur_val + val[idx]);
        cur_taken[idx] = 0;
    }
}

int main(void) {
    n=5; W=7;
    int tw[]={2,3,4,5,1}, tv[]={3,4,5,8,2};
    for(int i=0;i<n;i++){wt[i]=tw[i];val[i]=tv[i];}
    best_val=0; memset(best_taken,0,sizeof(best_taken));
    bb(0, W, 0);
    printf("Best value: %d, items: ", best_val);
    for(int i=0;i<n;i++) if(best_taken[i]) printf("%d ",i);
    printf("\n");
    return 0;
}
#include <iostream>
#include <vector>
#include <algorithm>

struct Item { int w, v; };

struct KnapsackBB {
    const std::vector<Item>& items;
    int W, n;
    int best = 0;
    std::vector<int> bestTaken;

    // Loose but valid bound: current value plus all remaining values.
    // (A fractional bound is only valid with items sorted by density.)
    int upperBound(int idx, int cap, int val) const {
        (void)cap;   // unused by this loose bound
        for (int i = idx; i < n; i++) val += items[i].v;
        return val;
    }

    void bb(int idx, int cap, int val, std::vector<int>& taken) {
        if (idx==n) { if (val>best){best=val;bestTaken=taken;} return; }
        if (upperBound(idx,cap,val) <= best) return;
        bb(idx+1, cap, val, taken);
        if (items[idx].w <= cap) {
            taken.push_back(idx);
            bb(idx+1, cap-items[idx].w, val+items[idx].v, taken);
            taken.pop_back();
        }
    }

    void solve() { std::vector<int> t; bb(0,W,0,t); }
};

int main() {
    std::vector<Item> items = {{2,3},{3,4},{4,5},{5,8},{1,2}};
    KnapsackBB kb{items, 7, (int)items.size()};
    kb.solve();
    std::cout << "Best: " << kb.best << "  items: ";
    for (int i : kb.bestTaken) std::cout << i << " ";
    std::cout << "\n";
}
import java.util.*;

public class KnapsackBB {
    static int n, W;
    static int[] wt, val;
    static int bestVal = 0;
    static int[] bestTaken;

    // Loose but valid bound: current value plus all remaining values.
    // (A fractional bound is only valid with items sorted by density.)
    static int upperBound(int idx, int cap, int cur) {
        int v = cur;
        for (int i = idx; i < n; i++) v += val[i];
        return v;
    }

    static void bb(int idx, int cap, int cur, int[] taken) {
        if (idx == n) {
            if (cur > bestVal) { bestVal=cur; bestTaken=taken.clone(); } return;
        }
        if (upperBound(idx,cap,cur) <= bestVal) return;
        bb(idx+1, cap, cur, taken);
        if (wt[idx] <= cap) {
            taken[idx] = 1;
            bb(idx+1, cap-wt[idx], cur+val[idx], taken);
            taken[idx] = 0;
        }
    }

    public static void main(String[] args) {
        n=5; W=7;
        wt = new int[]{2,3,4,5,1};
        val= new int[]{3,4,5,8,2};
        bestTaken = new int[n];
        bb(0, W, 0, new int[n]);
        System.out.print("Best: " + bestVal + "  items: ");
        for (int i=0;i<n;i++) if(bestTaken[i]==1) System.out.print(i+" ");
        System.out.println();
    }
}
package main

import "fmt"

var (
	wt   = []int{2, 3, 4, 5, 1}
	val  = []int{3, 4, 5, 8, 2}
	W    = 7
	n    = 5
	best = 0
	bestTaken []int
)

// Loose but valid bound: current value plus all remaining values.
// (A fractional bound is only valid with items sorted by density.)
func upperBound(idx, cap, cur int) int {
	v := cur
	for i := idx; i < n; i++ {
		v += val[i]
	}
	return v
}

func bb(idx, cap, cur int, taken []int) {
	if idx == n {
		if cur > best { best = cur; bestTaken = append([]int{}, taken...) }
		return
	}
	if upperBound(idx, cap, cur) <= best { return }
	bb(idx+1, cap, cur, taken)
	if wt[idx] <= cap {
		bb(idx+1, cap-wt[idx], cur+val[idx], append(taken, idx))
	}
}

func main() {
	bb(0, W, 0, nil)
	fmt.Println("Best:", best, " items:", bestTaken)
}

14. Convex optimization — beyond LP

LP is the easiest convex problem. Everything else convex is also tractable, just more expensive. The standard convex hierarchy:

| Class | Form | What's added |
| --- | --- | --- |
| LP | linear $f$ and constraints | baseline |
| QP | quadratic $f$, linear constraints | $\tfrac{1}{2}\mathbf{x}^\top Q \mathbf{x}$ in objective with $Q \succeq 0$ |
| QCQP | quadratic constraints too | convex quadratic constraints |
| SOCP | second-order cone | $\|A\mathbf{x} + \mathbf{b}\|_2 \leq \mathbf{c}^\top \mathbf{x} + d$ |
| SDP | semidefinite | matrix variables with $X \succeq 0$ |

Each class subsumes the ones above it in the table. SDPs solve LPs, QPs, and SOCPs as special cases — but SDP solvers are slower, so use the smallest class that fits your problem.

Quadratic programs

A QP looks like:

$$\min_\mathbf{x} \; \tfrac{1}{2} \mathbf{x}^\top Q \mathbf{x} + \mathbf{c}^\top \mathbf{x} \quad \text{s.t.} \quad A \mathbf{x} \leq \mathbf{b}$$

Quadratic program

$Q$
An $n \times n$ symmetric positive semi-definite matrix. PSD is what makes the QP convex.
$\tfrac{1}{2} \mathbf{x}^\top Q \mathbf{x}$
A quadratic form in $\mathbf{x}$ — bowl-shaped because $Q \succeq 0$.
$\mathbf{c}^\top \mathbf{x}$
A linear term — tilts the bowl.
$A \mathbf{x} \leq \mathbf{b}$
Linear inequality constraints.
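The PSD condition is easy to check numerically: a symmetric $Q$ is PSD exactly when all its eigenvalues are non-negative. A quick sketch (the indefinite counterexample matrix is our own):

```python
import numpy as np

# (1/2) x^T Q x + c^T x is convex iff symmetric Q is positive semi-definite,
# i.e. all eigenvalues of Q are >= 0.
Q_convex = np.array([[4., 1.], [1., 2.]])    # eigenvalues 3 +/- sqrt(2), both positive
Q_indef  = np.array([[2., 0.], [0., -2.]])   # one negative eigenvalue => saddle surface

for name, Q in [("Q_convex", Q_convex), ("Q_indef", Q_indef)]:
    eig = np.linalg.eigvalsh(Q)              # eigenvalues of a symmetric matrix
    verdict = "convex (bowl)" if eig.min() >= 0 else "not convex (saddle)"
    print(f"{name}: eigenvalues {eig.round(3)} -> {verdict}")
```

The indefinite case is exactly the saddle geometry that returns in section 15.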

Where you'll see it. Markowitz portfolio optimization is a QP with $Q$ = covariance matrix. Support vector machines are a QP. Model-predictive control reduces to a QP per timestep. QPs are everywhere.

CVXPY — modeling instead of solving

You shouldn't write convex optimization solvers. People have spent decades doing that better than you can in a weekend. Instead, use a modeling language: in Python, that's CVXPY. You write the problem in math-like syntax, CVXPY recognizes the convex structure and ships it to the right solver (ECOS, SCS, MOSEK, Clarabel). The price is a small overhead and a learning curve. The benefit is that you express problems at the level of "minimize the LASSO regression loss" rather than "implement the proximal gradient method."

Source — Convex QP (CVXPY)
// QP: minimize  (1/2) x^T Q x + c^T x
//     subject to  Ax <= b
// where Q is symmetric PSD.

// Via gradient descent on the quadratic (unconstrained for now):
// grad f(x) = Q x + c
// x_{t+1} = x_t - lr * (Q x_t + c)
// For unconstrained QP, this converges to x* = -Q^{-1} c.

// With linear constraints, use projected gradient descent:
// 1. Take a GD step: x_new = x - lr * (Qx + c)
// 2. Project onto feasible set: x = project(x_new, {Ax <= b, x >= 0})
//    Projection onto box: x_i = clip(x_i, lo_i, hi_i)
// Repeat until convergence.
import numpy as np

# ---- CVXPY approach (recommended) ----
try:
    import cvxpy as cp

    # Markowitz portfolio QP:
    # minimize  x^T Sigma x   (variance)
    # s.t.      mu^T x >= r_min,  sum(x) == 1,  x >= 0
    n_assets = 4
    np.random.seed(0)
    Sigma = np.random.randn(n_assets, n_assets)
    Sigma = Sigma @ Sigma.T / n_assets   # PSD covariance
    mu    = np.array([0.1, 0.12, 0.08, 0.15])  # expected returns
    r_min = 0.10

    x = cp.Variable(n_assets)
    prob = cp.Problem(
        cp.Minimize(cp.quad_form(x, Sigma)),
        [mu @ x >= r_min, cp.sum(x) == 1, x >= 0]
    )
    prob.solve()
    print("Portfolio weights:", x.value.round(4))
    print("Expected return:  ", round(float(mu @ x.value), 4))
    print("Variance:         ", round(float(prob.value), 6))

except ImportError:
    print("cvxpy not installed; falling back to gradient descent on unconstrained QP")

# ---- Gradient descent on unconstrained QP ----
# minimize (1/2) x^T Q x + c^T x  ==>  optimal at x* = -Q^{-1} c
def qp_gd(Q, c, x0, lr=0.01, n=2000):
    x = np.array(x0, float)
    for _ in range(n):
        grad = Q @ x + c
        x -= lr * grad
    return x

Q = np.array([[4., 1.], [1., 2.]])
c = np.array([-4., -6.])
x_opt = qp_gd(Q, c, x0=[0., 0.])
x_exact = np.linalg.solve(Q, -c)
print("GD solution:    ", x_opt.round(4))
print("Exact solution: ", x_exact.round(4))
// Gradient descent on unconstrained QP: min (1/2) x^T Q x + c^T x
// grad = Q x + c

function matVec(Q, x) { return Q.map(row => row.reduce((s,qi,j) => s+qi*x[j], 0)); }

function qpGD(Q, c, x0, lr=0.01, n=2000) {
  let x = [...x0];
  for (let t = 0; t < n; t++) {
    const Qx   = matVec(Q, x);
    const grad = Qx.map((v, i) => v + c[i]);
    x = x.map((xi, i) => xi - lr * grad[i]);
  }
  return x;
}

// min (1/2)(4x1^2 + 2x1*x2 + 2x2^2) - 4x1 - 6x2
// Q = [[4,1],[1,2]], c = [-4,-6]
// Exact: x* = Q^{-1} * (-c) = [2/7, 20/7] ≈ [0.2857, 2.8571]
const Q = [[4,1],[1,2]];
const c = [-4,-6];
const x = qpGD(Q, c, [0, 0]);
console.log("QP solution:", x.map(v => v.toFixed(4)));
// => [~0.2857, ~2.8571] (unconstrained optimum, Q x* = -c)
#include <stdio.h>

#define N 2

// Gradient descent on (1/2) x^T Q x + c^T x
void qp_gd(double Q[N][N], double c[N], double x[N], double lr, int steps) {
    for (int t = 0; t < steps; t++) {
        double grad[N] = {0};
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) grad[i] += Q[i][j] * x[j];
            grad[i] += c[i];
        }
        for (int i = 0; i < N; i++) x[i] -= lr * grad[i];
    }
}

int main(void) {
    double Q[N][N] = {{4, 1}, {1, 2}};
    double c[N]    = {-4, -6};
    double x[N]    = {0, 0};
    qp_gd(Q, c, x, 0.05, 2000);
    printf("QP solution: [%.4f, %.4f]\n", x[0], x[1]);
    return 0;
}
#include <iostream>
#include <vector>
#include <numeric>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

Vec qpGD(const Mat& Q, const Vec& c, Vec x, double lr=0.05, int n=2000) {
    int d = x.size();
    for (int t = 0; t < n; t++) {
        Vec grad(d, 0);
        for (int i=0;i<d;i++) {
            for (int j=0;j<d;j++) grad[i]+=Q[i][j]*x[j];
            grad[i]+=c[i];
        }
        for (int i=0;i<d;i++) x[i]-=lr*grad[i];
    }
    return x;
}

int main() {
    Mat Q={{4,1},{1,2}};
    Vec c={-4,-6};
    Vec x = qpGD(Q, c, {0,0});
    std::cout << "[" << x[0] << ", " << x[1] << "]\n";
}
public class QPSolve {
    static double[] qpGD(double[][] Q, double[] c, double[] x0, double lr, int n) {
        int d = x0.length;
        double[] x = x0.clone();
        for (int t = 0; t < n; t++) {
            double[] grad = new double[d];
            for (int i=0;i<d;i++){
                for (int j=0;j<d;j++) grad[i]+=Q[i][j]*x[j];
                grad[i]+=c[i];
            }
            for (int i=0;i<d;i++) x[i]-=lr*grad[i];
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] Q = {{4,1},{1,2}};
        double[] c   = {-4,-6};
        double[] x   = qpGD(Q, c, new double[]{0,0}, 0.05, 2000);
        System.out.printf("[%.4f, %.4f]%n", x[0], x[1]);
    }
}
package main

import "fmt"

func qpGD(Q [][]float64, c, x0 []float64, lr float64, n int) []float64 {
	d := len(x0)
	x := make([]float64, d)
	copy(x, x0)
	for t := 0; t < n; t++ {
		grad := make([]float64, d)
		for i := 0; i < d; i++ {
			for j := 0; j < d; j++ { grad[i] += Q[i][j] * x[j] }
			grad[i] += c[i]
		}
		for i := 0; i < d; i++ { x[i] -= lr * grad[i] }
	}
	return x
}

func main() {
	Q := [][]float64{{4, 1}, {1, 2}}
	c := []float64{-4, -6}
	x := qpGD(Q, c, []float64{0, 0}, 0.05, 2000)
	fmt.Printf("[%.4f, %.4f]\n", x[0], x[1])
}

15. Non-convex optimization in ML — the elephant

Neural network training is the most important non-convex optimization problem on Earth, and on paper it shouldn't work. The loss surface of a deep network has exponentially many local minima, an explosion of saddle points, and no convexity anywhere. So why does plain SGD train GPT-4?

The picture that's emerged since 2014:

  1. In high dimensions, saddle points dominate, not local minima. Dauphin et al. (2014) and Pascanu et al. argued that as $n$ grows, "bad" local minima become exponentially rare while saddle points proliferate. A random critical point in $\mathbb{R}^n$ is overwhelmingly likely to be a saddle. Saddles slow you down but don't trap you — first-order methods drift past them.
  2. Most local minima are roughly equally good. Choromanska et al. (2015) showed (for a simplified spin-glass model of neural nets) that local minima cluster near a single low band of objective values. Empirically the same thing holds for real networks — different random seeds give different weights but very similar loss. So even if you don't find the global min, you find a good min.
  3. Overparameterization helps. When the network has many more parameters than training examples, the loss surface flattens — most directions in parameter space are degenerate (zero curvature), and the bad local minima vanish. This is one of the central insights of modern deep learning theory.
  4. SGD has an implicit bias toward flat minima. Flat minima generalize better than sharp ones, and SGD's noise makes it preferentially settle in flat regions.

None of this is fully proven; it's an active research area. But empirically, the cocktail of "huge model + lots of data + Adam + SGD noise" is enough to find solutions that work, even on problems that should be hopelessly non-convex. See foundation models and backpropagation for what to do with this fact.
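Point 1 is easy to sanity-check numerically. As a crude spin-glass-style proxy (our own toy setup, not the models in the papers above), treat a random symmetric matrix as the Hessian at a random critical point and ask how often all eigenvalues are positive — i.e. how often the critical point is a local minimum rather than a saddle:

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_minima(n, trials=1000):
    """Fraction of random symmetric n x n 'Hessians' that are positive definite."""
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2                      # random symmetric matrix
        if np.linalg.eigvalsh(H).min() > 0:    # all eigenvalues positive => minimum
            count += 1
    return count / trials

for n in [1, 2, 4, 8, 16]:
    print(f"n={n:2d}: fraction of critical points that are minima ~ {frac_minima(n):.3f}")
```

In 1-D half of the critical points are minima; by $n = 8$ essentially none are. In a billion-dimensional loss landscape, a random critical point with zero gradient is a saddle with overwhelming probability — which is exactly why "getting stuck in a local minimum" is the wrong mental picture for deep nets.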

Source — Non-Convex Optimization
// Non-convex demo: f(x) = sin(x) + 0.1*x^2 — global min near x≈-1.31, local min near x≈3.84
// GD gets trapped; random restart helps.

// 1. GD stuck in local minimum
function gd_stuck(x0, lr, n):
    x = x0
    for i = 1 to n:
        g = cos(x) + 0.2*x     // f'(x)
        x = x - lr * g
    return x

// 2. Random restart to find global minimum
function random_restart(f, f_grad, n_restarts, x_lo, x_hi, lr, n_steps):
    best_x = None;  best_f = +inf
    for k = 1 to n_restarts:
        x0 = uniform(x_lo, x_hi)
        x  = gd(f_grad, x0, lr, n_steps)
        if f(x) < best_f:
            best_f = f(x);  best_x = x
    return best_x, best_f

// 3. Saddle point: f(x,y) = x^2 - y^2
//    grad = [2x, -2y]; at (0,0) grad=0 but it's a saddle (not a minimum)
//    Hessian = diag(2, -2) — indefinite (one neg eigenvalue)
import numpy as np

# ---- 1. GD trapped in local minimum ----
def f(x):      return np.sin(x) + 0.1 * x**2
def f_grad(x): return np.cos(x) + 0.2 * x

def gd(x0, lr=0.05, n=200):
    x = float(x0)
    for _ in range(n):
        x -= lr * f_grad(x)
    return x

# Start near a local minimum (right side)
x_local = gd(x0=4.0)
print(f"GD from x0=4.0  => x={x_local:.4f}, f={f(x_local):.4f}")
# Near global minimum (left side)
x_global = gd(x0=-1.5)
print(f"GD from x0=-1.5 => x={x_global:.4f}, f={f(x_global):.4f}")

# ---- 2. Random restart ----
rng = np.random.default_rng(42)
best_x, best_f = None, np.inf
for _ in range(30):
    x0 = rng.uniform(-10, 10)
    x  = gd(x0)
    if f(x) < best_f:
        best_f = f(x); best_x = x
print(f"Best via random restart: x={best_x:.4f}, f={best_f:.4f}")

# ---- 3. Saddle point of f(x,y) = x^2 - y^2 ----
def f2(x):      return np.array([x[0]**2 - x[1]**2])
def g2(x):      return np.array([2*x[0], -2*x[1]])
def H2():       return np.array([[2., 0.], [0., -2.]])

x_saddle = np.array([0., 0.])
print("Grad at saddle:", g2(x_saddle))   # [0, 0] — critical point
eigvals = np.linalg.eigvalsh(H2())
print("Hessian eigenvalues:", eigvals)    # [-2, 2] — indefinite → saddle
function f(x)      { return Math.sin(x) + 0.1 * x*x; }
function fGrad(x)  { return Math.cos(x) + 0.2 * x; }

function gd(x0, lr=0.05, n=200) {
  let x = x0;
  for (let i = 0; i < n; i++) x -= lr * fGrad(x);
  return x;
}

// GD trapped
console.log("GD from 4.0:",  gd(4.0).toFixed(4),  "f=", f(gd(4.0)).toFixed(4));
console.log("GD from -1.5:", gd(-1.5).toFixed(4), "f=", f(gd(-1.5)).toFixed(4));

// Random restart
let bestX = null, bestF = Infinity;
for (let k = 0; k < 30; k++) {
  const x0 = (Math.random() - 0.5) * 20;
  const x  = gd(x0);
  if (f(x) < bestF) { bestF = f(x); bestX = x; }
}
console.log(`Random restart: x=${bestX.toFixed(4)}, f=${bestF.toFixed(4)}`);

// Saddle: f(x,y) = x^2 - y^2, grad=[2x,-2y], Hessian=diag(2,-2)
console.log("Saddle grad at (0,0):", [0, 0]);
console.log("Hessian eigenvalues: [-2, 2] => indefinite => saddle");
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <float.h>

double f(double x)     { return sin(x) + 0.1*x*x; }
double f_grad(double x){ return cos(x) + 0.2*x; }

double gd(double x0, double lr, int n) {
    double x = x0;
    for (int i = 0; i < n; i++) x -= lr * f_grad(x);
    return x;
}

int main(void) {
    double x1 = gd(4.0, 0.05, 200);
    double x2 = gd(-1.5, 0.05, 200);
    printf("GD from 4.0:  x=%.4f f=%.4f\n", x1, f(x1));
    printf("GD from -1.5: x=%.4f f=%.4f\n", x2, f(x2));

    // Random restart
    srand(42);
    double bestX = 0, bestF = DBL_MAX;
    for (int k = 0; k < 100; k++) {
        double x0 = ((double)rand()/RAND_MAX)*20.0 - 10.0;
        double x  = gd(x0, 0.05, 200);
        if (f(x) < bestF) { bestF = f(x); bestX = x; }
    }
    printf("Best restart: x=%.4f f=%.4f\n", bestX, bestF);
    return 0;
}
#include <iostream>
#include <cmath>
#include <random>
#include <limits>

double f(double x)     { return std::sin(x) + 0.1*x*x; }
double fGrad(double x) { return std::cos(x) + 0.2*x; }

double gd(double x0, double lr=0.05, int n=200) {
    double x = x0;
    for (int i=0;i<n;i++) x -= lr * fGrad(x);
    return x;
}

int main() {
    std::cout << "GD from 4.0:  " << gd(4.0) << " f=" << f(gd(4.0)) << "\n";
    std::cout << "GD from -1.5: " << gd(-1.5) << " f=" << f(gd(-1.5)) << "\n";

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> dist(-10, 10);
    double bestX=0, bestF=std::numeric_limits<double>::infinity();
    for (int k=0;k<100;k++){
        double x = gd(dist(rng));
        if (f(x) < bestF) { bestF=f(x); bestX=x; }
    }
    std::cout << "Best restart: x=" << bestX << " f=" << bestF << "\n";
}
import java.util.Random;

public class NonConvex {
    static double f(double x)     { return Math.sin(x) + 0.1*x*x; }
    static double fGrad(double x) { return Math.cos(x) + 0.2*x; }

    static double gd(double x0, double lr, int n) {
        double x = x0;
        for (int i=0;i<n;i++) x -= lr * fGrad(x);
        return x;
    }

    public static void main(String[] args) {
        double x1 = gd(4.0, 0.05, 200);
        double x2 = gd(-1.5, 0.05, 200);
        System.out.printf("GD from 4.0:  x=%.4f f=%.4f%n", x1, f(x1));
        System.out.printf("GD from -1.5: x=%.4f f=%.4f%n", x2, f(x2));

        Random rng = new Random(42);
        double bestX=0, bestF=Double.MAX_VALUE;
        for (int k=0;k<100;k++){
            double x = gd(rng.nextDouble()*20-10, 0.05, 200);
            if (f(x) < bestF) { bestF=f(x); bestX=x; }
        }
        System.out.printf("Best restart: x=%.4f f=%.4f%n", bestX, bestF);
    }
}
package main

import (
	"fmt"
	"math"
	"math/rand"
)

func f(x float64) float64     { return math.Sin(x) + 0.1*x*x }
func fGrad(x float64) float64 { return math.Cos(x) + 0.2*x }

func gd(x0, lr float64, n int) float64 {
	x := x0
	for i := 0; i < n; i++ { x -= lr * fGrad(x) }
	return x
}

func main() {
	x1, x2 := gd(4.0, 0.05, 200), gd(-1.5, 0.05, 200)
	fmt.Printf("GD from 4.0:  x=%.4f f=%.4f\n", x1, f(x1))
	fmt.Printf("GD from -1.5: x=%.4f f=%.4f\n", x2, f(x2))

	rng := rand.New(rand.NewSource(42))
	bestX, bestF := 0.0, math.Inf(1)
	for k := 0; k < 100; k++ {
		x := gd(rng.Float64()*20-10, 0.05, 200)
		if v := f(x); v < bestF { bestF = v; bestX = x }
	}
	fmt.Printf("Best restart: x=%.4f f=%.4f\n", bestX, bestF)
}

16. Duality, briefly

For any constrained problem, the Lagrangian gives you a dual function:

$$g(\boldsymbol{\lambda}, \boldsymbol{\mu}) = \inf_\mathbf{x} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu})$$

Lagrange dual function

$g$
A function of the multipliers alone — the minimum of the Lagrangian over $\mathbf{x}$ for fixed $(\boldsymbol{\lambda}, \boldsymbol{\mu})$.
$\inf$
"Infimum" — the greatest lower bound; like a min, but defined even when the minimum isn't attained.
$\mathcal{L}$
The Lagrangian from earlier.

Property. $g$ is always concave in $(\boldsymbol{\lambda}, \boldsymbol{\mu})$, even if the original problem is non-convex. And $g(\boldsymbol{\lambda}, \boldsymbol{\mu}) \leq f(\mathbf{x}^\star)$ for any feasible multipliers — the dual gives a lower bound on the primal. Always.

The dual problem is to find the best (largest) lower bound:

$$\max_{\boldsymbol{\lambda}, \boldsymbol{\mu} \geq 0} \; g(\boldsymbol{\lambda}, \boldsymbol{\mu})$$

Dual problem

$\max$
Maximize over multipliers.
$\boldsymbol{\mu} \geq 0$
Inequality multipliers stay non-negative; equality multipliers are unrestricted.

The optimal dual value is always $\leq$ optimal primal value. This is weak duality. The gap between them is the duality gap. For convex problems with mild regularity (Slater), the gap is zero — this is strong duality — and the optimal dual variables are the KKT multipliers. For non-convex problems the gap can be huge, but the dual still gives a valid lower bound, which is enough to drive branch-and-bound and certify near-optimality.
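The whole weak/strong duality story fits in a one-dimensional example you can do by hand (our own toy problem): minimize $x^2$ subject to $x \geq 1$.

```python
# Primal: min x^2  s.t.  x >= 1   =>  p* = 1 at x* = 1
# Lagrangian: L(x, mu) = x^2 + mu*(1 - x)  with  mu >= 0
# Inner minimization over x: dL/dx = 2x - mu = 0  =>  x = mu/2
# Dual function: g(mu) = (mu/2)^2 + mu*(1 - mu/2) = mu - mu^2/4   (concave!)

def g(mu):
    return mu - mu**2 / 4

p_star = 1.0
for mu in [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]:
    print(f"mu = {mu:.1f}:  g(mu) = {g(mu):+.4f}  <=  p* = {p_star}")
# g peaks at mu* = 2 with g(2) = 1 = p*: zero duality gap (strong duality),
# and mu* = 2 is exactly the KKT multiplier of the x >= 1 constraint.
```

Every printed $g(\mu)$ sits at or below $p^\star = 1$ (weak duality); the maximum over $\mu \geq 0$ hits it exactly (strong duality, since the problem is convex and strictly feasible).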

Source — LP Duality
// Primal LP (standard form):
//   min  c^T x
//   s.t. Ax = b,  x >= 0
//
// Dual LP:
//   max  b^T y
//   s.t. A^T y <= c
//
// Strong duality (LP): primal_opt == dual_opt
// i.e.  c^T x* == b^T y*  at optimality
//
// Complementary slackness:
//   x_j * (c_j - A_j^T y) = 0  for all j
//   (either x_j = 0 or the j-th dual constraint is tight)

// VERIFY strong duality:
// 1. Solve primal  => optimal x*, primal_val = c^T x*
// 2. Solve dual    => optimal y*, dual_val   = b^T y*
// 3. Check |primal_val - dual_val| < tol  (should be ~0)
import numpy as np
from scipy.optimize import linprog

# ---- Primal LP ----
# min  c^T x   s.t.  Ax <= b, x >= 0
# max  x1 + x2  s.t.  x1+2x2<=4, 3x1+x2<=6
# (linprog minimizes, so negate)

c_p = np.array([-1.0, -1.0])          # minimize -x1 - x2
A_p = np.array([[1., 2.], [3., 1.]])
b_p = np.array([4., 6.])

res_primal = linprog(c_p, A_ub=A_p, b_ub=b_p, bounds=[(0,None),(0,None)])
primal_opt = -res_primal.fun           # negate back to max value
x_opt = res_primal.x
print(f"Primal: x* = {x_opt.round(4)}, max = {primal_opt:.4f}")

# ---- Dual LP (of the max form) ----
# Dual of  max c^T x  s.t. Ax <= b, x >= 0
# is:      min b^T y  s.t. A^T y >= c, y >= 0
# i.e. linprog dual: min b^T y  s.t. -A^T y <= -c

c_d  = b_p.copy()                     # min b^T y
A_d  = -A_p.T                         # -A^T y <= -c
b_d  = np.array([-1., -1.])           # -c, with c = [1, 1] from the max form

res_dual = linprog(c_d, A_ub=A_d, b_ub=b_d, bounds=[(0,None),(0,None)])
dual_opt = res_dual.fun
y_opt = res_dual.x
print(f"Dual:   y* = {y_opt.round(4)}, min = {dual_opt:.4f}")

# ---- Strong duality check ----
gap = abs(primal_opt - dual_opt)
print(f"Duality gap: {gap:.2e}  (should be ~0)")

# ---- Complementary slackness ----
slack_primal = b_p - A_p @ x_opt      # b - A*x* (should be >= 0)
print(f"Primal slack: {slack_primal.round(6)}")
print(f"Dual*slack:   {(y_opt * slack_primal).round(6)}")
# Each y_j * slack_j should be 0 (complementary slackness)
// Primal: max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6, x1,x2>=0
// Dual:   min 4y1+6y2 s.t. y1+3y2>=1, 2y1+y2>=1, y1,y2>=0
//
// By strong duality: primal_opt == dual_opt
// Solved analytically: x*=[8/5,6/5]=[1.6,1.2], primal=14/5=2.8
//                      y*=[2/5,1/5]=[0.4,0.2], dual=4*0.4+6*0.2=2.8
// Verify with vertex enumeration for the dual:

function lp2varMin(c, A, b) {
  // minimize c^T x  s.t. Ax >= b, x >= 0
  const n = A.length;
  let best = Infinity, bx = null;
  const chk = (x) => {
    if (x.some(v => v < -1e-9)) return;
    if (A.some((row,i) => row.reduce((s,ai,j)=>s+ai*x[j],0) < b[i]-1e-9)) return;
    const v = c.reduce((s,ci,i)=>s+ci*x[i],0);
    if (v < best) { best=v; bx=[...x]; }
  };
  // vertices: origin, axis intercepts, and pairwise constraint intersections
  chk([0,0]);
  for (let i=0;i<n;i++){
    if (Math.abs(A[i][0])>1e-9) chk([b[i]/A[i][0], 0]);
    if (Math.abs(A[i][1])>1e-9) chk([0, b[i]/A[i][1]]);
  }
  for (let i=0;i<n;i++){
    for (let j=i+1;j<n;j++){
      const det=A[i][0]*A[j][1]-A[i][1]*A[j][0];
      if(Math.abs(det)<1e-9) continue;
      const x1=(b[i]*A[j][1]-b[j]*A[i][1])/det;
      const x2=(A[i][0]*b[j]-A[j][0]*b[i])/det;
      chk([x1,x2]);
    }
  }
  return { x: bx, val: best };
}

// Dual: min 4y1+6y2  s.t.  y1+3y2 >= 1,  2y1+y2 >= 1,  y1,y2 >= 0
const dual = lp2varMin([4,6], [[1,3],[2,1]], [1,1]);
console.log("Dual y*:", dual.x.map(v=>v.toFixed(4)), "val:", dual.val.toFixed(4));
// Compare primal max=2.8 vs dual min=2.8 => strong duality holds
#include <stdio.h>
#include <math.h>
#include <float.h>

// Primal: max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6 => opt 2.8 at [1.6,1.2]
// Dual:   min 4y1+6y2 s.t. y1+3y2>=1, 2y1+y2>=1 => opt 2.8 at [0.4,0.2]
// Verify: c^T x* = b^T y* (strong duality)

int main(void) {
    double c[2]  = {1, 1};
    double x_opt[2] = {1.6, 1.2};       // primal solution
    double b[2]  = {4, 6};
    double y_opt[2] = {0.4, 0.2};       // dual solution

    double primal_val = c[0]*x_opt[0] + c[1]*x_opt[1];
    double dual_val   = b[0]*y_opt[0] + b[1]*y_opt[1];
    printf("Primal obj: %.4f\n", primal_val);   // 2.8
    printf("Dual obj:   %.4f\n", dual_val);     // 2.8
    printf("Gap:        %.2e\n", fabs(primal_val - dual_val));

    // Complementary slackness: y_j * (b_j - A_j*x) = 0
    double Ax[2];
    double A[2][2] = {{1,2},{3,1}};
    for (int i=0;i<2;i++){
        Ax[i]=0; for(int j=0;j<2;j++) Ax[i]+=A[i][j]*x_opt[j];
    }
    printf("y*(b-Ax): [%.4f, %.4f] (both should be 0)\n",
           y_opt[0]*(b[0]-Ax[0]), y_opt[1]*(b[1]-Ax[1]));
    return 0;
}
#include <iostream>
#include <cmath>

int main() {
    // Primal: max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6
    // x* = [1.6, 1.2], primal_val = 2.8
    // Dual:  min 4y1+6y2 s.t. y1+3y2>=1, 2y1+y2>=1
    // y* = [0.4, 0.2],  dual_val   = 2.8
    double c[]   = {1, 1};
    double xOpt[]= {1.6, 1.2};
    double b[]   = {4, 6};
    double yOpt[]= {0.4, 0.2};
    double A[2][2]= {{1,2},{3,1}};

    double pv = c[0]*xOpt[0]+c[1]*xOpt[1];
    double dv = b[0]*yOpt[0]+b[1]*yOpt[1];
    std::cout << "Primal: " << pv << "\nDual:   " << dv
              << "\nGap: " << std::abs(pv-dv) << "\n";

    // Comp. slackness
    for (int j=0;j<2;j++){
        double Ax=A[j][0]*xOpt[0]+A[j][1]*xOpt[1];
        std::cout << "y["<<j<<"] * slack = " << yOpt[j]*(b[j]-Ax) << "\n";
    }
}
public class LPDuality {
    public static void main(String[] args) {
        // Primal: max x1+x2 s.t. x1+2x2<=4, 3x1+x2<=6
        // x* = [1.6, 1.2], primal = 2.8
        // Dual:  min 4y1+6y2 s.t. y1+3y2>=1, 2y1+y2>=1
        // y* = [0.4, 0.2], dual = 2.8
        double[] c    = {1, 1};
        double[] xOpt = {1.6, 1.2};
        double[] b    = {4, 6};
        double[] yOpt = {0.4, 0.2};
        double[][] A  = {{1,2},{3,1}};

        double pv = c[0]*xOpt[0]+c[1]*xOpt[1];
        double dv = b[0]*yOpt[0]+b[1]*yOpt[1];
        System.out.printf("Primal: %.4f%nDual:   %.4f%nGap: %.2e%n",
                          pv, dv, Math.abs(pv-dv));
        for (int j=0;j<2;j++){
            double Ax=A[j][0]*xOpt[0]+A[j][1]*xOpt[1];
            System.out.printf("y[%d]*slack = %.4f%n", j, yOpt[j]*(b[j]-Ax));
        }
    }
}
package main

import (
	"fmt"
	"math"
)

func main() {
	c    := [2]float64{1, 1}
	xOpt := [2]float64{1.6, 1.2}
	b    := [2]float64{4, 6}
	yOpt := [2]float64{0.4, 0.2}
	A    := [2][2]float64{{1, 2}, {3, 1}}

	pv := c[0]*xOpt[0] + c[1]*xOpt[1]
	dv := b[0]*yOpt[0] + b[1]*yOpt[1]
	fmt.Printf("Primal: %.4f\nDual:   %.4f\nGap: %.2e\n", pv, dv, math.Abs(pv-dv))

	for j := 0; j < 2; j++ {
		Ax := A[j][0]*xOpt[0] + A[j][1]*xOpt[1]
		fmt.Printf("y[%d]*slack = %.4f\n", j, yOpt[j]*(b[j]-Ax))
	}
}

17. Connection to ML

Once you've internalized this page, every part of modern ML reads as optimization:

  • Training = minimizing a loss over parameters — non-convex, attacked with SGD and Adam (section 15).
  • Classical models are convex programs: SVMs are QPs, LASSO is a convex problem a modeling language handles directly (section 14).
  • Regularization = adding constraints or penalty terms to the objective.
  • Duality supplies the SVM dual and lower bounds that certify how far from optimal you are (section 16).
  • Hyperparameter and architecture search = discrete, black-box optimization — cousins of integer programming (section 13).

Optimization is the lingua franca. Every paper in modern ML eventually says "we optimize..." — and now you know what that sentence smuggles in.
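To make the "training = optimization" point literal, here's a sketch on toy data of our own: logistic regression fit by handing a cross-entropy loss to SciPy's general-purpose L-BFGS minimizer — no ML library in sight.

```python
import numpy as np
from scipy.optimize import minimize

# Toy 1-D binary classification data: label is the sign of x plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = (X + 0.3 * rng.normal(size=100) > 0).astype(float)

def loss(params):
    """Average cross-entropy of the logistic model p = sigmoid(w*x + b)."""
    w, b = params
    p = 1.0 / (1.0 + np.exp(-(w * X + b)))
    eps = 1e-12                                  # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(loss, x0=[0.0, 0.0], method="L-BFGS-B")
print("learned (w, b):", res.x.round(3), " final loss:", round(res.fun, 4))
```

Decision variables: (w, b). Objective: cross-entropy. Constraints: none. Swap in a deeper model and a stochastic gradient method and you have deep learning.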

See also

Multivariable calculus

Gradients, Hessians, Taylor expansions — the calculus this page assumes you've seen.

Linear algebra

Matrices, eigenvalues, positive-definiteness — the language convexity is written in.

Numerical analysis

How you actually solve linear systems and avoid floating-point disaster inside an inner loop.

Probability

Where SGD's noise comes from and why expectations show up in every loss function.

Theoretical CS

What "NP-hard" actually means and why integer programming sits on the wrong side of the line.

Gradient descent (ML focus)

The same algorithm with a deep-learning emphasis — variants in production ML training.

Further reading

  • Boyd & Vandenberghe (2004) — Convex Optimization. The bible of convex methods. Free PDF at web.stanford.edu/~boyd/cvxbook. Read chapters 1–5 and you'll know more than 90% of practitioners.
  • Nocedal & Wright (2006) — Numerical Optimization, 2nd ed. The standard reference for unconstrained and constrained nonlinear methods, including BFGS, L-BFGS, SQP, interior-point.
  • Bertsekas (2016) — Nonlinear Programming, 3rd ed. More theoretical than Nocedal & Wright, with a strong emphasis on duality and Lagrangian methods.
  • Bertsimas & Tsitsiklis (1997) — Introduction to Linear Optimization. The classic LP textbook. Covers simplex, duality, network flow, integer programming.
  • Nesterov (2018) — Lectures on Convex Optimization. Terse, beautiful, dense. The accelerated gradient method comes straight from the source.
  • Kingma & Ba (2014) — Adam: A Method for Stochastic Optimization. The paper itself; surprisingly readable.
  • Dauphin et al. (2014) — Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Why deep nets train despite non-convexity.
  • Goh (2017) — Why Momentum Really Works. Distill.pub. The clearest visual explanation of momentum ever written.
NEXT UP
→ Gradient descent in ML

You've seen optimization from the math/CS side. Next, see how the same algorithm — gradient descent — drives every neural network in production today, from the inside.