Multivariable Calculus
Single-variable calculus studies functions of one number. Multivariable calculus studies functions of many — which is every function you'll ever meet in machine learning, physics, economics, and statistics. This page walks you from partial derivatives to the Hessian, with the multivariable chain rule (the math under backpropagation) in the middle.
1. Why you care
Almost nothing in the real world depends on exactly one number. The temperature of a room depends on three spatial coordinates and time. The price of an option depends on the underlying, the strike, the volatility, the interest rate, and the days to expiry. The loss of a neural network depends on every single weight and bias — tens of millions, sometimes trillions.
Single-variable calculus gave you the tools to analyze one-variable functions: limits, derivatives, integrals, the chain rule. Multivariable calculus lifts all of those tools to functions like
A function of several variables
$$f(x_1, x_2, \ldots, x_n)$$
- $f$
- The function itself. You plug in numbers, you get a number out.
- $x_1, x_2, \ldots, x_n$
- The $n$ inputs, usually called the independent variables. Each one is a real number.
- $n$
- The number of inputs — the dimension of the input space. $n=2$ for a 3D surface you can draw, $n=10^9$ for a modern language model.
Why this matters. When you train a neural network, you're minimizing a scalar loss $\mathcal{L}(\theta)$ where $\theta = (\theta_1, \ldots, \theta_n)$ is the weight vector and $n$ can be billions. Every tool on this page is about how a scalar changes when you wiggle a very long vector of inputs.
If you're rusty, here's the one-line upgrade: in single-variable land, a derivative is a number (the slope); in multivariable land, a derivative is a vector or a matrix — an object that has a slope for every direction you could move. The gradient is a vector. The Jacobian is a matrix. The Hessian is a matrix of second derivatives. Every headline result of this page is "here is the multivariable version of a thing you already know."
Skim path. If you only have five minutes, read the section headers, stare at the gradient figure in §9, and skip to §14. If you have thirty, read §3 (partial derivatives), §4 (gradient), §6 (chain rule) and §8 (Hessian). That's the part that makes the rest of the site make sense.
2. Functions of several variables
There are two kinds of functions you'll meet on this page, and they are not the same thing. Knowing which you're looking at saves a lot of confusion.
Scalar fields. These eat a point and return a single number. Formally, $f : \mathbb{R}^n \to \mathbb{R}$. Examples: the altitude above sea level as a function of latitude and longitude; the loss of a neural network as a function of its weights; the temperature inside a room.
Vector fields. These eat a point and return a vector. Formally, $\mathbf{F} : \mathbb{R}^n \to \mathbb{R}^m$. Examples: the velocity of wind at each point on Earth (two or three numbers per point); the activations of every neuron in a hidden layer as a function of the inputs.
Scalar vs. vector fields
$$f : \mathbb{R}^n \to \mathbb{R} \qquad\qquad \mathbf{F} : \mathbb{R}^n \to \mathbb{R}^m$$
- $\mathbb{R}^n$
- The set of all $n$-tuples of real numbers. $\mathbb{R}^2$ is the plane, $\mathbb{R}^3$ is 3D space, $\mathbb{R}^n$ for $n$ large is "parameter space."
- $\to$
- "Maps to." Read $f : \mathbb{R}^n \to \mathbb{R}$ as "$f$ is a rule that turns any vector of $n$ numbers into one number."
- $f$
- Lowercase for scalar-valued functions.
- $\mathbf{F}$
- Bold capital for vector-valued functions. The bold is a flag: "this returns a vector, watch yourself."
- $m$
- The dimension of the output. For a vector field on a surface, $m = 2$; for loss, $m = 1$.
Rule of thumb. If the thing you're asking about is a single number (loss, probability, energy, height), it's a scalar field — you can draw level curves and think about it as a landscape. If it's a direction or vector (velocity, force, neural activation vector), it's a vector field and you have to keep track of components.
Visualizing a 2D scalar field
For a function $f(x, y)$ of two variables there are three equivalent pictures. You'll want to flip between them depending on the question.
- 3D surface. Graph $z = f(x, y)$ as a surface floating over the $xy$-plane. Good for intuition, impossible for $n > 2$.
- Heatmap. Color the plane by $f(x, y)$ — dark = low, bright = high. Perfect for spotting valleys and ridges.
- Level curves (contour lines). Draw the curves where $f(x, y) = c$ for various constants $c$. This is the view topo maps use. Closely spaced curves = steep; widely spaced curves = flat.
Here is a tiny hand-drawn version of the three views of the simple bowl $f(x, y) = x^2 + y^2$:
The same function $f(x,y) = x^2 + y^2$ drawn three ways. Level curves are the most useful for high dimensions: you can talk about them when you can't draw them.
3. Partial derivatives
Single-variable calculus has one kind of derivative. Multivariable calculus has one kind per direction. The simplest versions are the partial derivatives: pick one input, hold every other input fixed, and take the ordinary one-variable derivative.
Partial derivative
$$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a_1, \ldots, a_n)}{h}$$
- $\dfrac{\partial f}{\partial x_i}$
- "Partial derivative of $f$ with respect to $x_i$." The curly $\partial$ says "I'm only changing one variable." Sometimes written $f_{x_i}$ or $\partial_{x_i} f$.
- $\mathbf{a}$
- The point at which you're evaluating. Bold because it's a vector $\mathbf{a} = (a_1, \ldots, a_n)$.
- $h$
- A tiny scalar — the size of the nudge you apply to coordinate $i$ only. Taking $h \to 0$ gives the instantaneous rate.
- $a_i + h$
- Coordinate $i$ after the nudge. Every other coordinate stays at its original value.
- $\lim_{h \to 0}$
- Limit as $h$ shrinks to zero — same concept as single-variable calculus, just applied in one coordinate direction.
Picture. Stand on a hillside and face due east. Walk a tiny step in that direction. How fast did your altitude change per unit of eastward distance? That number is $\partial f / \partial x$. Turn 90° and do the same thing: $\partial f / \partial y$. You haven't moved in any other direction, so every other coordinate is frozen.
Worked example: the easy one
Let $f(x, y) = 3x^2 y + y^3$. Compute both partials.
For $\partial f / \partial x$, treat $y$ as if it were a constant:
Partial with respect to $x$
$$\frac{\partial f}{\partial x} = 6xy$$
- $f(x, y) = 3x^2 y + y^3$
- The function we're differentiating. Two inputs, one output.
- $6xy$
- Result of treating $y$ as a constant: $\tfrac{d}{dx}(3x^2 y) = 6xy$, and $\tfrac{d}{dx}(y^3) = 0$ because $y^3$ is a constant when $y$ is frozen.
For $\partial f / \partial y$, freeze $x$ and differentiate as if only $y$ were changing:
Partial with respect to $y$
$$\frac{\partial f}{\partial y} = 3x^2 + 3y^2$$
- $3x^2$
- From $\tfrac{d}{dy}(3 x^2 y) = 3 x^2$ (because $3 x^2$ is a constant factor when $x$ is frozen).
- $3y^2$
- From $\tfrac{d}{dy}(y^3) = 3 y^2$, using the power rule.
Trick. Taking partial derivatives is single-variable calculus. You just temporarily pretend the other letters are constants. Every rule you know — product, quotient, chain — still works.
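A quick numeric sanity check of the worked partials (a sketch of our own, not the page's code — the function and point are the ones from the example above, `x0`/`y0` are arbitrary):

```python
import numpy as np

# Check the worked partials of f(x, y) = 3x^2 y + y^3
# against central finite differences.
def f(x, y):
    return 3 * x**2 * y + y**3

def df_dx(x, y):  # analytic: 6xy
    return 6 * x * y

def df_dy(x, y):  # analytic: 3x^2 + 3y^2
    return 3 * x**2 + 3 * y**2

h = 1e-6
x0, y0 = 1.5, -0.7
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)  # nudge x only
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)  # nudge y only

print(abs(num_dx - df_dx(x0, y0)) < 1e-6)  # True
print(abs(num_dy - df_dy(x0, y0)) < 1e-6)  # True
```

Note how each numeric partial nudges exactly one argument — that is the "freeze the other letters" trick in executable form.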
Higher-order partials and a nicety
You can differentiate twice. $\partial^2 f / \partial x^2$ is "the partial with respect to $x$ of the partial with respect to $x$," and $\partial^2 f / \partial x \partial y$ means "differentiate first with respect to $y$, then with respect to $x$." For almost every function you'll ever meet,
Clairaut's theorem / equality of mixed partials
$$\frac{\partial^2 f}{\partial x \, \partial y} = \frac{\partial^2 f}{\partial y \, \partial x}$$
- $\dfrac{\partial^2 f}{\partial x \, \partial y}$
- The mixed partial: differentiate once in $y$, then once in $x$ (right-to-left, following the conventional order).
- $=$
- These are equal if both mixed partials exist and are continuous near the point. Which they always are for anything you're going to meet on this site.
Why it matters. This is why the Hessian matrix (§8) is symmetric — and symmetry is what lets you diagonalize it and talk about eigenvalues, which is how you classify saddle points vs minima.
4. The gradient
Partial derivatives give you $n$ separate slopes, one per axis. Stack them into a vector and you get the gradient.
The gradient
$$\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$
- $\nabla$
- "Nabla" or "del." It's an operator (a thing that eats a function and hands back another function), and $\nabla f$ is read "grad $f$" or "del $f$."
- $\nabla f(\mathbf{x})$
- The gradient vector of $f$ evaluated at the point $\mathbf{x}$. It lives in $\mathbb{R}^n$ — same dimension as the input.
- $\dfrac{\partial f}{\partial x_i}$
- The $i$-th partial, defined in §3. One scalar per coordinate direction.
- $\mathbf{x}$
- The point in $\mathbb{R}^n$ where you're evaluating the gradient. Different point, different gradient.
Intuition. The gradient is the all-the-partials-at-once vector. It answers "if I take a unit step in the best direction, how much does $f$ go up, and which way is best?" — in one object.
Why the gradient points uphill — the crucial fact
This is the single most important sentence in this whole page. The gradient $\nabla f(\mathbf{x})$ points in the direction of steepest ascent of $f$, and its length is the rate of ascent in that direction. Negated, $-\nabla f$ points in the direction of steepest descent. That is the fact underneath every gradient-descent optimizer ever written.
Here's why. Take a tiny step $\mathbf{v}$ from $\mathbf{x}$. To first order,
First-order Taylor approximation
$$f(\mathbf{x} + \mathbf{v}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \mathbf{v}$$
- $\mathbf{v}$
- A small displacement vector in $\mathbb{R}^n$. Think "tiny arrow from $\mathbf{x}$."
- $f(\mathbf{x} + \mathbf{v})$
- The value of $f$ at the nearby point.
- $\nabla f(\mathbf{x}) \cdot \mathbf{v}$
- The dot product of the gradient with the step. This is a single number — the first-order change in $f$.
- $\approx$
- "Approximately equal." The approximation gets better as $\|\mathbf{v}\|$ gets smaller, and the error shrinks faster than linearly.
In plain English. Near $\mathbf{x}$, the function looks like a flat tilted plane, and the tilt is given by the gradient. Any small displacement changes $f$ by (roughly) gradient-dot-displacement.
If we constrain $\mathbf{v}$ to have fixed length (say $\|\mathbf{v}\| = 1$, a unit vector), then the change in $f$ is $\nabla f \cdot \mathbf{v}$, and the dot product is maximized when $\mathbf{v}$ points the same way as $\nabla f$:
Dot product in terms of angle
$$\nabla f \cdot \mathbf{v} = \|\nabla f\| \, \|\mathbf{v}\| \cos\theta = \|\nabla f\| \cos\theta$$
- $\|\nabla f\|$
- The length (Euclidean norm) of the gradient vector — a non-negative scalar.
- $\|\mathbf{v}\| = 1$
- We're looking at unit-length steps, so this drops out.
- $\theta$
- The angle between $\nabla f$ and $\mathbf{v}$. This is the only thing you can still vary.
- $\cos\theta$
- Takes its maximum value of $1$ when $\theta = 0$ — i.e., when $\mathbf{v}$ points the same way as $\nabla f$.
Conclusion. The direction that maximizes the first-order change in $f$ is the gradient's own direction, and the maximum change is $\|\nabla f\|$. That's why "the gradient points uphill" is not a mnemonic — it's a theorem about dot products.
A second, almost-free corollary: the gradient is perpendicular to the level curve passing through $\mathbf{x}$. Because along the level curve, $f$ doesn't change, so $\nabla f \cdot \mathbf{v} = 0$ for any tangent step $\mathbf{v}$ — and "dot product zero" means "perpendicular." Look at the concentric circles in §2 and imagine little arrows pointing straight out from the center. That's the gradient field of $x^2 + y^2$.
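The steepest-ascent claim is easy to stress-test numerically (our own sketch, using the same bowl $f(x,y) = x^2 + y^2$ whose level curves appear in §2):

```python
import numpy as np

# For f(x, y) = x^2 + y^2, sample many unit directions and confirm that
# none beats the gradient's own direction for first-order increase.
def grad(p):
    return np.array([2 * p[0], 2 * p[1]])  # analytic gradient of x^2 + y^2

p = np.array([1.0, 2.0])
g = grad(p)
g_unit = g / np.linalg.norm(g)

rng = np.random.default_rng(0)
best = -np.inf
for _ in range(10_000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)          # random unit direction
    best = max(best, g @ u)         # first-order change for a unit step u

print(best <= g @ g_unit + 1e-12)                 # True: nothing beats the gradient
print(np.isclose(g @ g_unit, np.linalg.norm(g)))  # True: max rate = ||grad f||
```

Ten thousand random directions, and the winner is still the gradient's own — which is exactly what the cosine argument above predicts.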
5. Directional derivatives
The gradient packages the slopes along the coordinate axes. But nothing stops you from asking "how fast is $f$ changing if I walk in this direction?" — where "this" is some arbitrary unit vector. That's the directional derivative, and it's a one-line fall-out from §4.
Directional derivative
$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}$$
- $D_{\mathbf{u}} f$
- The directional derivative of $f$ in direction $\mathbf{u}$. A single scalar — the rate at which $f$ changes per unit distance moved in direction $\mathbf{u}$.
- $\mathbf{u}$
- A unit vector, $\|\mathbf{u}\| = 1$. If you forget to normalize, you'll get a number that depends on the length of $\mathbf{u}$, which is meaningless. Always normalize.
- $\cdot$
- Dot product — see the linear algebra page.
Check. If $\mathbf{u}$ is the unit vector along the $x_i$-axis (all zeros except a $1$ in slot $i$), the dot product picks off the $i$-th component of the gradient — so $D_{\mathbf{u}} f = \partial f / \partial x_i$. Directional derivatives generalize partials; they don't replace them.
Worked example
Let $f(x,y) = x^2 + 3xy$ and take $\mathbf{x} = (1, 2)$. The gradient is $\nabla f = (2x + 3y, \; 3x) = (2 + 6, \; 3) = (8, 3)$. Now walk in the direction $\mathbf{u} = (1, 1) / \sqrt{2}$ — 45° northeast. Then
Computing the directional derivative
$$D_{\mathbf{u}} f = (8, 3) \cdot \tfrac{1}{\sqrt{2}}(1, 1) = \frac{8 + 3}{\sqrt{2}} = \frac{11}{\sqrt{2}}$$
- $(8, 3)$
- The gradient of $f$ at $(1, 2)$. Two numbers, one per input.
- $\tfrac{1}{\sqrt 2}(1, 1)$
- The normalized "northeast" direction. Dividing by $\sqrt 2$ makes the length 1.
- $11 / \sqrt 2$
- The instantaneous rate of change of $f$ as you walk northeast from $(1, 2)$.
Sanity check. The rate is smaller than $\|\nabla f\| = \sqrt{64 + 9} \approx 8.54$, as it should be: moving in any direction other than the gradient's own is strictly less efficient for climbing.
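The worked example can be checked two ways at once (a sketch of our own): via the gradient-dot-direction formula, and by literally walking a tiny step along $\mathbf{u}$:

```python
import numpy as np

# Check the worked example: f(x, y) = x^2 + 3xy at (1, 2),
# direction u = (1, 1)/sqrt(2). Expect D_u f = 11/sqrt(2).
def f(v):
    return v[0]**2 + 3 * v[0] * v[1]

x = np.array([1.0, 2.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)               # unit-length "northeast"

grad = np.array([2 * x[0] + 3 * x[1], 3 * x[0]])    # analytic gradient: (8, 3)
d_analytic = grad @ u                                # dot-product formula

h = 1e-6
d_numeric = (f(x + h * u) - f(x - h * u)) / (2 * h)  # walk along u directly

print(np.isclose(d_analytic, 11 / np.sqrt(2)))  # True
print(np.isclose(d_numeric, d_analytic))        # True
```

Both routes agree — the dot-product formula really is "the rate of change along that walk."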
6. The multivariable chain rule
The chain rule is the hinge of this whole page. Every other part is useful; this part is structural. Backpropagation — the algorithm that trains every modern neural network — is an application of the multivariable chain rule on a computation graph. If you learn one thing on this page, learn this.
Easy version: $z = f(x(t), y(t))$
You have a scalar function $f$ of two variables, and each of those variables is in turn a function of a single parameter $t$. You want $dz/dt$.
Chain rule — one parameter
$$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$$
- $z = f(x, y)$
- A scalar function of two intermediate variables.
- $x = x(t), \; y = y(t)$
- Both $x$ and $y$ depend on the same parameter $t$.
- $dz/dt$
- The total rate of change of $z$ as $t$ changes. We want one number.
- $\partial f / \partial x, \; \partial f / \partial y$
- How fast $f$ changes when you wiggle its inputs directly. These live at the current $(x, y)$.
- $dx/dt, \; dy/dt$
- How fast each intermediate variable changes as $t$ changes. Single-variable derivatives.
Why the sum? Wiggling $t$ wiggles both $x$ and $y$, and each wiggle pushes $z$ through its own partial. You add the two contributions because, to first order, their effects are independent. This is the same reason gradients add components — linearity.
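The two-term sum is easy to verify on a made-up example (the function and parameter values here are our own, chosen only so both terms are nonzero):

```python
import numpy as np

# One-parameter chain rule: z = x^2 * y with x(t) = cos(t), y(t) = t^2.
def z_of_t(t):
    return np.cos(t)**2 * t**2

def dz_dt_chain(t):
    x, y = np.cos(t), t**2
    dz_dx, dz_dy = 2 * x * y, x**2     # partials of f at the current (x, y)
    dx_dt, dy_dt = -np.sin(t), 2 * t   # single-variable derivatives
    return dz_dx * dx_dt + dz_dy * dy_dt  # sum of the two contributions

t0 = 1.3
h = 1e-6
numeric = (z_of_t(t0 + h) - z_of_t(t0 - h)) / (2 * h)
print(np.isclose(numeric, dz_dt_chain(t0)))  # True
```

Dropping either term of the sum breaks the agreement — both wiggles really do contribute.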
General version: multiple intermediate variables
Now suppose $z$ is a scalar function of $n$ intermediate variables $g_1, g_2, \ldots, g_n$, and each $g_i$ depends on $m$ primitive inputs $t_1, \ldots, t_m$:
Chain rule — the general form
$$\frac{\partial z}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial z}{\partial g_i} \, \frac{\partial g_i}{\partial t_j}$$
- $z$
- A scalar output (a loss, a cost, a probability).
- $g_1, \ldots, g_n$
- Intermediate variables. Each one might depend on any or all of the $t_j$. Think of them as the outputs of a hidden layer.
- $t_1, \ldots, t_m$
- The primitive inputs you actually control. In ML these are the network's weights and biases.
- $\dfrac{\partial z}{\partial g_i}$
- "Upstream gradient" — how much $z$ responds to a wiggle in the intermediate $g_i$ directly.
- $\dfrac{\partial g_i}{\partial t_j}$
- "Local gradient" — how much $g_i$ responds to a wiggle in the primitive input $t_j$.
- $\sum_{i=1}^{n}$
- Sum over all intermediate variables. If $t_j$ affects multiple $g_i$'s, each path contributes a term, and you add.
The core slogan. "Downstream sensitivity $=$ sum over paths of (upstream sensitivity $\times$ local sensitivity)." That is the entire chain rule, in any dimension, stated in words.
The computation graph view
Think of the chain rule geometrically on a directed graph: inputs $t_j$ at the bottom, intermediate nodes $g_i$ in the middle, scalar output $z$ at the top. Edges represent direct functional dependence. To compute $\partial z / \partial t_j$, you walk every path from $t_j$ up to $z$, multiplying edge gradients along the path, and then sum over all paths. That is exactly what backpropagation does — with one beautiful optimization.
Backpropagation is the observation that, instead of recomputing those path-sums independently for every input $t_j$, you can compute the gradient of $z$ with respect to every node in the graph in a single backward pass, because every node's gradient is a sum over its immediate successors — and those successors already have their gradients in hand when you reach the node. This is reverse-mode automatic differentiation; it costs roughly the same as one forward pass regardless of how many inputs you have. Without this trick, training a billion-parameter network would be impossible. Read the backpropagation page for the full unpacking.
Every time you see a chain-rule sum in a paper, remind yourself: that sum is a sum over edges in a graph, and it's there because your function is a composition of simpler functions. The "multivariable" part is purely about branching — when one variable affects multiple downstream things, you add.
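The backward-pass idea can be sketched in a few dozen lines. This is a toy of our own design (real frameworks add vectorization, memory management, and far more operations), but the accumulation rule — each node's gradient is a sum over its successors of upstream times local — is exactly the one described above:

```python
import math

# Minimal reverse-mode autodiff: each node stores its value, its parents
# (with the local gradient on each edge), and — after backward() — dz/dnode.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (parent_node, local_gradient)
        self.grad = 0.0

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def topo(output):
    # Topological order, so each node's grad is complete before we use it.
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(output)
    return order

def backward(output):
    output.grad = 1.0
    for node in reversed(topo(output)):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # upstream * local, summed per path

# z = x*y + sin(x): dz/dx = y + cos(x), dz/dy = x
x, y = Node(2.0), Node(3.0)
z = add(mul(x, y), sin(x))
backward(z)
print(abs(x.grad - (3.0 + math.cos(2.0))) < 1e-12)  # True
print(abs(y.grad - 2.0) < 1e-12)                    # True
```

Note that $x$ feeds two paths (through `mul` and through `sin`), and its gradient is the sum of both contributions — the branching rule in action.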
7. The Jacobian matrix
So far we've stuck to scalar outputs. When the output is a vector, the right generalization of the derivative is a matrix: one row per output, one column per input. That matrix is the Jacobian.
The Jacobian matrix
$$\mathbf{J}_{\mathbf{f}} = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n} \end{pmatrix}$$
- $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^m$
- A vector-valued function: $n$ inputs in, $m$ outputs out.
- $f_1, f_2, \ldots, f_m$
- The component functions. Each $f_i$ is a scalar function of all $n$ inputs.
- $\mathbf{J}_{\mathbf{f}}$
- The Jacobian matrix of $\mathbf{f}$. Sometimes written $D\mathbf{f}$ or $\partial \mathbf{f} / \partial \mathbf{x}$.
- Shape
- $m$ rows, $n$ columns. Row $i$ is the gradient of $f_i$ (transposed); column $j$ tells you how all $m$ outputs respond to wiggling input $j$.
- $\partial f_i / \partial x_j$
- The partial of the $i$-th output with respect to the $j$-th input — a scalar.
Why a matrix? Because the best linear approximation to a vector function near $\mathbf{x}$ is $\mathbf{f}(\mathbf{x} + \mathbf{v}) \approx \mathbf{f}(\mathbf{x}) + \mathbf{J}_{\mathbf{f}}(\mathbf{x}) \, \mathbf{v}$, and a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ is, by definition, an $m \times n$ matrix. The Jacobian is "the derivative as a matrix."
Three important special cases. If $m = 1$ (scalar output), the Jacobian is a $1 \times n$ row vector — which, up to transpose, is exactly the gradient. If $n = 1$ (scalar input), the Jacobian is an $m \times 1$ column vector — the tangent vector to the curve $t \mapsto \mathbf{f}(t)$. If $m = n$, the Jacobian is square, and its determinant has a geometric meaning we'll use in the multiple-integrals section.
Worked example
Let $\mathbf{f}(x, y) = (x^2 y, \; x + \sin y)$. Two inputs, two outputs, so the Jacobian is $2 \times 2$.
A concrete Jacobian
$$\mathbf{J}_{\mathbf{f}}(x, y) = \begin{pmatrix} 2xy & x^2 \\ 1 & \cos y \end{pmatrix}$$
- Row 1
- Partials of $f_1 = x^2 y$. $\partial f_1 / \partial x = 2xy$, $\partial f_1 / \partial y = x^2$.
- Row 2
- Partials of $f_2 = x + \sin y$. $\partial f_2 / \partial x = 1$, $\partial f_2 / \partial y = \cos y$.
Application. Multiplying a small input change $(dx, dy)$ by this matrix gives the linear approximation to the resulting output change $(df_1, df_2)$. For the same reason the gradient is the local best-linear-fit to a scalar field, the Jacobian is the local best-linear-fit to a vector field.
One more thing. The chain rule from §6 has a wonderfully clean restatement in Jacobian form: if $\mathbf{z} = \mathbf{f}(\mathbf{y})$ and $\mathbf{y} = \mathbf{g}(\mathbf{x})$, then
Chain rule in Jacobian form
$$\mathbf{J}_{\mathbf{f} \circ \mathbf{g}}(\mathbf{x}) = \mathbf{J}_{\mathbf{f}}(\mathbf{g}(\mathbf{x})) \, \mathbf{J}_{\mathbf{g}}(\mathbf{x})$$
- $\mathbf{f} \circ \mathbf{g}$
- Function composition — "do $\mathbf{g}$ first, then $\mathbf{f}$."
- $\mathbf{J}_{\mathbf{f}} \mathbf{J}_{\mathbf{g}}$
- Matrix product of the two Jacobians, evaluated at compatible points.
Moral. The multivariable chain rule is just matrix multiplication. This is why backprop can be written as a sequence of matrix-vector products — each layer contributes its local Jacobian, and the product of Jacobians (left-to-right from the output side) is the gradient of the loss with respect to the inputs.
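"The chain rule is matrix multiplication" is a checkable claim. Here is a sketch with two made-up maps (`f`, `g`, and the evaluation point are our own; the finite-difference helper mirrors the one in §13):

```python
import numpy as np

# Check J_{f∘g}(x) = J_f(g(x)) @ J_g(x) numerically.
def g(v):
    x, y = v
    return np.array([x + y, x * y])

def f(u):
    a, b = u
    return np.array([a**2, a + np.sin(b)])

def num_jac(F, x, h=1e-6):
    # Central-difference Jacobian, one column per input.
    x = np.asarray(x, float)
    cols = []
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        cols.append((F(x + e) - F(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

x = np.array([0.7, -1.2])
lhs = num_jac(lambda v: f(g(v)), x)      # Jacobian of the composite
rhs = num_jac(f, g(x)) @ num_jac(g, x)   # product of the two Jacobians
print(np.allclose(lhs, rhs, atol=1e-5))  # True
```

Note the evaluation points: the outer Jacobian is taken at $\mathbf{g}(\mathbf{x})$, not at $\mathbf{x}$ — the most common chain-rule slip.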
8. The Hessian matrix
The gradient captures first-order information — the linear tilt of a scalar field. The second-order information — the curvature — lives in the Hessian matrix. The Hessian is to the gradient what the second derivative is to the first in single-variable calculus.
The Hessian matrix
$$\mathbf{H}_f = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$$
- $\mathbf{H}_f$
- The Hessian of $f$. An $n \times n$ matrix of second partial derivatives.
- $\partial^2 f / \partial x_i^2$
- Diagonal entries — "pure" second partials. Curvature along each coordinate axis.
- $\partial^2 f / \partial x_i \partial x_j$
- Off-diagonal entries — "mixed" second partials. How the slope in direction $i$ changes as you move in direction $j$.
- Symmetry
- By Clairaut's theorem, $\partial^2 f / \partial x_i \partial x_j = \partial^2 f / \partial x_j \partial x_i$, so the Hessian is symmetric for any well-behaved $f$. This is a huge deal.
One-line slogan. The Hessian is the Jacobian of the gradient. Gradient gives you the slope as a vector function of position; take its Jacobian and you get the Hessian. That's why it's symmetric (equality of mixed partials) and that's why it matters (it's the best quadratic fit to $f$ near a point).
A second-order Taylor expansion makes the Hessian's role concrete:
Second-order Taylor expansion
$$f(\mathbf{x} + \mathbf{v}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \mathbf{v} + \tfrac{1}{2} \mathbf{v}^\top \mathbf{H}_f(\mathbf{x}) \, \mathbf{v}$$
- $\mathbf{v}$
- A small displacement from $\mathbf{x}$.
- $\nabla f \cdot \mathbf{v}$
- First-order (linear) correction — from §4.
- $\tfrac{1}{2} \mathbf{v}^\top \mathbf{H}_f \mathbf{v}$
- Second-order (quadratic) correction. This is a quadratic form: a number built by sandwiching the matrix $\mathbf{H}_f$ between $\mathbf{v}$ and its transpose.
- $\mathbf{v}^\top$
- The transpose of $\mathbf{v}$ — turning a column vector into a row vector so the matrix product is a scalar.
- $\tfrac{1}{2}$
- Comes from the second-order term of a Taylor series, same as $f''(a) h^2 / 2$ in 1D.
Geometric reading. Near $\mathbf{x}$, the function looks like a tilted quadratic bowl. The tilt comes from the gradient; the shape of the bowl — whether it's a well, a dome, or a saddle — comes from the Hessian. Which is exactly what classifies critical points in §10.
Eigenvalues of $\mathbf{H}_f$ tell you the principal curvatures — the amounts by which $f$ curves along the most- and least-curved directions — and their eigenvectors tell you those directions. A positive eigenvalue means "bowl-up in that direction," a negative one means "dome-down," a zero eigenvalue means "locally flat." Second-order optimization methods (Newton's method, natural gradient) use the Hessian directly; first-order methods (plain SGD, Adam) only approximate it.
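A tiny NumPy illustration (the quadratic here is our own example; its Hessian is constant, which makes the eigenvalue story clean):

```python
import numpy as np

# For f(x, y) = x^2 + 4xy + y^2 the Hessian is constant: [[2, 4], [4, 2]].
H = np.array([[2.0, 4.0],
              [4.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(H)  # eigh: symmetric matrices, ascending order
print(eigvals)                        # mixed signs (-2 and 6): a saddle

# The eigenvector of the negative eigenvalue is the "downhill" direction:
v = eigvecs[:, 0]                     # unit eigenvector for eigenvalue -2
curvature = v @ H @ v                 # quadratic form along v
print(np.isclose(curvature, -2.0))    # True: f curves down along v
```

Because the Hessian is symmetric (Clairaut again), `eigh` applies, and the quadratic form along a unit eigenvector equals its eigenvalue — which is exactly the "principal curvature" reading above.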
9. Interactive: gradient field viewer
Drag the pink point anywhere on the plane. The heatmap shows $f(x, y) = \sin(x) \cos(y) + 0.1(x^2 + y^2)$. The arrow at the point is the gradient $\nabla f$ (rescaled for visibility). The black curves are level curves — notice how the gradient arrow is always perpendicular to the level curve it's sitting on, and points uphill (toward brighter regions).
Point at $(0.0, 0.0)$ — function value and gradient shown.
Three things to play with. (1) Drag the dot toward a bright spot: the arrow should shrink to zero at the peak. (2) Drag it onto a level curve and notice how the arrow is always at right angles to the curve. (3) Park it on a saddle — between two peaks — and you'll see the arrow point away from whichever direction is locally higher. That's the geometry of the chain rule and of gradient descent in one figure.
10. Critical points & the second-derivative test
A critical point of a scalar field is a point where the gradient vanishes: $\nabla f(\mathbf{x}^*) = \mathbf{0}$. In single-variable calculus, that's a point with horizontal tangent — minimum, maximum, or inflection. In multivariable land, there's a new beast: the saddle point, where $f$ goes up in one direction and down in another.
The second-derivative test via Hessian eigenvalues
At a critical point, the gradient term in the Taylor expansion is zero, so the shape of $f$ to leading order is dictated entirely by the quadratic form $\tfrac{1}{2} \mathbf{v}^\top \mathbf{H} \mathbf{v}$. What happens to that quadratic form depends on the signs of the Hessian's eigenvalues:
- All eigenvalues positive ($\mathbf{H}$ positive definite): $\mathbf{x}^*$ is a local minimum. The function curves up in every direction. Bowl.
- All eigenvalues negative ($\mathbf{H}$ negative definite): $\mathbf{x}^*$ is a local maximum. Curves down everywhere. Dome.
- Mixed signs (some positive, some negative): $\mathbf{x}^*$ is a saddle point. Up in some directions, down in others.
- Some zero eigenvalue: test is inconclusive. You need higher-order information.
For two variables there's a well-known shortcut. Let $H = \det \mathbf{H}_f = f_{xx} f_{yy} - f_{xy}^2$. Then:
2D second-derivative test
$$H > 0,\ f_{xx} > 0 \Rightarrow \text{local min}; \quad H > 0,\ f_{xx} < 0 \Rightarrow \text{local max}; \quad H < 0 \Rightarrow \text{saddle}; \quad H = 0 \Rightarrow \text{inconclusive}$$
- $f_{xx}, f_{yy}, f_{xy}$
- The three second partials — short-hand for $\partial^2 f / \partial x^2$, $\partial^2 f / \partial y^2$, $\partial^2 f / \partial x \partial y$.
- $H$
- The determinant of the $2 \times 2$ Hessian. Positive $H$ means the eigenvalues have the same sign; negative $H$ means opposite signs (saddle).
- $f_{xx} > 0$
- Curves up in the $x$-direction. Combined with $H > 0$, that forces both eigenvalues to be positive.
Why it works. For a $2 \times 2$ symmetric matrix, $\det = \lambda_1 \lambda_2$ and $\text{tr} = \lambda_1 + \lambda_2 = f_{xx} + f_{yy}$. Same-sign eigenvalues $\Leftrightarrow \det > 0$; positive eigenvalues $\Leftrightarrow \det > 0$ and $\text{tr} > 0$, and the trace sign matches $f_{xx}$'s sign when $\det > 0$.
Worked example
Find and classify critical points of $f(x,y) = x^3 - 3xy + y^3$. First the gradient:
Gradient of the cubic
$$\nabla f(x, y) = \left( 3x^2 - 3y, \;\; -3x + 3y^2 \right)$$
- $3x^2 - 3y$
- Partial w.r.t. $x$: differentiate $x^3 \to 3x^2$, $-3xy \to -3y$, $y^3 \to 0$.
- $-3x + 3y^2$
- Partial w.r.t. $y$: $x^3 \to 0$, $-3xy \to -3x$, $y^3 \to 3y^2$.
Set both components to zero: $3x^2 - 3y = 0$ gives $y = x^2$; substitute into $-3x + 3y^2 = 0$ to get $x = y^2 = x^4$, so $x(x^3 - 1) = 0$, which means $x = 0$ or $x = 1$. Critical points: $(0,0)$ and $(1,1)$. Now compute the Hessian:
Hessian of the cubic
$$\mathbf{H}_f(x, y) = \begin{pmatrix} 6x & -3 \\ -3 & 6y \end{pmatrix}$$
- $6x$
- Second partial $\partial^2 f / \partial x^2$ — differentiate $3x^2 - 3y$ again in $x$.
- $-3$
- Mixed partial — $\partial/\partial y$ of $3x^2 - 3y$, or equivalently $\partial / \partial x$ of $-3x + 3y^2$.
- $6y$
- Second partial $\partial^2 f / \partial y^2$.
At $(0,0)$: $\mathbf{H} = \begin{smallmatrix} 0 & -3 \\ -3 & 0 \end{smallmatrix}$, $\det = -9 < 0$, so this is a saddle point. At $(1,1)$: $\mathbf{H} = \begin{smallmatrix} 6 & -3 \\ -3 & 6 \end{smallmatrix}$, $\det = 36 - 9 = 27 > 0$ and $f_{xx} = 6 > 0$, so this is a local minimum. Done.
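The same classification, done by eigenvalue sign instead of the determinant shortcut (a sketch of our own, reusing the Hessian from the worked example):

```python
import numpy as np

# Hessian of f = x^3 - 3xy + y^3 is [[6x, -3], [-3, 6y]];
# classify the two critical points by the signs of its eigenvalues.
def hessian(x, y):
    return np.array([[6 * x, -3.0],
                     [-3.0, 6 * y]])

for (x, y) in [(0.0, 0.0), (1.0, 1.0)]:
    eigvals = np.linalg.eigvalsh(hessian(x, y))  # ascending order
    if np.all(eigvals > 0):
        label = "local minimum"
    elif np.all(eigvals < 0):
        label = "local maximum"
    else:
        label = "saddle"
    print((x, y), eigvals, label)
# (0.0, 0.0): eigenvalues [-3.  3.] -> saddle
# (1.0, 1.0): eigenvalues [3.  9.]  -> local minimum
```

Mixed signs at the origin, all-positive at $(1,1)$ — the determinant test and the eigenvalue test agree, as §10 says they must.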
Saddles in high dimensions. In $n$ dimensions, a critical point is a local minimum only if all $n$ eigenvalues are positive. In a billion-dimensional loss landscape, the chance of that happening at a random critical point is astronomically small — most critical points are saddles. This is a key reason modern neural networks train at all: gradient descent doesn't get stuck in bad local minima, it gets stuck on saddles and slopes, and momentum-based methods (Adam, momentum SGD) are very good at sliding off them.
11. Multiple integrals
Single-variable calculus integrates over intervals. Multivariable calculus integrates over regions, surfaces, and volumes. The mechanics are the same — you partition, sample, and take a limit — but the bookkeeping is heavier.
Double integrals and iterated integration
For a region $R \subset \mathbb{R}^2$ and a scalar field $f(x,y)$, the double integral is the (signed) volume under the surface $z = f(x,y)$ above $R$:
Double integral as iterated integral
$$\iint_R f(x, y) \, dA = \int_a^b \int_{c(x)}^{d(x)} f(x, y) \, dy \, dx$$
- $R$
- A region in the plane, perhaps described as "$a \le x \le b$ and $c(x) \le y \le d(x)$."
- $dA$
- An infinitesimal area element — the 2D analogue of $dx$. In $(x,y)$ coordinates, $dA = dx \, dy$.
- $\iint_R$
- "Double integral over $R$." The two integral signs mean we integrate in two dimensions.
- $\int_{c(x)}^{d(x)} \, dy$
- Inner integral: integrate out $y$ first, treating $x$ as fixed. Result is a function of $x$.
- $\int_a^b \, dx$
- Outer integral: integrate the result over $x$. Turns the function of $x$ into a number.
Fubini's theorem. For any reasonable integrand and region, you can do the two integrals in either order and get the same answer. "Reasonable" covers essentially everything on this site.
Triple integrals work the same way, with one more level of nesting: $\iiint_V f \, dV$, $dV = dx \, dy \, dz$, and three iterated limits.
Change of variables and the Jacobian determinant
Sometimes the region is awkward in $(x,y)$ coordinates but simple in some other coordinates — polar, spherical, or a custom transformation. When you change variables from $(x,y)$ to $(u,v)$ via $\mathbf{x} = \mathbf{g}(u, v)$, the area element transforms by the absolute value of the Jacobian determinant:
Change of variables in 2D
$$\iint_R f(x, y) \, dx \, dy = \iint_{R'} f(\mathbf{g}(u, v)) \, \left| \det \mathbf{J}_{\mathbf{g}}(u, v) \right| \, du \, dv$$
- $\mathbf{g}$
- The coordinate transformation: $\mathbf{g}(u,v) = (x(u,v), y(u,v))$.
- $R'$
- The pre-image of $R$ under $\mathbf{g}$ — the same region, described in $(u,v)$ coordinates.
- $\mathbf{J}_{\mathbf{g}}$
- The $2 \times 2$ Jacobian matrix of $\mathbf{g}$. Its determinant measures how much $\mathbf{g}$ stretches or squishes area near the point.
- $|\det \mathbf{J}_{\mathbf{g}}|$
- Absolute value of the determinant. The absolute value is there because orientation doesn't matter for measuring area.
Example. Polar coordinates: $x = r\cos\theta$, $y = r\sin\theta$, and $\det \mathbf{J} = r$. That's where the $r \, dr \, d\theta$ you memorized comes from — it is literally the Jacobian determinant of the polar transformation.
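You can check that $r$ really is the right area factor by doing a polar integral on a grid (our own sketch; the grid sizes are arbitrary, and a midpoint rule is used for simplicity):

```python
import numpy as np

# Integrate f(x, y) = x^2 + y^2 (= r^2 in polar) over the unit disk.
# Exact: ∫_0^{2π} ∫_0^1 r^2 · r dr dθ = 2π · (1/4) = π/2.
nr, nt = 1000, 1000
dr, dt = 1.0 / nr, 2 * np.pi / nt
r = (np.arange(nr) + 0.5) * dr            # midpoints of the r-grid
t = (np.arange(nt) + 0.5) * dt            # midpoints of the θ-grid
R, T = np.meshgrid(r, t, indexing="ij")

integrand = R**2                          # f in polar (independent of θ here)
jacobian = R                              # |det J| of (r, θ) -> (x, y)
polar = np.sum(integrand * jacobian) * dr * dt

print(np.isclose(polar, np.pi / 2, atol=1e-4))  # True
```

Drop the `jacobian` factor and the sum lands nowhere near $\pi/2$ — the $r$ in $r \, dr \, d\theta$ is doing real work.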
This same construction shows up in probability theory as the change-of-variables formula for densities, and in normalizing-flow generative models: a bijective neural network $\mathbf{g}$ transforms a simple density into a complex one, and the Jacobian determinant of $\mathbf{g}$ is what lets you compute the transformed density exactly.
12. Lagrange multipliers
Often you don't want the unconstrained extremum of $f$ — you want the extremum subject to some constraint, like "maximize $f(x, y)$ subject to $g(x, y) = 0$." Lagrange multipliers turn this into an unconstrained problem in one more variable.
Lagrange condition
$$\nabla f(\mathbf{x}^*) = \lambda \, \nabla g(\mathbf{x}^*), \qquad g(\mathbf{x}^*) = 0$$
- $f$
- The objective function you want to optimize.
- $g$
- The constraint function. The constraint is "$g = 0$" — a level curve of $g$.
- $\mathbf{x}^*$
- A candidate optimum on the constraint surface.
- $\lambda$
- The Lagrange multiplier — an unknown scalar you solve for alongside $\mathbf{x}^*$. Its sign tells you which way along the constraint the objective is increasing.
- $\nabla f = \lambda \nabla g$
- "The gradient of $f$ is parallel to the gradient of $g$." Geometrically: the level curve of $f$ is tangent to the constraint curve at the optimum.
Why it works. If the level curve of $f$ crossed the constraint transversally at $\mathbf{x}^*$, you could walk along the constraint in one direction and increase $f$. So at an optimum, the two curves must be tangent, which is the same as saying their normals (i.e., their gradients) are parallel. That parallelism is exactly $\nabla f = \lambda \nabla g$.
Worked example
Maximize $f(x, y) = xy$ subject to $g(x, y) = x + y - 1 = 0$. Gradients:
Gradients of objective and constraint
$$\nabla f = (y, x), \qquad \nabla g = (1, 1)$$
- $(y, x)$
- Gradient of $xy$: partial w.r.t. $x$ is $y$, partial w.r.t. $y$ is $x$.
- $(1, 1)$
- Gradient of the linear constraint $x + y - 1$: both partials are 1.
The Lagrange condition $\nabla f = \lambda \nabla g$ gives $y = \lambda$ and $x = \lambda$, so $x = y$. Combined with the constraint $x + y = 1$, we get $x = y = 1/2$, and the constrained maximum value is $f(1/2, 1/2) = 1/4$. That's the familiar fact that among rectangles with a fixed perimeter, the square has the largest area.
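Because the constraint is a line, we can double-check by brute force: substitute $y = 1 - x$ and scan the resulting one-variable function (a sketch of our own; the scan range and resolution are arbitrary):

```python
import numpy as np

# Maximize xy subject to x + y = 1. On the constraint, y = 1 - x,
# so the problem collapses to maximizing x(1 - x) in one variable.
x = np.linspace(-1.0, 2.0, 300001)   # grid spacing 1e-5
vals = x * (1 - x)

i = np.argmax(vals)
print(np.isclose(x[i], 0.5, atol=1e-5))      # True: optimum at x = y = 1/2
print(np.isclose(vals[i], 0.25, atol=1e-9))  # True: maximum value 1/4
```

Same answer as the multiplier method — which is the point: Lagrange multipliers find, without ever parameterizing the constraint, what this brute-force scan finds by parameterizing it.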
With multiple constraints $g_1 = 0, g_2 = 0, \ldots$, the condition generalizes to $\nabla f = \sum_i \lambda_i \nabla g_i$: the objective's gradient is a linear combination of the constraint gradients. You get one multiplier per constraint, and this is exactly the machinery behind support vector machines, constrained portfolio optimization, and KKT conditions for inequality-constrained optimization (see the optimization page).
13. Code: numerical gradients, Jacobians, Hessians
Now let's get our hands dirty. The code below computes gradients, Jacobians, and Hessians two ways: a finite-difference approximation (no libraries beyond NumPy) and a sketched reverse-mode autodiff — the technique under every real-world deep learning framework.
import numpy as np
def numerical_gradient(f, x, h=1e-5):
    # f: R^n -> R, x: shape (n,). Returns grad shape (n,).
    # Central differences for O(h^2) accuracy.
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp[i] += h
        xm = x.copy(); xm[i] -= h
        grad[i] = (f(xp) - f(xm)) / (2 * h)
    return grad
# Example: f(x,y) = x^2 + 3 x y
def f(v): return v[0]**2 + 3 * v[0] * v[1]
g = numerical_gradient(f, [1.0, 2.0])
print(g) # [8. 3.] — matches analytic answer from §5
# Gradient descent in a few lines. Note: f above is an indefinite quadratic
# with a saddle at the origin (unbounded below), so descend a convex bowl instead:
def bowl(v): return v[0]**2 + v[1]**2
x = np.array([3.0, 4.0])
for _ in range(50):
    x = x - 0.1 * numerical_gradient(bowl, x)
print(x) # approaches the minimizer [0, 0]
import numpy as np
def numerical_jacobian(F, x, h=1e-5):
    # F: R^n -> R^m, x: shape (n,). Returns J shape (m, n).
    x = np.asarray(x, dtype=float)
    F0 = np.asarray(F(x), dtype=float)
    n, m = x.size, F0.size
    J = np.zeros((m, n))
    for j in range(n):
        xp = x.copy(); xp[j] += h
        xm = x.copy(); xm[j] -= h
        J[:, j] = (np.asarray(F(xp)) - np.asarray(F(xm))) / (2*h)
    return J
def numerical_hessian(f, x, h=1e-4):
    # f: R^n -> R. Returns H shape (n, n). Symmetric by construction.
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpm = x.copy(); xpm[i] += h; xpm[j] -= h
            xmp = x.copy(); xmp[i] -= h; xmp[j] += h
            xmm = x.copy(); xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return 0.5 * (H + H.T) # force symmetry to clean up noise
# Example from §7:
def F(v): return np.array([v[0]**2 * v[1], v[0] + np.sin(v[1])])
print(numerical_jacobian(F, [1.0, 0.0]))
# [[0. 1.] row 1: d/dx (x^2 y)=2xy=0, d/dy (x^2 y)=x^2=1
# [1. 1.]] row 2: d/dx (x+sin y)=1, d/dy (x+sin y)=cos(0)=1
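The Hessian helper deserves the same kind of check. A sketch (restating the helper so this snippet runs on its own) against $f(x, y) = x^2 y + y^3$, whose Hessian is $\begin{pmatrix} 2y & 2x \\ 2x & 6y \end{pmatrix}$:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    # Same central-difference Hessian as in section 13.
    x = np.asarray(x, dtype=float)
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += h; xpp[j] += h
            xpm = x.copy(); xpm[i] += h; xpm[j] -= h
            xmp = x.copy(); xmp[i] -= h; xmp[j] += h
            xmm = x.copy(); xmm[i] -= h; xmm[j] -= h
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * h * h)
    return 0.5 * (H + H.T)

def f(v): return v[0]**2 * v[1] + v[1]**3

H = numerical_hessian(f, [1.0, 2.0])
# Analytic Hessian [[2y, 2x], [2x, 6y]] = [[4, 2], [2, 12]] at (1, 2)
```

Note the equality of the off-diagonal entries: that's Clairaut's symmetry of mixed partials showing up numerically.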
import numpy as np
# A cartoon reverse-mode autodiff: tiny tape of ops.
# Each Node holds a value and a list of (parent, local_derivative) pairs.
# A backward pass applies the chain rule from output back to inputs.
class Node:
    def __init__(self, val, parents=()):
        self.val = float(val)
        self.parents = parents # list of (parent_node, local_grad)
        self.grad = 0.0
    def __add__(a, b):
        return Node(a.val + b.val, [(a, 1.0), (b, 1.0)])
    def __mul__(a, b):
        return Node(a.val * b.val, [(a, b.val), (b, a.val)])
    def sin(a):
        return Node(np.sin(a.val), [(a, np.cos(a.val))])

def backward(output):
    # Topological order via DFS.
    order, seen = [], set()
    def visit(n):
        if id(n) in seen: return
        seen.add(id(n))
        for p, _ in n.parents: visit(p)
        order.append(n)
    visit(output)
    output.grad = 1.0
    for n in reversed(order):
        for p, local in n.parents:
            p.grad += n.grad * local # <-- THIS is the chain rule
# Compute f(x, y) = x*y + sin(x), and grad at (x=2, y=3).
x, y = Node(2.0), Node(3.0)
z = x * y + x.sin()
backward(z)
print(x.grad, y.grad) # y + cos(x), x → 3 + cos(2), 2
Three points about that autodiff sketch. First, each primitive op (add, mul, sin) locally knows its own derivative — that's the local gradient in §6's chain rule. Second, the backward pass does one multiply-accumulate per edge of the graph (each "$p.\text{grad} \mathrel{+}= n.\text{grad} \cdot \text{local}$") — that's the chain-rule sum, with the "$+{=}$" handling the case where one node feeds several downstream ops, as $x$ does above (it enters both the multiply and the sine). Third, the whole thing costs roughly one forward pass plus one backward pass — not one forward pass per input. That's the efficiency gain that makes training real networks feasible.
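As a sanity check on the tape's answer, here's a sketch comparing the gradient it produced for $f(x, y) = xy + \sin x$ at $(2, 3)$ — namely $(y + \cos x,\ x)$ — against central differences (restating the finite-difference helper so the snippet stands alone):

```python
import numpy as np

def f(v):
    # The same function the tape differentiated: f(x, y) = x*y + sin(x).
    return v[0] * v[1] + np.sin(v[0])

def numerical_gradient(f, x, h=1e-5):
    # Central differences, as in section 13.
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy(); xp[i] += h
        xm = x.copy(); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

x = np.array([2.0, 3.0])
analytic = np.array([x[1] + np.cos(x[0]), x[0]])  # what the tape computed
numeric = numerical_gradient(f, x)
```

The two agree to the accuracy of the finite differences — but note the asymmetry: autodiff is exact (up to float rounding) and costs one backward pass, while the finite-difference version costs two function evaluations per input.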
14. Back to machine learning
Everything on this page is in service of one picture. A neural network's training objective is a scalar field over its parameters: $\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(\text{net}_\theta(\mathbf{x}_i), \mathbf{y}_i\big)$.
A generic supervised loss
- $\mathcal{L}(\theta)$
- The scalar training loss as a function of the parameter vector $\theta$. This is the scalar field we're descending.
- $\theta$
- The parameter vector — all weights and biases stacked into one long vector in $\mathbb{R}^n$ with $n$ potentially in the billions.
- $N$
- The number of training examples in a batch.
- $(\mathbf{x}_i, \mathbf{y}_i)$
- The $i$-th training input and its target. $\mathbf{x}_i$ might be an image, $\mathbf{y}_i$ its label.
- $\text{net}_\theta$
- The network — a composition of many layers, each parameterised by a slice of $\theta$.
- $\ell$
- A per-example loss (cross-entropy, squared error, etc.). Scalar-valued.
So what do we do with it? Compute $\nabla_\theta \mathcal{L}$ (a vector the size of $\theta$), take a step in the direction $-\nabla_\theta \mathcal{L}$ (steepest descent, §4), repeat. That's the whole algorithm — gradient descent. Every improvement (momentum, Adam, second-order methods, natural gradient) is a modification to this loop that uses more information about the loss landscape to take better steps.
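The loop, plus the smallest of those improvements, fits in a few lines. A sketch on a toy quadratic loss (the bowl, step size, and decay factor below are made-up illustrative choices, and the gradient is written analytically rather than backpropagated):

```python
import numpy as np

def grad_L(theta):
    # Gradient of a toy ill-conditioned loss L(a, b) = 5a^2 + b^2/2.
    return np.array([10.0 * theta[0], theta[1]])

# Vanilla gradient descent: step against the gradient, repeat.
theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = theta - 0.05 * grad_L(theta)

# Momentum: same loop, but accumulate a velocity that smooths the steps.
theta_m, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    v = 0.9 * v + grad_L(theta_m)
    theta_m = theta_m - 0.05 * v
```

On ill-conditioned bowls like this one, the velocity term damps the zig-zagging along the steep direction; Adam goes further and rescales each coordinate individually.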
The multivariable calculus content maps onto ML as follows:
- Gradient (§4): What you descend. Computed by backprop, consumed by the optimizer.
- Chain rule (§6): The algorithm by which gradients are computed. Backpropagation is reverse-mode autodiff applied to the network's computation graph.
- Jacobian (§7): The local linear map of a single layer. Gradients flow through a network by composing layer Jacobians, attention layers included.
- Hessian (§8): Curvature of the loss. Second-order optimizers (K-FAC, natural gradient, Shampoo) use approximations to it; even Adam is secretly estimating diagonal curvature.
- Loss landscape (§10): In billions of dimensions, the landscape is dominated by saddles, not local minima. Training mostly means escaping saddles and navigating flat valleys, not finding the "global" optimum — there usually isn't a useful one.
- Multiple integrals & Jacobian determinant (§11): Show up in probability (change of variables for densities), in normalizing flows, and in computing expected values over continuous distributions.
- Lagrange multipliers (§12): Under the hood of constrained optimization — SVMs, trust-region methods, KKT conditions, and anywhere you've seen "dual variables."
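One of those bullets — stacking layer Jacobians — is worth seeing concretely. A sketch: a two-step "network" $y = \|\tanh(W\mathbf{x})\|^2$, whose gradient is the product of the two local derivative maps (the weights here are random illustrative values):

```python
import numpy as np

# Two steps: z = tanh(W x), then scalar y = ||z||^2.
# The gradient dy/dx is the product of the two local Jacobians.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # made-up layer weights
x = rng.standard_normal(2)

z = np.tanh(W @ x)
J_layer = np.diag(1 - z**2) @ W   # dz/dx, shape (3, 2); tanh' = 1 - tanh^2
g_out = 2 * z                     # dy/dz, shape (3,)
grad = g_out @ J_layer            # chain rule: dy/dx, shape (2,)

# Cross-check against central differences:
def y_of(x):
    return float(np.sum(np.tanh(W @ x) ** 2))

h = 1e-6
fd = np.array([(y_of(x + h * np.eye(2)[i]) - y_of(x - h * np.eye(2)[i])) / (2 * h)
               for i in range(2)])
```

Multiplying by $J^\top$ from the output side, as `g_out @ J_layer` does, is exactly what backprop does layer by layer — it never materializes the full network Jacobian.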
If you internalize only one thing: every neural network you'll ever train is a scalar field in high-dimensional parameter space, and everything we do to it is multivariable calculus.
15. Further reading
Further reading
- Marsden & Tromba — Vector Calculus (6th ed., 2011). The canonical undergraduate textbook. Slow, visual, and patient.
- James Stewart — Calculus: Early Transcendentals. The standard big-book; Chapters 14–16 cover everything on this page at a gentler pace.
- Manfredo do Carmo — Differential Geometry of Curves and Surfaces. Where to go after multivariable calculus if you liked the geometric flavour of the gradient and level sets.
- Hubbard & Hubbard — Vector Calculus, Linear Algebra, and Differential Forms: A Unified Approach. Harder, deeper, and uses linear algebra throughout — closer in spirit to what you need for ML.
- Baydin, Pearlmutter, Radul & Siskind (2018) — Automatic Differentiation in Machine Learning: a Survey. The definitive modern reference for how the chain rule gets automated.
- Goodfellow, Bengio & Courville — Deep Learning, Chapter 4 (Numerical Computation) and Chapter 6 (Feedforward Networks). Re-reads the chain rule in ML language.