Numerical Analysis
Computers cannot represent most real numbers exactly. Every floating-point computation accumulates error. Numerical analysis is the field that makes math work on real hardware — how to solve equations, integrate functions, and factor matrices without the roundoff eating you alive.
1. Why numerical analysis
Open a Python shell and type 0.1 + 0.2. You will not get 0.3. You will get 0.30000000000000004. This is not a bug in Python, or in your CPU, or in IEEE 754. It is a direct consequence of something you already know from decimal arithmetic: you cannot write $1/3$ with finitely many decimal digits. Binary has the same problem, and $0.1$ is one of its victims — in binary, $0.1$ is a repeating fraction that has to be rounded somewhere.
Once you accept that real numbers get rounded on every single operation, you have to start asking uncomfortable questions. If I subtract two floats, how bad can the error get? If I solve a linear system with Gaussian elimination, does the answer mean anything? If I run gradient descent for a million steps, does the accumulated error matter? If I train a neural network in FP16 instead of FP32, do the gradients still point the right way?
Numerical analysis is the field that answers those questions. It is the bridge between the clean, exact math you learned in calculus and linear algebra and the messy, approximate math your CPU and GPU actually perform. It teaches you:
- How to reason about error — where it comes from, how it propagates, and when you should worry.
- How to choose algorithms — two algorithms can be mathematically equivalent and numerically very different. One gives you 12 correct digits, the other gives you 2.
- How to read a matrix — some matrices amplify error by a factor of $10^{15}$ just because of their structure, and there is nothing any algorithm can do about it. You should know that before you start solving.
- How to set tolerances — when you write while error > 1e-8, you need to know whether $10^{-8}$ is even reachable in your precision.
This matters everywhere modern software touches continuous math. Every physics simulation, every rendering engine, every solver in a CAD tool, every ML training run, every weather forecast, every control loop in a drone — all of it rests on numerical analysis whether the programmer knows it or not. Usually it quietly works. Occasionally it falls over in spectacular ways, and then someone has to know this material.
On 25 February 1991, a Patriot missile battery at Dhahran failed to intercept an incoming Scud. 28 soldiers died. The cause was numerical: the battery's internal clock ticked in tenths of a second, stored as a 24-bit fixed-point number. After 100 hours of uptime, roundoff in the $0.1$ representation had accumulated a $0.34$ second drift. At Scud speeds, that was 600 meters. The radar looked in the wrong place. This page is partly about how to not do that.
2. Vocabulary cheat-sheet
A handful of symbols show up on every page of any numerical analysis textbook. Memorize these now and the rest of the page will feel easier.
- $\tilde{x}$ — an approximation of the true value $x$ (the tilde means "computed, not exact").
- $\varepsilon_{\text{mach}}$ — machine epsilon, the gap between $1$ and the next representable float. About $2.2 \times 10^{-16}$ in double precision.
- $\text{fl}(x)$ — the floating-point version of $x$, rounded to the nearest representable value.
- $\kappa(A)$ — the condition number of matrix $A$. Measures how much $A$ amplifies small errors.
- $\|\cdot\|$ — a norm, usually 2-norm (Euclidean) or $\infty$-norm (max absolute value).
- $\mathcal{O}(h^p)$ — "scales like $h^p$" as the step size $h$ shrinks. If a method is $\mathcal{O}(h^2)$, halving $h$ divides the error by 4.
- $f'(x)$, $f''(x)$ — first and second derivatives of $f$ at $x$.
3. Floating-point representation
Every real number your computer touches lives in a box of 16, 32, or 64 bits. Inside that box is a tiny binary scientific-notation expression, specified by the IEEE 754 standard (1985, revised 2008, revised again 2019):
IEEE 754 float layout
$$x = (-1)^s \times 1.m_1 m_2 \dots m_p \times 2^{e - \text{bias}}$$
- $s$
- Sign bit — 1 bit, 0 means positive, 1 means negative.
- $1.m_1 m_2 \dots m_p$
- The significand (also called mantissa), written in binary. There is an implicit leading 1 that does not take up a bit — the stored bits are just the $m_i$. $p$ is the number of explicitly stored bits.
- $e$
- The biased exponent, stored as an unsigned integer. Subtracting the bias recovers the true exponent, so one non-negative field can encode both very small and very large magnitudes.
- $\text{bias}$
- A fixed offset (127 for FP32, 1023 for FP64) so that $e - \text{bias}$ can be negative. Think of it as shifting the zero point so the exponent field is always non-negative.
- $p$ and total bit count
- FP64 (double): 1 sign, 11 exponent, 52 mantissa. FP32 (float): 1, 8, 23. FP16 (half): 1, 5, 10. BF16: 1, 8, 7. FP8 (E4M3): 1, 4, 3.
Intuition. It is scientific notation, in binary, with a fixed number of digits. Just like you cannot write $\pi$ with 5 decimal digits of scientific notation without rounding, you cannot write $0.1$ with 52 binary digits without rounding. The rounding error is always there. The size of the error depends on how many mantissa bits you have.
The consequence of having a fixed number of mantissa bits is that the representable numbers are not evenly spaced on the real line. They are dense near zero and get sparser as you move away. The gap between consecutive floats near a value $x$ is roughly $|x| \cdot \varepsilon_{\text{mach}}$, where $\varepsilon_{\text{mach}}$ is machine epsilon.
Machine epsilon
$$\varepsilon_{\text{mach}} = 2^{-p}$$
- $\varepsilon_{\text{mach}}$
- The gap between $1.0$ and the next representable float above it. Equivalently (up to tie-breaking details in the rounding), the smallest positive $\varepsilon$ such that $\text{fl}(1 + \varepsilon) \ne 1$.
- $p$
- The number of stored mantissa bits. FP64 has $p = 52$, so $\varepsilon_{\text{mach}} \approx 2.22 \times 10^{-16}$. FP32 has $p = 23$, so $\varepsilon_{\text{mach}} \approx 1.19 \times 10^{-7}$.
Why you care. Machine epsilon is the best relative accuracy you can ever hope to get from a single operation. Any algorithm that claims to compute $x$ to better than $\varepsilon_{\text{mach}} \cdot |x|$ absolute accuracy is lying, or using higher precision internally, or both.
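You can measure machine epsilon yourself by repeated halving; a minimal sketch (the name measure_eps is just for illustration):

```python
import sys

def measure_eps():
    """Find machine epsilon by halving: the last power of two whose
    addition to 1.0 still changes the stored value."""
    eps = 1.0
    while 1.0 + eps / 2 != 1.0:
        eps /= 2
    return eps

print(measure_eps())            # 2.220446049250313e-16, i.e. 2**-52
print(sys.float_info.epsilon)   # the same value, straight from the runtime
```

Note the role of round-to-even here: $1 + 2^{-53}$ is exactly halfway between $1$ and $1 + 2^{-52}$ and rounds back to $1$, which is what terminates the loop.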
Every basic floating-point operation ($+$, $-$, $\times$, $\div$, $\sqrt{\cdot}$) is guaranteed by IEEE 754 to produce the exact result, correctly rounded to the nearest representable float. In symbols:
Fundamental floating-point axiom
$$\text{fl}(x \oplus y) = (x \oplus y)(1 + \delta), \qquad |\delta| \le \varepsilon_{\text{mach}}$$
- $x, y$
- Two floating-point inputs (already representable).
- $\oplus$
- Any of the four basic arithmetic operations ($+, -, \times, \div$) performed in exact real arithmetic.
- $\text{fl}(\cdot)$
- Round to the nearest float, with ties going to even.
- $\delta$
- The relative rounding error — small, bounded, but unknown in sign.
- $\varepsilon_{\text{mach}}$
- Machine epsilon from above.
Analogy. Every arithmetic operation is a tiny perturbation by up to $\varepsilon_{\text{mach}}$ of the true answer. A single operation is fine. Ten thousand operations in sequence can accumulate, but typically grow like a random walk (proportional to $\sqrt{n}$) rather than linearly, because the individual errors partially cancel. It is only when you are systematically biased (like summing a million positive numbers, or subtracting two near-equal values) that you lose real precision.
Why $0.1 + 0.2 \ne 0.3$
In binary, $0.1$ is the repeating fraction $0.000110011001100\ldots_2$. Stored as an FP64 value, it gets rounded after 52 mantissa bits. So does $0.2$. So does $0.3$. The rounded $0.1$ plus the rounded $0.2$ does not happen to equal the rounded $0.3$ — it is higher by one ULP (unit in the last place), which in FP64 is about $5.5 \times 10^{-17}$. Your comparison fails.
Never compare floats for equality. Compare with a tolerance: abs(a - b) < tol, where tol should be chosen based on both machine epsilon and the magnitudes of a and b.
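A quick check in plain Python (the helper approx_equal and its tolerance values are illustrative, not a standard API):

```python
import math

a = 0.1 + 0.2   # 0.30000000000000004
b = 0.3

print(a == b)   # False: the two values differ by one ULP

# A tolerance mixing relative and absolute terms, so it behaves sensibly
# both for large magnitudes and near zero.
def approx_equal(x, y, rel=1e-12, abs_tol=1e-12):
    return abs(x - y) <= max(rel * max(abs(x), abs(y)), abs_tol)

print(approx_equal(a, b))                   # True
print(math.isclose(a, b, rel_tol=1e-12))    # True (stdlib equivalent)
```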
ML precision zoo
Classical scientific computing lives in FP64. ML has drifted in the other direction, because training a giant model is bottlenecked on memory bandwidth and matrix-multiply throughput, not on last-digit accuracy. The relevant formats:
| Format | Sign | Exp | Mantissa | Range | $\varepsilon_{\text{mach}}$ | Use |
|---|---|---|---|---|---|---|
| FP64 (double) | 1 | 11 | 52 | $\sim 10^{\pm 308}$ | $2.2 \times 10^{-16}$ | Scientific computing, ground truth |
| FP32 (single) | 1 | 8 | 23 | $\sim 10^{\pm 38}$ | $1.2 \times 10^{-7}$ | Graphics, legacy ML |
| TF32 | 1 | 8 | 10 | $\sim 10^{\pm 38}$ | $9.8 \times 10^{-4}$ | NVIDIA tensor-core matmul inputs (A100+) |
| FP16 (half) | 1 | 5 | 10 | $\sim 6.5 \times 10^{4}$ | $9.8 \times 10^{-4}$ | Mixed-precision training / inference |
| BF16 (bfloat16) | 1 | 8 | 7 | $\sim 10^{\pm 38}$ | $7.8 \times 10^{-3}$ | Modern LLM training (TPU/H100) |
| FP8 E4M3 | 1 | 4 | 3 | $\sim 448$ | $0.125$ | H100 forward / weights |
| FP8 E5M2 | 1 | 5 | 2 | $\sim 5.7 \times 10^{4}$ | $0.25$ | H100 gradients |
The pattern: as you shrink the format, you give up either range (how big or small the numbers can be) or precision (how finely you resolve them). BF16 keeps FP32's range and sacrifices precision — it turns out that for ML gradients, range matters more than mantissa bits, because gradient values span many orders of magnitude but do not need 7 digits of accuracy. This is why BF16 replaced FP16 for LLM training around 2020.
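A quick way to poke at the formats numpy supports natively (numpy has no BF16 or FP8; those live in ML frameworks, so only the IEEE trio appears here):

```python
import numpy as np

# Machine epsilon and max finite value for each native float format.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.eps, info.max)

# In narrow formats, range runs out before precision becomes your problem:
print(np.float16(70000.0))   # inf — FP16's largest finite value is 65504
```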
4. Error types — absolute, relative, forward, backward
When you have an approximation $\tilde{x}$ of a true value $x$, the two most basic measurements of how wrong it is are:
Absolute and relative error
- $x$
- The exact, true answer (which in practice you usually do not know — that is the whole point).
- $\tilde{x}$
- The approximation your algorithm actually produced.
- Absolute
- How far off you are, in the same units as $x$.
- Relative
- How far off you are as a fraction of the thing you are measuring. Dimensionless. For floating-point, this is usually what you want — being off by $10^{-6}$ is terrible if $x = 10^{-8}$ and excellent if $x = 10^{10}$.
Why both. Absolute error is the right metric when $x$ is small or zero (the relative metric blows up). Relative error is the right metric when comparing across scales. Most algorithms get judged on relative error, but you will see both.
The subtler distinction — and the one that separates numerical analysts from everyone else — is forward versus backward error. Suppose you want to compute $y = f(x)$, and your algorithm gives you $\tilde{y}$ instead.
- Forward error: $|\tilde{y} - y|$. The gap between what you got and the right answer. Obvious.
- Backward error: the smallest perturbation $\Delta x$ such that $\tilde{y} = f(x + \Delta x)$ exactly. In other words — what problem did you actually solve? You solved a nearby one. How nearby?
Why this matters: backward error is usually much easier to bound than forward error, and it cleanly separates "the algorithm did its job" from "the problem itself was hopeless." If the backward error is tiny but the forward error is huge, the problem was ill-conditioned (next section). Blame the problem, not the code.
Roundoff versus truncation
There are two fundamentally different sources of numerical error:
- Roundoff error: the floats cannot represent the exact real numbers, so every operation rounds. Bounded by $\varepsilon_{\text{mach}}$ per operation.
- Truncation error: the algorithm approximates an infinite process (a derivative, an integral, an infinite series) by a finite one. Bounded by how "good" the approximation is — usually some power of a step size $h$.
These go in opposite directions as you increase effort. Smaller $h$ lowers truncation error but raises roundoff (because you subtract more near-equal numbers). There is an optimal $h$ somewhere in between, and finding it is a recurring theme — most memorably in numerical differentiation, which we'll get to.
5. Catastrophic cancellation
The single biggest source of real-world numerical disasters is subtracting two floating-point numbers that are nearly equal. When you do, almost all the leading digits cancel and what you are left with is dominated by the rounding error that was sitting in the last few bits of each input.
Say $a = 1.23456789012345$ and $b = 1.23456789012340$. Both are stored to 15 significant figures. The true difference is $5 \times 10^{-14}$. But each input had roundoff error at the $10^{-14}$ level in its last digit, and that error did not cancel — now it dominates the answer. You have lost 14 digits of precision in one subtraction.
Every significant bit of precision you want out of a subtraction must be present beyond the bits that cancelled. Subtracting two numbers that agree in the first $k$ binary digits loses $k$ bits of precision, period.
Worked example: the quadratic formula
You learned in school to solve $ax^2 + bx + c = 0$ with:
Quadratic formula (naive)
$$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$
- $a, b, c$
- The coefficients of the polynomial $ax^2 + bx + c$.
- $x_{1,2}$
- The two roots — there are always two, possibly complex, possibly repeated.
- $\sqrt{b^2 - 4ac}$
- The discriminant. Positive for two real roots, zero for a double root, negative for complex conjugates.
The catch. If $b$ is large and $ac$ is small, then $\sqrt{b^2 - 4ac}$ is very close to $|b|$. For the root where $-b$ and the $\pm\sqrt{b^2 - 4ac}$ term have opposite signs (taking $+$ when $b > 0$), the two terms nearly cancel. Your answer for that root is garbage.
Concrete: take $a = 1$, $b = 10^9$, $c = 1$. Mathematically the roots are very close to $-10^9$ and $-10^{-9}$. The first is fine. For the second, you compute $-b + \sqrt{b^2 - 4ac} = -10^9 + \sqrt{10^{18} - 4}$. In FP64 the $-4$ is already lost: consecutive floats near $10^{18}$ are spaced $128$ apart, so $b^2 - 4ac$ rounds to exactly $10^{18}$, the square root is exactly $10^9$, and you get $-10^9 + 10^9 = 0$. The computed root is $0$ instead of $-10^{-9}$: 100% relative error, no correct digits at all.
The fix is to use a mathematically equivalent rewrite that dodges the cancellation. Rationalize:
Stable quadratic formula
$$x_1 = \frac{2c}{-b - \sqrt{b^2 - 4ac}}$$
- Left-hand side
- One of the two roots, rewritten by multiplying numerator and denominator by the conjugate $(-b - \sqrt{b^2 - 4ac})$.
- $2c / (-b - \sqrt{b^2 - 4ac})$
- Now both terms in the denominator have the same sign (assuming $b > 0$), so they add instead of subtracting. No cancellation.
The trick. Pick the sign in $\pm$ so that $-b \mp \sqrt{b^2 - 4ac}$ has no cancellation — i.e., so the two terms have the same sign. Then get the other root from $x_1 x_2 = c/a$ (Vieta's formula). This recipe gives you 15+ correct digits on both roots.
The general moral: if you find yourself subtracting two nearly-equal numbers, stop, and ask whether there is an algebraic rewrite that does the same computation without subtracting. There almost always is. Taylor series, exponentials, logs, and trigonometric identities all have stable rewrites you should learn on the job.
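A sketch of the stable recipe in Python; quadratic_roots is an illustrative helper assuming real roots and $a \ne 0$, and the test case $b = 10^8$ is large enough for the naive formula to destroy most of the small root's digits:

```python
import math

def quadratic_roots(a, b, c):
    """Both roots of ax^2 + bx + c = 0, avoiding the -b + sqrt cancellation."""
    disc = math.sqrt(b * b - 4 * a * c)
    # Add sqrt with the SAME sign as b, so the terms reinforce, never cancel.
    q = -0.5 * (b + math.copysign(disc, b))
    return q / a, c / q   # large root directly, small root via Vieta: x1*x2 = c/a

a, b, c = 1.0, 1e8, 1.0   # roots near -1e8 and -1e-8
naive_small = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
x_large, x_small = quadratic_roots(a, b, c)

print(naive_small)   # off by roughly 25%: cancellation ate the digits
print(x_small)       # -1e-08 to full double precision
```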
6. Conditioning and stability
Conditioning is about the problem. Stability is about the algorithm. They are different and people confuse them constantly.
- A problem is well-conditioned if small changes in the input cause small changes in the output. Ill-conditioned if they do not. Ill-conditioning is a property of the mathematics, not the code.
- An algorithm is stable if it does not itself amplify errors much beyond the unavoidable. An algorithm can be stable on a well-conditioned problem and still give garbage on an ill-conditioned one — because the ill-conditioning is not its fault.
For a function $f$, the condition number at input $x$ measures how much a relative change in $x$ gets amplified in $f(x)$:
Condition number of a function
$$\kappa_f(x) = \lim_{\delta \to 0} \; \sup_{|\Delta x| \le \delta} \; \frac{|f(x + \Delta x) - f(x)| \,/\, |f(x)|}{|\Delta x| \,/\, |x|}$$
- $\kappa_f(x)$
- The function's condition number at the point $x$. Dimensionless.
- $\Delta x$
- A small perturbation of the input.
- Numerator
- The relative change in output caused by the perturbation.
- Denominator
- The relative change in input.
- $\sup$ and $\lim$
- "Worst case over all small enough perturbations" — the supremum as $\delta \to 0$ is the steepest amplification direction.
Rule of thumb. A condition number of $10^k$ means you lose about $k$ digits of precision. If $\kappa \approx 10^{12}$ and you compute in FP64 (which has $\sim 16$ digits), you should expect about $16 - 12 = 4$ correct digits in the output. If $\kappa \approx 10^{16}$, you get noise.
For smooth $f$ of one variable, there is a closed form: $\kappa_f(x) = |x f'(x) / f(x)|$. So for $f(x) = \sqrt{x}$, $\kappa = 1/2$ (well-conditioned). For $f(x) = \tan(x)$ near $\pi/2$, $\kappa \to \infty$ (ill-conditioned — a tiny change in $x$ blows $\tan(x)$ up).
Condition number of a matrix
For a linear system $Ax = b$, the condition number of $A$ tells you how much relative error in $b$ can contaminate your answer for $x$:
Matrix condition number
$$\kappa(A) = \|A\| \, \|A^{-1}\|$$
- $A$
- A square invertible matrix.
- $\|A\|$
- An operator norm of $A$ — usually the 2-norm, which equals the largest singular value $\sigma_{\max}(A)$.
- $\|A^{-1}\|$
- The operator norm of the inverse. In the 2-norm, this is $1/\sigma_{\min}(A)$.
- $\kappa(A)$
- In the 2-norm, $\kappa_2(A) = \sigma_{\max}(A) / \sigma_{\min}(A)$ — the ratio of the largest singular value to the smallest. $\kappa(A) \ge 1$ always, with equality for orthogonal matrices.
Why you care. If $\kappa(A) = 10^{10}$ and your $b$ has $10^{-16}$ relative rounding error, your computed $x$ can be off by $10^{-6}$ relative — that is only six correct digits, and no amount of algorithmic care can fix it. Gaussian elimination did its job perfectly. The matrix is what is broken.
Ill-conditioned matrices come up more than you would expect. Vandermonde matrices (from polynomial interpolation at evenly spaced points) have $\kappa \sim 10^{10}$ at size 20. The Hilbert matrix $H_{ij} = 1/(i+j-1)$ has $\kappa$ that grows like $\sim (1+\sqrt{2})^{4n}$ — the $10 \times 10$ Hilbert matrix has $\kappa \approx 10^{13}$. Any matrix close to singular is ill-conditioned. Any design matrix in a regression with nearly-collinear features is ill-conditioned. This is why ridge regression (Tikhonov regularization) works — it artificially inflates the small singular values, trading bias for conditioning.
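You can watch ill-conditioning eat digits with numpy; the hilbert helper below is illustrative:

```python
import numpy as np

def hilbert(n):
    """The n x n Hilbert matrix H[i, j] = 1 / (i + j - 1), 1-based indices."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)

H = hilbert(10)
print(np.linalg.cond(H))   # ~1e13: expect to lose about 13 of FP64's 16 digits

# Solve H x = b with a known answer and measure the damage.
x_true = np.ones(10)
b = H @ x_true
x_hat = np.linalg.solve(H, b)
err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(err)                 # many orders of magnitude above 1e-16
```

The solver is backward stable; the forward error is entirely the matrix's fault, exactly as the rule of thumb predicts.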
7. Root-finding
Given a function $f$, find an $x$ such that $f(x) = 0$. This is the workhorse primitive. You reduce equation-solving to this form, and then pick an algorithm based on what you know about $f$.
Bisection — the bulletproof method
If you know a sign change — $f(a) < 0$ and $f(b) > 0$ for some interval $[a, b]$ — and $f$ is continuous, then by the intermediate value theorem there is a root in $[a, b]$. Bisection: evaluate $f$ at the midpoint $c = (a+b)/2$, and replace $a$ or $b$ with $c$ depending on the sign. The interval halves. Repeat.
After $n$ steps, the interval has width $(b - a)/2^n$. To get $k$ bits of accuracy you need $k$ steps. This is linear convergence — one more bit of accuracy per step. Slow but absolutely guaranteed. Bisection cannot fail; it cannot miss the root; it is the method you fall back on when fancier methods misbehave.
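A minimal bisection sketch, assuming a valid sign-change bracket (the cubic is the one used in the interactive section below):

```python
def bisect(f, a, b, tol=1e-12):
    """Root of continuous f in [a, b]; assumes f(a) and f(b) differ in sign."""
    fa = f(a)
    while b - a > tol:
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0:
            return c
        if (fa < 0) == (fc < 0):
            a, fa = c, fc     # f(c) has a's sign: the root lies in [c, b]
        else:
            b = c             # sign change is in [a, c]
    return 0.5 * (a + b)

root = bisect(lambda x: x**3 - 2*x - 5, 2.0, 3.0)
print(root)   # near 2.0945515
```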
Newton's method — the fast one
If you can evaluate both $f$ and its derivative $f'$, Newton's method takes your current guess $x_n$ and draws the tangent line to $f$ at $(x_n, f(x_n))$. Where that tangent line crosses zero becomes $x_{n+1}$:
Newton's method
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
- $x_n$
- The current iterate — your guess after $n$ steps.
- $x_{n+1}$
- The next iterate.
- $f(x_n)$
- The value of $f$ at the current guess. Zero when you are done.
- $f'(x_n)$
- The derivative at the current guess — the slope of the tangent line.
- $-f(x_n)/f'(x_n)$
- The horizontal distance from the current guess to where the tangent line crosses the x-axis. Subtract it from $x_n$ and you are at that crossing point.
Derivation. The tangent line at $x_n$ is $y = f(x_n) + f'(x_n)(x - x_n)$. Set $y = 0$ and solve for $x$: $x = x_n - f(x_n)/f'(x_n)$. That is the next guess. Geometrically: pretend $f$ is locally linear, find where the linear approximation hits zero, move there. Repeat.
Newton's method converges quadratically when it works: if $x_n$ is close enough to a simple root $x^*$, the error roughly squares each step:
Quadratic convergence
$$|x_{n+1} - x^*| \le C \, |x_n - x^*|^2$$
- $x^*$
- The true root. $f(x^*) = 0$.
- $|x_n - x^*|$
- The error of iterate $n$.
- $C$
- A constant depending on $f$ — specifically $C = |f''(x^*)| / (2|f'(x^*)|)$ for simple roots.
- Squared
- If you have 3 correct digits, one more step gives you 6. Then 12. Then 24 — except you hit machine epsilon and stop.
The downside. "When it works." Newton can diverge wildly if your initial guess is far from the root, or if $f'$ is small near the root, or if $f$ is not smooth. It is fast or it is wrong; there is no middle ground. The practical recipe is a safeguarded hybrid: bracket the root, take fast local steps, and fall back to bisection whenever a step misbehaves. That is what production root-finders like SciPy's brentq actually do (Brent's method uses secant and inverse-quadratic steps rather than derivatives, but the safeguarding idea is the same).
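A minimal Newton sketch; stopping on $|f(x)|$ rather than on the step size is one of several reasonable choices:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton iteration; assumes x0 is close enough to a simple root."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)   # step to where the tangent line hits zero
    return x

# Same cubic as the visualizer: f(x) = x^3 - 2x - 5, root near 2.0945515.
root = newton(lambda x: x**3 - 2*x - 5, lambda x: 3*x**2 - 2, x0=2.0)
print(root)
```

Watch the convergence: from $x_0 = 2$ the iterates are $2.1$, $2.09457\ldots$, $2.09455148\ldots$ — digits double each step, as the quadratic-convergence bound promises.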
Secant method — Newton without the derivative
What if you do not have $f'$, just $f$? Use a finite-difference approximation of the derivative from the last two iterates:
Secant method
$$x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}$$
- $x_n, x_{n-1}$
- The two most recent guesses.
- $(f(x_n) - f(x_{n-1}))/(x_n - x_{n-1})$
- A finite-difference estimate of $f'(x_n)$ — the slope of the line (secant) connecting the last two points on the curve.
- Update
- Same as Newton, but with the secant slope in place of the true derivative.
Convergence. Superlinear, with rate $\phi = (1 + \sqrt{5})/2 \approx 1.618$ — the golden ratio. Faster than bisection, slower than Newton, but no derivative needed. A nice middle ground.
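A matching secant sketch, same stopping rule, no derivative required:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: Newton with a finite-difference slope
    built from the last two iterates."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:   # converged, or slope degenerate
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    return x1

root = secant(lambda x: x**3 - 2*x - 5, 2.0, 3.0)
print(root)   # near 2.0945515
```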
8. Interactive: Newton's method visualizer
The function below is $f(x) = x^3 - 2x - 5$ — a classic cubic with one real root near $x \approx 2.09455$. Click anywhere on the plot to set the initial guess $x_0$. Newton's method will iterate from there, drawing each tangent step in orange until it converges or wanders off. The slider changes the function. Try a bad initial guess near a local extremum and watch what happens.
Click anywhere to set an initial guess. Orange segments show the tangent at each iterate and where it crosses zero.
9. Linear systems — direct methods
Given $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$, find $x$ such that $Ax = b$. You learned in linear algebra that the answer is $x = A^{-1} b$. You should almost never compute it that way — forming $A^{-1}$ costs roughly three times as many operations as solving directly, and it is numerically worse because you round every entry of the inverse.
Gaussian elimination, honestly
Gaussian elimination turns $A$ into an upper-triangular matrix $U$ by adding multiples of one row to another. The multipliers form a lower-triangular matrix $L$ with ones on the diagonal. The result is:
LU decomposition
$$A = LU$$
- $A$
- The $n \times n$ input matrix.
- $L$
- Lower-triangular matrix with 1s on the diagonal. $L_{ij}$ for $i > j$ is the multiplier "row $j$ subtracted from row $i$" during elimination.
- $U$
- Upper-triangular matrix — what $A$ became after elimination.
Why factor. Once you have $L$ and $U$, solving $Ax = b$ becomes $L(Ux) = b$ — first forward-substitute to get $y$ from $Ly = b$, then back-substitute to get $x$ from $Ux = y$. Each triangular solve is $\mathcal{O}(n^2)$. The factorization is $\mathcal{O}(n^3)$ once, but you can reuse it for many right-hand sides.
Forward substitution unpacks $L y = b$ one equation at a time, top down: $y_1 = b_1$, then $y_2 = b_2 - L_{21} y_1$, and so on. Back substitution does the same for $U x = y$, bottom up: $x_n = y_n / U_{nn}$, then $x_{n-1} = (y_{n-1} - U_{n-1,n} x_n)/U_{n-1,n-1}$.
Why pivoting matters
Naive Gaussian elimination can catastrophically fail on matrices that are perfectly invertible. Consider:
Tiny-pivot example
$$A = \begin{pmatrix} 10^{-20} & 1 \\ 1 & 1 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$
- $A$
- A $2 \times 2$ matrix with a very small top-left entry.
- $b$
- The right-hand side.
- True solution
- $x \approx (1, 1)$ — close to the line-intersection of the two rows.
What breaks. Naive elimination uses $A_{11} = 10^{-20}$ as the pivot. It subtracts $10^{20}$ times row 1 from row 2, which gives a second equation that is completely dominated by the $10^{20}$ multiplier. In FP64, the original $1$s in row 2 are lost to rounding. Back-substitution returns $x_2 \approx 1$ (fine) but $x_1 \approx 0$ (wildly wrong).
The fix is partial pivoting: at each elimination step, swap rows so the pivot (the element you are dividing by) is the largest in absolute value of the candidates. This gives you $PA = LU$, where $P$ is a permutation matrix recording the swaps. It is what every production solver does, and what numpy.linalg.solve calls under the hood. It costs almost nothing and makes the factorization backward-stable in practice.
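Elimination with partial pivoting fits in a few lines; this is a teaching sketch, not a production solver (use numpy.linalg.solve for real work):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting, then back substitution."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = list(b)
    for k in range(n):
        # Partial pivoting: bring the largest |entry| in column k to the pivot row.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# The tiny-pivot system: naive elimination returns x1 = 0; pivoting gets (1, 1).
print(solve([[1e-20, 1.0], [1.0, 1.0]], [1.0, 2.0]))
```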
LU with partial pivoting is $\mathcal{O}(n^3)$ in floating-point operations. For $n = 1000$, that is $\sim 10^9$ ops — under a second on a laptop. For $n = 10^4$ it is $\sim 10^{12}$ — minutes. For $n = 10^6$, forget direct methods entirely. That is where the next section comes in.
10. Iterative methods for linear systems
When $A$ is enormous and sparse — which it is for most physical simulations, where each variable interacts with only a handful of neighbors — direct methods are too expensive. You use iterative methods instead. They start with a guess $x_0$ and refine it by applying $A$ as a matrix-vector multiply (which is cheap if $A$ is sparse) until the residual is small.
Jacobi
Split $A = D + R$, where $D$ is the diagonal and $R$ is everything else. Rewrite $Ax = b$ as $Dx = b - Rx$, and iterate:
Jacobi iteration
$$x^{(k+1)} = D^{-1}\left(b - R\,x^{(k)}\right)$$
- $x^{(k)}$
- The $k$-th iterate (a full vector). We use the parenthesized superscript to distinguish iteration index from component index.
- $D$
- The diagonal part of $A$. $D^{-1}$ is trivial to apply — just divide component $i$ by $A_{ii}$.
- $R = A - D$
- The off-diagonal part of $A$. Cheap to apply if $A$ is sparse.
Convergence. Jacobi converges if $A$ is strictly diagonally dominant ($|A_{ii}| > \sum_{j \ne i} |A_{ij}|$ for all $i$). Easy to parallelize — every component of $x^{(k+1)}$ uses only $x^{(k)}$, no dependencies between them. Unfortunately it converges slowly.
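A direct transcription of the Jacobi update, assuming strict diagonal dominance:

```python
def jacobi(A, b, tol=1e-10, max_iter=10_000):
    """Jacobi iteration for Ax = b; A given as a dense list of rows."""
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        # Every component of the new iterate uses only the OLD iterate.
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

# Strictly diagonally dominant system with solution (1, 1, 1).
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
print(jacobi(A, b))
```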
Gauss–Seidel
Same idea, but use the newly updated components as soon as they are available. Component-wise:
Gauss–Seidel iteration
$$x_i^{(k+1)} = \frac{1}{A_{ii}}\left(b_i - \sum_{j < i} A_{ij}\,x_j^{(k+1)} - \sum_{j > i} A_{ij}\,x_j^{(k)}\right)$$
- $x_i^{(k+1)}$
- The new value of component $i$.
- $j < i$
- Components that have already been updated this sweep — use the new $x_j^{(k+1)}$.
- $j > i$
- Components not yet updated — still use the old $x_j^{(k)}$.
Convergence. Typically about twice as fast as Jacobi, with similar conditions. Less parallelizable, since component $i$ depends on component $i-1$. The basis of Successive Over-Relaxation (SOR), which adds a damping parameter to accelerate further.
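The same kind of system solved with Gauss–Seidel's in-place update; the only structural change from Jacobi is that x is overwritten during the sweep:

```python
def gauss_seidel(A, b, tol=1e-10, max_iter=10_000):
    """Gauss-Seidel: like Jacobi, but uses updated components immediately."""
    n = len(A)
    x = [0.0] * n
    for _ in range(max_iter):
        delta = 0.0
        for i in range(n):
            # x[j] for j < i already holds this sweep's value; j > i is last sweep's.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / A[i][i]
            delta = max(delta, abs(new - x[i]))
            x[i] = new
        if delta < tol:
            break
    return x

A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
print(gauss_seidel(A, b))
```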
Conjugate gradient
For symmetric positive definite $A$, the gold standard is conjugate gradient (CG), introduced by Hestenes and Stiefel in 1952. CG builds up the solution in a Krylov subspace — each iteration is $\mathcal{O}(\text{nnz}(A))$ (cost of one sparse matvec), and it is guaranteed to converge in at most $n$ steps in exact arithmetic (and usually far fewer in practice). Preconditioned CG is what solves the linear systems inside finite-element simulations, and it is the direct ancestor of the trust-region methods at the heart of optimization libraries.
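A bare-bones CG sketch with the matrix supplied as a matvec callback (no preconditioning; for real work use scipy.sparse.linalg.cg):

```python
def cg(matvec, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradient for SPD A, given only as a matvec function."""
    x = [0.0] * len(b)
    r = b[:]                                   # residual b - A x, with x = 0
    p = r[:]                                   # first search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:                # stop when the residual is tiny
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# An SPD tridiagonal system with solution (1, 1, 1), applied matrix-free.
matvec = lambda v: [4 * v[0] + v[1], v[0] + 4 * v[1] + v[2], v[1] + 4 * v[2]]
x = cg(matvec, [5.0, 6.0, 5.0])
print(x)
```

Note that the code never forms the matrix: CG only ever touches $A$ through matrix-vector products, which is exactly why it scales to enormous sparse systems.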
Rule of thumb: use direct methods when $n < 10^4$ or $A$ is dense. Use iterative methods when $n > 10^5$ or $A$ is sparse. Between those, profile.
11. Interpolation
Given $n+1$ points $(x_0, y_0), \dots, (x_n, y_n)$ with distinct $x_i$, find a function $p$ that passes through all of them. The natural choice is a polynomial of degree $\le n$ — and there is exactly one such polynomial.
Lagrange form
Lagrange interpolating polynomial
$$p(x) = \sum_{i=0}^{n} y_i\,\ell_i(x), \qquad \ell_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - x_j}{x_i - x_j}$$
- $n + 1$
- The number of data points.
- $(x_i, y_i)$
- The $i$-th data point. All $x_i$ are distinct.
- $\ell_i(x)$
- The $i$-th Lagrange basis polynomial. By construction $\ell_i(x_j) = 1$ if $i = j$ and $0$ otherwise.
- $p(x)$
- The unique interpolating polynomial. Evaluated at $x_j$, every $\ell_i$ is zero except $\ell_j$, which is one, so $p(x_j) = y_j$. Done.
Intuition. Each basis polynomial is a "switch" — 1 at its own point, 0 at all others. The interpolant is a weighted sum where the weights are the actual $y_i$ values. It is elegant but numerically fragile, especially for high $n$.
Newton's divided differences
A more computationally friendly form builds the polynomial incrementally:
Newton divided differences
$$p(x) = f[x_0] + f[x_0, x_1](x - x_0) + \dots + f[x_0, \dots, x_n](x - x_0)\cdots(x - x_{n-1})$$
- $f[x_i] = y_i$
- Zeroth-order divided difference — just the value.
- $f[x_i, x_{i+1}] = (f[x_{i+1}] - f[x_i])/(x_{i+1} - x_i)$
- First-order divided difference — a slope.
- $f[x_i, \dots, x_{i+k}]$
- $k$-th-order divided difference, defined recursively.
Why bother. Adding a new data point only requires computing one more coefficient — you do not rebuild the polynomial. And as the nodes coalesce, the $k$-th divided difference converges to $f^{(k)}/k!$, a neat bridge between interpolation and Taylor series.
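The divided-difference table and a Horner-style evaluation fit in a few lines; a sketch (helper names are illustrative):

```python
def newton_coeffs(xs, ys):
    """Divided-difference coefficients f[x0], f[x0,x1], ..., f[x0..xn],
    computed in place column by column."""
    n = len(xs)
    c = list(ys)
    for k in range(1, n):
        for i in range(n - 1, k - 1, -1):   # bottom-up so old values survive
            c[i] = (c[i] - c[i - 1]) / (xs[i] - xs[i - k])
    return c

def newton_eval(xs, c, x):
    """Nested (Horner-style) evaluation of the Newton-form interpolant."""
    p = c[-1]
    for i in range(len(c) - 2, -1, -1):
        p = p * (x - xs[i]) + c[i]
    return p

# Interpolate f(x) = x^2 + x + 1 through (0,1), (1,3), (2,7).
xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 7.0]
c = newton_coeffs(xs, ys)
print(c)                        # [1.0, 2.0, 1.0]: p(x) = 1 + 2x + x(x-1)
print(newton_eval(xs, c, 3.0))  # 13.0, matching 3^2 + 3 + 1
```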
Runge's phenomenon — the warning
High-degree polynomial interpolation at equally spaced points is a trap. Carl Runge showed in 1901 that interpolating $f(x) = 1/(1 + 25 x^2)$ at equally spaced points on $[-1, 1]$ gives an interpolant whose error near the endpoints grows without bound as you add more points. The polynomial wiggles violently at the edges.
Never use a degree-20+ polynomial to interpolate data at uniformly spaced points. The result looks smooth near the middle and explodes near the ends. This is counterintuitive — "more data points" feels like it should help — but it categorically does not.
Two fixes:
- Chebyshev nodes. Sample $x_i$ at the projections of equally spaced angles on a half-circle — $x_i = \cos((2i+1)\pi/(2n+2))$. This clusters points near the endpoints, where the interpolant needs the most data. Runge's phenomenon goes away.
- Piecewise low-degree polynomials (splines). Instead of one high-degree polynomial for the whole interval, use a cubic on each subinterval, matched at the knots so that the first and second derivatives are continuous. These are cubic splines, the default in scientific plotting libraries and CAD tools. They are stable, smooth, and converge reliably as you add points.
12. Numerical integration (quadrature)
Given a function $f$ on an interval $[a, b]$, approximate $\int_a^b f(x)\,dx$. Most integrals you encounter in practice do not have closed forms — the Gaussian CDF, Bessel functions, almost any ML loss integrated over a data distribution — so you need to compute them numerically.
The family of Newton–Cotes rules
Split $[a, b]$ into $n$ subintervals of width $h = (b-a)/n$. Replace $f$ with a local polynomial on each subinterval and integrate the polynomial exactly. Three classic choices, in ascending order of sophistication.
Midpoint rule (zeroth-order — replace $f$ with its midpoint value on each subinterval):
Composite midpoint rule
$$\int_a^b f(x)\,dx \approx h \sum_{i=0}^{n-1} f\!\left(a + \left(i + \tfrac{1}{2}\right)h\right)$$
- $h = (b-a)/n$
- The subinterval width.
- $a + (i + 1/2) h$
- The center of the $i$-th subinterval.
- Error
- $\mathcal{O}(h^2)$ — halve $h$, divide error by 4.
Picture. A histogram whose bar height is the function value at the bar's center. Surprisingly good; it beats the trapezoidal rule for the same cost on smooth functions, because it gets second-order accuracy from its symmetry.
Trapezoidal rule (first-order — connect consecutive sample points with straight lines and integrate the trapezoids):
Composite trapezoidal rule
$$\int_a^b f(x)\,dx \approx \frac{h}{2}\left(f(a) + 2\sum_{i=1}^{n-1} f(a + ih) + f(b)\right)$$
- $f(a), f(b)$
- Endpoint values, counted once.
- $f(a + ih)$
- Interior points, counted twice (each one belongs to two adjacent trapezoids).
- $h/2$
- Half the subinterval width — the "base over 2" factor from the trapezoid area formula.
- Error
- $\mathcal{O}(h^2)$. Formally, error $= -\frac{(b-a)h^2}{12} f''(\xi)$ for some $\xi \in [a, b]$. Proportional to $f''$ — so trapezoidal is exact on linear functions.
Why you know this. The trapezoidal rule is the default fallback for numerical integration. It is simple, reliable, and has an extremely nice property: on smooth periodic functions integrated over a full period, it converges exponentially fast, not polynomially. This is why Fourier-transform based methods work so well on periodic data.
Simpson's rule (second-order — fit a parabola through three consecutive sample points and integrate that):
Composite Simpson's rule
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left(f(a) + 4f(a+h) + 2f(a+2h) + \dots + 4f(b-h) + f(b)\right)$$
- $n$
- Must be even — Simpson's rule integrates a parabola over pairs of subintervals.
- Coefficients $1, 4, 2, 4, 2, \dots, 4, 1$
- Weight pattern from integrating the local parabola. Endpoints once, odd interior points $\times 4$, even interior points $\times 2$.
- Error
- $\mathcal{O}(h^4)$ — halve $h$, divide error by 16. Much faster convergence than trapezoidal for the same number of function evaluations.
The upgrade. Trapezoidal fits lines between samples; Simpson fits parabolas through trios of samples. Same number of evaluations, much better answer on smooth functions. It is the first thing to reach for when trapezoidal is not accurate enough.
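A sketch of composite Simpson in NumPy (the `simpson` name is ours), reusing the same $\sin$-over-$[0, \pi]$ test integral:

```python
import numpy as np

def simpson(f, a, b, n):
    # Composite Simpson's rule; n must be even.
    if n % 2:
        raise ValueError("n must be even for Simpson's rule")
    x = np.linspace(a, b, n + 1)
    y = f(x)
    h = (b - a) / n
    # Weights 1, 4, 2, 4, 2, ..., 4, 1: odd interior points x4, even x2.
    return h / 3 * (y[0] + 4*y[1:-1:2].sum() + 2*y[2:-1:2].sum() + y[-1])

# Integrate sin(x) over [0, pi] (true answer = 2)
for n in [10, 100, 1000]:
    err = abs(simpson(np.sin, 0, np.pi, n) - 2.0)
    print(f"n={n:5d} error={err:.2e}")
# O(h^4): each 10x in n buys roughly 10^4x less error, until
# roundoff puts a floor under it.
```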
Gaussian quadrature (brief)
Newton–Cotes uses equally spaced sample points. What if you get to choose where to sample? Gauss showed in 1814 that by picking $n$ optimally placed points (the roots of Legendre polynomials) with optimally chosen weights, you can exactly integrate polynomials up to degree $2n - 1$. That is a factor of two better than any equally spaced rule, and for smooth functions the error converges exponentially in $n$. This is essentially what SciPy's quad does internally: it wraps QUADPACK's adaptive Gauss–Kronrod rules. You almost never need to implement this yourself — just know the name.
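If you want to see it in action anyway, NumPy ships the nodes and weights directly via `np.polynomial.legendre.leggauss`. A quick sketch integrating $e^x$ over $[-1, 1]$ (true answer $e - 1/e$):

```python
import numpy as np

# Gauss-Legendre nodes and weights on [-1, 1].
# n points integrate polynomials up to degree 2n - 1 exactly.
truth = np.e - 1/np.e
for n in [2, 3, 5, 8]:
    nodes, weights = np.polynomial.legendre.leggauss(n)
    approx = (weights * np.exp(nodes)).sum()
    print(f"n={n} error={abs(approx - truth):.2e}")
# The error collapses exponentially in n; a handful of points
# is already near machine precision for this smooth integrand.
```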
13. Numerical differentiation
Given a function you can evaluate but cannot symbolically differentiate, approximate $f'(x)$. The natural starting point is the limit definition:
Forward finite difference
$$f'(x) \;\approx\; \frac{f(x+h) - f(x)}{h}$$
- $h$
- A small positive step size.
- $f(x+h) - f(x)$
- Change in $f$ over a step of size $h$ — the numerator of the derivative definition.
- Error
- $\mathcal{O}(h)$ — first-order. Cut $h$ in half, cut error in half. Proof: Taylor-expand $f(x+h) = f(x) + h f'(x) + \frac{h^2}{2} f''(\xi)$, rearrange.
The central difference uses a symmetric two-sided step and is much better:
Central finite difference
$$f'(x) \;\approx\; \frac{f(x+h) - f(x-h)}{2h}$$
- $f(x+h), f(x-h)$
- Two symmetric samples on either side of $x$.
- $2h$
- The full stride between them.
- Error
- $\mathcal{O}(h^2)$ — second-order. The odd-power terms in the Taylor expansion cancel by symmetry, leaving only $h^2$ and higher.
Always prefer central. For the same cost (two evaluations), it gives you second-order accuracy instead of first. The forward difference is only preferable when you physically cannot step backwards — at the boundary of a domain, say.
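A quick side-by-side sketch (helper names are ours) showing first-order versus second-order convergence on $\sin$ at $x = 1$:

```python
import numpy as np

forward = lambda f, x, h: (f(x + h) - f(x)) / h
central = lambda f, x, h: (f(x + h) - f(x - h)) / (2 * h)

# d/dx sin(x) at x = 1; true answer is cos(1).
x, truth = 1.0, np.cos(1.0)
for h in [1e-1, 1e-2, 1e-3]:
    ef = abs(forward(np.sin, x, h) - truth)
    ec = abs(central(np.sin, x, h) - truth)
    print(f"h={h:.0e} forward={ef:.2e} central={ec:.2e}")
# Forward error shrinks 10x per decade of h, central 100x:
# first order versus second order, for the same two evaluations.
```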
The optimal step size — and why it's annoying
You would think the smaller the $h$, the better. In exact arithmetic, yes. In floating-point, no. Consider the central difference error as a sum of two things:
Finite-difference error balance
$$E(h) \;\approx\; \underbrace{\frac{h^2}{6}\,|f'''(x)|}_{\text{truncation}} \;+\; \underbrace{\frac{\varepsilon_{\text{mach}}\,|f(x)|}{h}}_{\text{roundoff}}$$
- Truncation term $\propto h^2$
- The error in pretending a Taylor expansion stops at the first-derivative term. Shrinks as you shrink $h$.
- Roundoff term $\propto 1/h$
- The cancellation error in computing $f(x+h) - f(x-h)$. When $h$ is tiny, $f(x+h)$ and $f(x-h)$ are nearly equal, so you lose precision in their difference. Then you divide by a tiny number, which amplifies the remaining error. Grows as you shrink $h$.
- $\varepsilon_{\text{mach}}$
- Machine epsilon — sets the scale of the roundoff in each function evaluation.
The optimum. Differentiating with respect to $h$ and setting to zero, the optimal step size is $h^* \sim \varepsilon_{\text{mach}}^{1/3}$, which for FP64 is about $10^{-5}$. Not $10^{-16}$. The minimum achievable error is $\sim \varepsilon_{\text{mach}}^{2/3} \approx 10^{-11}$ — you lose about 5 digits of accuracy to the finite-difference game, no matter how careful you are.
This is one of the reasons symbolic differentiation (for clean formulas) and automatic differentiation (for programs) dominate in machine learning. Autodiff applies the chain rule through the computation graph exactly, using only the floating-point operations you would do anyway — you get full-precision derivatives at essentially the cost of one extra forward pass. Finite differences are now mostly used for sanity-checking autodiff implementations (the gradcheck pattern) and for gradient-free optimization. See gradient descent for more on how autodiff plugs into training.
14. ODE integration
Given an initial value problem $y'(t) = f(t, y(t))$ with $y(t_0) = y_0$, you want to approximate $y(t)$ for $t > t_0$. This is the fundamental primitive of every physical simulation on earth. The basic idea: take small steps in $t$, updating $y$ using a local approximation of the slope.
Euler's method
The simplest: just use the current slope to step forward by $h$:
Forward Euler
$$y_{n+1} = y_n + h\, f(t_n, y_n)$$
- $y_n$
- Approximate value of $y$ at time $t_n = t_0 + nh$.
- $f(t_n, y_n)$
- The slope — the right-hand side of the ODE — evaluated at the current state.
- $h$
- The time step.
Intuition. Start at $(t_n, y_n)$, pretend the slope is constant over $[t_n, t_n+h]$, follow the tangent line. First-order accurate: global error $\mathcal{O}(h)$. Simple but often unstable — a step that is too big makes it explode, especially on stiff equations. You will see it in every textbook and almost never in production.
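The instability is easy to trigger. A minimal sketch on the stiff-ish test problem $y' = -50y$ (our choice of example), where forward Euler is only stable for $h < 2/50$:

```python
import numpy as np

def euler(f, t0, y0, h, steps):
    # Forward Euler: y_{n+1} = y_n + h * f(t_n, y_n)
    t, y = t0, y0
    for _ in range(steps):
        y = y + h * f(t, y)
        t = t + h
    return y

# y' = -50 y, y(0) = 1, so the true y(1) = e^{-50} ~ 2e-22.
f = lambda t, y: -50.0 * y
for h in [0.05, 0.01]:
    y1 = euler(f, 0.0, 1.0, h, int(round(1.0 / h)))
    print(f"h={h} y(1) approx {y1:.3e}")
# At h = 0.05 the per-step factor is 1 + h*(-50) = -1.5, so the iterate
# grows like (-1.5)^n and explodes. At h = 0.01 the factor is 0.5 and
# the solution decays toward zero, as the true solution does.
```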
Runge–Kutta (RK4)
A clever fourth-order scheme that evaluates the slope at four strategic points inside each step and averages them with magic coefficients:
Classical RK4
$$k_1 = f(t_n, y_n), \quad k_2 = f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_1\right), \quad k_3 = f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_2\right), \quad k_4 = f(t_n + h,\, y_n + h k_3)$$
$$y_{n+1} = y_n + \frac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)$$
- $k_1$
- Slope at the start of the step.
- $k_2$
- Slope at the midpoint, using $k_1$ to predict where we'd be.
- $k_3$
- Slope at the midpoint again, using the better estimate $k_2$.
- $k_4$
- Slope at the end of the step, using $k_3$.
- Weighted average
- Gives fourth-order accuracy — global error $\mathcal{O}(h^4)$.
Why it rules. RK4 gives you four orders of accuracy for four function evaluations per step. The combination of easy implementation and high accuracy made it the default ODE integrator for five decades. Modern production solvers (Dormand–Prince 5(4), which is what scipy.integrate.solve_ivp defaults to) use adaptive step sizes on top of the same principle, but the spirit is the same.
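A sketch of one RK4 step plus a fixed-step driver (the names `rk4_step` and `integrate` are ours), verifying the fourth-order rate on $y' = y$:

```python
import numpy as np

def rk4_step(f, t, y, h):
    # One classical RK4 step: four slope samples, weighted average.
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)
    k3 = f(t + h/2, y + h/2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

def integrate(f, t0, y0, t1, n):
    # Fixed-step RK4 from t0 to t1 in n steps.
    h = (t1 - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y

# y' = y, y(0) = 1; the true y(1) = e.
f = lambda t, y: y
for n in [10, 20, 40]:
    err = abs(integrate(f, 0.0, 1.0, 1.0, n) - np.e)
    print(f"n={n:3d} error={err:.2e}")
# Error drops roughly 16x per halving of h: fourth order, as advertised.
```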
15. Source code
Four canonical routines in NumPy. These are the "write them once from scratch so you understand them" versions — not replacements for scipy.optimize or numpy.linalg.
import numpy as np

def newton(f, fp, x0, tol=1e-12, max_iter=50):
    # Find a root of f using Newton's method.
    # f  : callable returning f(x)
    # fp : callable returning f'(x)
    # x0 : initial guess
    x = float(x0)
    for k in range(max_iter):
        fx = f(x)
        fpx = fp(x)
        if abs(fpx) < 1e-14:
            raise RuntimeError("derivative vanished — Newton cannot step")
        step = fx / fpx
        x = x - step
        if abs(step) < tol * max(1.0, abs(x)):
            return x, k + 1
    raise RuntimeError(f"Newton failed to converge in {max_iter} iterations")

# Solve x^3 - 2x - 5 = 0 (Wallis's classic test, root near 2.0946)
f = lambda x: x**3 - 2*x - 5
fp = lambda x: 3*x**2 - 2
root, iters = newton(f, fp, x0=2.0)
print(f"root = {root:.15f} in {iters} iterations")
# → root = 2.094551481542327 in 5 iterations (quadratic convergence)
import numpy as np

def trapezoidal(f, a, b, n):
    # Composite trapezoidal rule on [a, b] with n subintervals.
    # Error is O(h^2) for smooth f. For periodic smooth f over a
    # full period, error actually decays exponentially — trapezoidal
    # is the best tool you already know for Fourier-type integrals.
    x = np.linspace(a, b, n + 1)  # n+1 grid points
    y = f(x)
    h = (b - a) / n
    return h * (0.5*y[0] + y[1:-1].sum() + 0.5*y[-1])

# Integrate sin(x) over [0, pi] (true answer = 2)
for n in [10, 100, 1000]:
    approx = trapezoidal(np.sin, 0, np.pi, n)
    err = abs(approx - 2.0)
    print(f"n={n:5d} approx={approx:.10f} error={err:.2e}")
# Error drops by ~100x each time n grows 10x — that's O(h^2).
import numpy as np

def lu_pivot(A):
    # Return (P, L, U) such that P @ A = L @ U, with partial pivoting.
    # P is stored as a permutation vector for efficiency.
    A = A.astype(float).copy()
    n = A.shape[0]
    P = np.arange(n)
    L = np.eye(n)
    for k in range(n - 1):
        # Pick the pivot — the largest absolute value in column k below row k
        piv = k + np.argmax(np.abs(A[k:, k]))
        if piv != k:
            A[[k, piv]] = A[[piv, k]]  # swap rows in A
            P[[k, piv]] = P[[piv, k]]  # record the swap
            if k > 0:
                L[[k, piv], :k] = L[[piv, k], :k]
        # Eliminate below the pivot
        for i in range(k + 1, n):
            L[i, k] = A[i, k] / A[k, k]
            A[i, k:] -= L[i, k] * A[k, k:]
    U = np.triu(A)
    return P, L, U

def solve(A, b):
    P, L, U = lu_pivot(A)
    bp = b[P]
    # Forward substitution: L y = bp
    n = len(b)
    y = np.zeros_like(bp, dtype=float)
    for i in range(n):
        y[i] = bp[i] - L[i, :i] @ y[:i]
    # Back substitution: U x = y
    x = np.zeros_like(y)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

# Test on the tiny-pivot example from section 9
A = np.array([[1e-20, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
print(solve(A, b))  # → [1. 1.] — correct, because we pivoted
import numpy as np

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

# Approximate d/dx [sin(x)] at x = 1 (true answer = cos(1) ≈ 0.5403)
# Sweep h across many orders of magnitude — watch the classic U shape
# where truncation error dominates for big h and roundoff for small h.
x = 1.0
truth = np.cos(x)
print(f"{'h':>10} {'approx':>20} {'|error|':>14}")
for k in range(1, 17):
    h = 10.0 ** (-k)
    approx = central_diff(np.sin, x, h)
    err = abs(approx - truth)
    print(f"{h:10.0e} {approx:20.15f} {err:14.2e}")

# Expected output: error drops as h^2 from h=1e-1 down to about h=1e-5,
# then climbs again as roundoff takes over. The optimum sits near
# h ~ eps_mach^(1/3) ≈ 6e-6, with best error ~ eps_mach^(2/3) ≈ 4e-11.
# Five digits of precision lost to the finite-difference game, forever.
16. Connection to machine learning
Every piece of this page has a ghost running around inside modern ML. Spotting those ghosts makes a lot of otherwise-mysterious ML advice suddenly obvious.
- Gradient descent is a numerical method. Training a neural net is just numerically minimizing a function. The learning rate is a step size, and exactly the same truncation-versus-roundoff tradeoffs apply. Too big a step — you overshoot the minimum, or diverge outright like Euler with too large $h$. Too small — you waste compute on round-off-limited progress. See gradient descent and optimization.
- Batch size affects numerical stability. A gradient computed over a tiny batch is noisy, and summing noisy quantities has bigger relative error than summing smooth ones. Very small batches in FP16 can make gradients underflow entirely. This is one reason "gradient accumulation" exists — you accumulate multiple small-batch gradients into a higher-precision buffer before updating.
- Why BF16 became standard. LLM gradients span ~20 orders of magnitude during training. FP16's exponent range ($\sim 10^{\pm 5}$) is not enough — gradients underflow to zero and training stalls. BF16 keeps FP32's exponent range and spends its savings on mantissa. It is the "right" precision for this problem, and it took the industry until about 2019 to figure out that range beat precision.
- Why FP8 training is hard. With only 3–4 mantissa bits, machine epsilon is on the order of $10^{-1}$ — worse than percent-level. You cannot just cast the weights and hope. Real FP8 training uses per-tensor scaling factors, maintains master weights in higher precision, and carefully chooses which operations go in which format. It is a numerical-analysis problem in a trenchcoat.
- Quantization is floating-point in reverse. Post-training quantization takes a trained FP32 model and rounds its weights to INT8 (or INT4, or binary). The tradeoffs are the same as choosing a training precision — how much range do you need, how much resolution do you need, where does the noise matter. See edge AI for the engineering story.
- Autodiff replaces finite differences. We flogged this above, but it bears repeating. The whole reason backpropagation is not "just" finite differences is that finite differences cannot actually compute high-dimensional gradients accurately — and there is no $h$ that works for a million-parameter model simultaneously. Autodiff gives you exact (up to rounding) gradients at the cost of a constant-factor-more compute. It is the single most important numerical method in modern machine learning, and you should read multivariable calculus next if you want to see why.
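The gradcheck pattern mentioned above, sketched on a toy least-squares loss (all names here are illustrative, not from any library):

```python
import numpy as np

# The "gradcheck" pattern: verify a hand-derived analytic gradient
# against central differences at h near eps_mach^(1/3), where finite
# differences are at their most accurate.
def loss(w, x, y):
    # Toy least-squares loss, a stand-in for any differentiable model.
    r = x @ w - y
    return 0.5 * (r @ r)

def grad_analytic(w, x, y):
    return x.T @ (x @ w - y)  # hand-derived gradient of the loss

def grad_fd(f, w, h=1e-5):
    # Central difference in each coordinate, one at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
ga = grad_analytic(w, x, y)
gn = grad_fd(lambda w: loss(w, x, y), w)
rel = np.abs(ga - gn).max() / np.abs(ga).max()
print(f"max relative disagreement: {rel:.2e}")
# A tiny disagreement means the analytic gradient is almost certainly
# right; a large one means a bug in the derivation or the code.
```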
17. Summary
- Computers cannot represent most real numbers exactly. Every operation rounds to the nearest representable float, with relative error bounded by machine epsilon.
- IEEE 754 is the standard format. FP64 gets you 16 digits, FP32 gets you 7, BF16 gets you 3. ML uses less precision than scientific computing because range matters more than resolution for gradients.
- Catastrophic cancellation is the usual villain: subtract two near-equal numbers and you lose all the digits that agreed. Algebraic rewrites fix most cases.
- Conditioning is about the problem, stability is about the algorithm. Ill-conditioned problems give bad answers regardless of how clean your code is.
- Newton's method converges quadratically when it works; bisection always works but only linearly. Hybrids give you both.
- Gaussian elimination with partial pivoting is the backward-stable direct solver. For big sparse systems, switch to iterative methods — Jacobi, Gauss–Seidel, conjugate gradient.
- Runge's phenomenon: high-degree polynomial interpolation at equally spaced points explodes at the endpoints. Use Chebyshev nodes or splines instead.
- Trapezoidal is $\mathcal{O}(h^2)$, Simpson is $\mathcal{O}(h^4)$, Gaussian quadrature is spectrally accurate. Pick based on smoothness.
- Finite differences have an optimal step size around $\varepsilon_{\text{mach}}^{1/3}$ and a floor error around $\varepsilon_{\text{mach}}^{2/3}$. This is why autodiff dominates in ML.
- Euler explodes, RK4 works. Production ODE solvers are adaptive-step RK variants.
- Every numerical lesson on this page has a ghost inside every training run of every modern ML model.
Further reading
- Trefethen & Bau (1997) — Numerical Linear Algebra. The gold-standard text on matrix computations. Lectures 12–16 on conditioning and stability are worth every minute.
- Burden & Faires — Numerical Analysis. The standard undergraduate textbook. Dense, thorough, good for reference.
- Heath — Scientific Computing: An Introductory Survey. A one-volume tour that is more opinionated than Burden & Faires and more readable.
- Higham (2002) — Accuracy and Stability of Numerical Algorithms. The encyclopedic reference on floating-point error analysis. 700+ pages.
- Goldberg (1991) — What Every Computer Scientist Should Know About Floating-Point Arithmetic. The paper to read if you only have an afternoon.
- Kahan — How Futile are Mindless Assessments of Roundoff in Floating-Point Computation? The father of IEEE 754 being grumpy, instructively.
See also: multivariable calculus, algorithmic complexity, gradient descent, edge AI & quantization.