UPDATED Apr 11 2026 18:45

Mathematics — The Language of Quantitative Reasoning

A working tour of the seven math subjects that power physics, chemistry, engineering, finance, and AI. Differential equations for motion and fields, linear algebra for quantum states and structural analysis, probability for statistical mechanics and inference, optimization for every kind of design tradeoff. Hover any dotted term. Poke at the interactive figures. When a section feels too thin, click its Deep Dive link for the full lesson.

What math you actually need // big picture

Math is the shared language across quantitative fields. A physicist, a chemist, a structural engineer, a quant trader, and a machine-learning researcher all reach for the same seven subjects when the work gets serious. You don't need to remember everything from your last math class. You need a working command of those seven, because calculus describes motion and rates, linear algebra describes states and transformations, and probability describes uncertainty — and every quantitative field is built on some combination of the three.

The other four show up the moment the problem gets concrete. You can't do statistical mechanics or climate modelling without statistics, you can't reason about molecular graphs or network effects without discrete math, and you can't explain why a finite-element simulation diverges without numerical analysis. These tools are less glamorous than the big three, but they're the ones that stop a model from silently lying to you.

This page is a fast tour of all seven. Each section is a 3–5 paragraph summary with one interactive figure you can poke at, followed by a link to the full deep-dive page. Read the ones you need, skip the ones you don't, come back when the work demands another piece. What you don't need unless your specific field asks for it: abstract algebra past "groups exist," real analysis past "limits are a pain to formalize," Galois theory (unless you're doing cryptography or coding theory), category theory (unless you're writing Haskell), and most of olympiad-style problem-solving. Fun, but not on the critical path.

The 60-second summary

Calculus — instantaneous change. The chain rule runs every training loop, and differential equations describe every physical system that moves.
Multivariable calculus — change in many directions at once. Gradients, Jacobians, Hessians. The mechanics of fields, optimizers, and thermodynamic potentials.
Linear algebra — the universal language of states and transformations. Vectors, matrices, eigenvalues, SVD. Quantum states, rigid-body motion, and neural-network weights all live here.
Probability — reasoning under uncertainty. Random variables, Bayes, entropy. The backbone of statistical mechanics, insurance, and inference.
Statistics — from a pile of data to a defensible claim. Estimation, testing, regression. Used by everyone from epidemiologists to particle physicists.
Discrete mathematics — the grammar of structure. Sets, logic, counting, graphs. Shows up in chemistry (molecular graphs), logistics (scheduling), and CS.
Numerical analysis — what a real computer does when asked to do math it cannot do exactly. Finite-element methods, climate simulations, and orbital integrators all depend on it.

What quantitative work actually asks of you

  • Derivatives. Rates of change describe motion in physics, reaction rates in chemistry, marginal cost in economics, and loss in machine learning. All the same calculus.
  • Matrix shapes. A matrix is a linear map from one space to another. The same object describes a rotation in robotics, a Hamiltonian in quantum mechanics, an adjacency structure in a chemical graph, and a weight layer in a neural network. If you know what a matrix means, you can read any of them.
  • Distributions that sum to one. Classifiers, polling models, particle distributions in statistical mechanics, and option-pricing models all output probabilities. Softmax, cross-entropy, and KL divergence are the same idea in three costumes.
  • Sample size intuition. A clinical trial with 50 patients and a benchmark with 50 prompts both have very wide confidence intervals. Statistics tells you how wide.
  • Floating-point sanity. Climate simulations that diverge, finite-element stresses that explode, and training losses that go to NaN are all numerical-analysis problems in disguise.

How to use this track // four goal-driven paths

The seven subjects interlock. You can read them in any order, but if you have a specific goal in mind — modelling a physical system, reading a finance paper, passing a chemistry exam, following a machine-learning paper — here are the shortest paths through. Each one points at sections on this page plus the deep-dive pages that flesh them out.

"I want to understand backpropagation."

calculus → multivar → linalg

Read Calculus for the chain rule. Then Multivariable Calculus for gradients and the multivariable chain rule. Then Linear Algebra so the matrix-valued chain rule stops looking like noise. Finish at Backpropagation →.

popular path

"I want transformers to click."

linalg → probability

Start with Linear Algebra — dot products, projections, matrix multiplication as composition. Then Probability for softmax and the information theory that runs the loss. Payoff: Self-Attention →.

popular path

"I want to derive gradient descent."

multivar → numerical → cs/optimization

Multivariable Calculus first (gradients point uphill). Then Numerical Analysis for step size, conditioning, and convergence. Cross-track finish at CS: Optimization.


"I want Bayesian inference to click."

probability → statistics

Probability up through Bayes' theorem, then Statistics for MLE and the Bayesian-vs-frequentist split. Payoff: you will never again confuse P(A|B) with P(B|A) in a meeting.

Rule of thumb

Read the overview, then any one subject whose demo you find interesting. Do the demo. Click the Deep Dive link. Come back to this page when you need the next piece. Trying to read all seven sections in one sitting is a way of reading nothing.

Calculus // single-variable foundation

Two ancient questions — "how fast is this changing right now?" and "what's the area under this curve?" — turn out to be the same question, and there's a mechanical procedure for solving either one.

Calculus starts with a problem that sounds nonsensical: if your speedometer reads 37 mph, what does that number mean at an instant? You weren't going 37 mph for an hour. You were at 37 mph for no time at all, which makes "distance over time" a division by zero. The trick, invented independently by Newton and Leibniz in the late 1600s, is to form that ratio over a tiny interval and then ask what it approaches as the interval shrinks. That approaching value is called a limit, and it is the single foundational idea in all of calculus.

Derivatives — instantaneous rates of change

Take the limit of the average rate of change, and you get the derivative, written f'(x) or df/dx. Geometrically: the slope of the tangent line to the graph of f at the point x. Mechanically: a new function that tells you, at every input, how fast the original function is changing. For f(x) = x², the derivative is f'(x) = 2x, which you can read as "at x = 3, the curve is climbing with slope 6." The four or five rules you memorize in a semester (power rule, product rule, quotient rule, chain rule) are really just the consequences of that one definition plus algebra.

The only rule that matters for machine learning is the chain rule: if h(x) = f(g(x)), then h'(x) = f'(g(x)) · g'(x). In words: to differentiate a composition, differentiate the outer function with the inner one plugged in, times the derivative of the inner. Every neural network is a composition of simple functions, one per layer. Training one means computing dLoss/dWeight for every weight, which means applying the chain rule thousands of times. That's what backpropagation is — the chain rule, bookkept cleverly.
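
Everything in that paragraph fits in a dozen lines. Below is a minimal sketch (the functions and the test point are invented for the example) that applies the chain rule to sin(x²) and checks it against a finite difference:

```python
import math

def g(x):          # inner function
    return x ** 2

def f(u):          # outer function
    return math.sin(u)

def h(x):          # the composition h(x) = f(g(x)) = sin(x^2)
    return f(g(x))

def h_prime(x):
    # chain rule: f'(g(x)) * g'(x) = cos(x^2) * 2x
    return math.cos(g(x)) * 2 * x

# Sanity check against a centered finite difference
x, eps = 1.3, 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
print(h_prime(x), numeric)   # nearly identical
```

Autograd libraries do exactly this bookkeeping for you, but mechanically, not by finite differences.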

Integrals — accumulated totals

The other half of calculus is the opposite move: given a rate of change, recover the total. If you know your speed at every instant, you can recover the total distance travelled by "adding up" speed times tiny time slices. That's an integral, written ∫ f(x) dx with an elongated S (Leibniz chose it as the initial of summa) and read "the integral of f of x, dx."

The Fundamental Theorem of Calculus is the payoff: differentiation and integration are inverses of each other, the way addition and subtraction are inverses. If you can find an antiderivative — a function whose derivative is f — you can evaluate ∫ₐᵇ f(x) dx exactly by subtracting the antiderivative's values at the endpoints. Half of calculus class is learning tricks to find antiderivatives (substitution, integration by parts, partial fractions). The good news is that numerical methods have mostly made those tricks obsolete for working programmers. See Numerical Analysis for how a computer actually does integrals.

Taylor series — polynomial approximation of everything

The last big idea in single-variable calculus: near a point, almost any smooth function looks like a polynomial. Specifically, f(x) ≈ f(a) + f'(a)(x−a) + ½ f''(a)(x−a)² + …. This is the Taylor series, and it's the reason why calculators can compute sin(x), why e^x has a nice definition, and why every optimization method you'll meet — Newton's method, quasi-Newton, Gauss-Newton — starts by replacing a complicated function with its first- or second-order Taylor approximation and solving that instead.
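
A quick way to feel this: compare sin(x) with its truncated Taylor series at 0. A sketch (the number of terms is an arbitrary choice):

```python
import math

def sin_taylor(x, terms=5):
    """Taylor series of sin at 0: x - x^3/3! + x^5/5! - ..."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

for x in (0.1, 1.0, 2.0):
    print(x, sin_taylor(x), math.sin(x))
# Near 0 the match is essentially exact; farther out you need more terms.
```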

Rule of thumb

If you can explain why d/dx[sin(x²)] = cos(x²) · 2x out loud using the words "outer function" and "inner function," you understand the chain rule well enough to read any ML paper written this decade.

▸ Interactive derivative — pick f(x) = x², x³, sin(x), or eˣ, drag the tangent point, and step through the limit definition of the derivative at that point.

A sibling topic that belongs here but pays off later: limits of sequences and series. Knowing when an infinite sum converges matters any time you see an algorithm that iterates "until close enough." See Numerical Analysis for the practical side.

Deep Dive — full lesson with limits, proofs, integration tricks, and Taylor series →

Multivariable Calculus // gradients, fields, and surfaces

Once a function has more than one input, "slope" splits into a whole vector of slopes — one per input. That vector is the gradient, and it shows up everywhere from electromagnetism to thermodynamics to the core of every machine-learning optimizer.

A machine learning model is a function with a huge number of inputs: every weight is an input, and the output is a single number (the loss). For a frontier-scale model that's on the order of a trillion inputs and one output. Single-variable calculus handles one-in, one-out. For a trillion-in, one-out you need multivariable calculus — which mostly means learning the right vocabulary for "derivatives in many directions at once."

Partial derivatives and the gradient

The simplest generalization is the partial derivative, written ∂f/∂xᵢ. You freeze every input except the i-th one and take a normal derivative. Do this for every input and you get a vector of partials called the gradient, written ∇f (read "grad f"). The gradient has a beautiful geometric meaning: at any point, it points in the direction of steepest ascent of the function, and its length is how fast the function is climbing in that direction. This is the single fact behind Gradient Descent → — to minimize a loss, step in the direction opposite the gradient.

∇f(x) = [ ∂f/∂x₁ , ∂f/∂x₂ , … , ∂f/∂xₙ ]

The gradient vector

∇f
"Grad f." The gradient of the function f at the point x. It's a vector with the same shape as the input.
∂f/∂xᵢ
Partial derivative of f with respect to the i-th input. "If I nudge only xᵢ up by a tiny amount, how fast does f change?"
x
The point at which we're evaluating the gradient — itself a vector of n numbers.
n
The number of inputs. For a neural network, this is the total number of trainable weights.

Analogy. You're standing on a hill in fog. Kneel down and feel the slope with your hand in every direction. The compass bearing where the slope is steepest uphill is the gradient's direction. The steepness itself is the gradient's length. To minimize, walk the opposite way.
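
That analogy, minus the fog, is a few lines of code. A sketch with an invented bowl-shaped f and an arbitrary step size:

```python
def grad_f(x, y):
    # gradient of f(x, y) = x^2 + 2*y^2, a bowl with its minimum at (0, 0)
    return (2 * x, 4 * y)

x, y, lr = 3.0, -2.0, 0.1   # start point and step size (learning rate)
for _ in range(100):
    gx, gy = grad_f(x, y)
    x, y = x - lr * gx, y - lr * gy   # step opposite the gradient

print(x, y)   # both essentially 0: we walked downhill into the minimum
```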

Jacobians — many inputs AND many outputs

If the output is also a vector (say, a layer in a neural net maps a 768-dim vector to another 768-dim vector), then "derivative" becomes a whole matrix — the Jacobian, written J. Entry J[i, j] is ∂(output i)/∂(input j). The Jacobian is the linear map that best approximates the function near a point, and the multivariable chain rule — "the Jacobian of a composition is the product of the Jacobians" — is literally how backprop propagates gradients backward through a network. Each layer contributes a Jacobian; the final gradient is the product of all of them.
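
For linear maps the Jacobian is the matrix itself, which makes the product rule easy to check with NumPy (the two toy layers below are made up):

```python
import numpy as np

A = np.array([[1.0, 2.0], [0.0, 1.0]])   # layer 1: x -> A @ x
B = np.array([[0.5, 0.0], [1.0, 3.0]])   # layer 2: h -> B @ h

def net(x):
    return B @ (A @ x)   # composition: first A, then B

# Jacobian of the composition = product of the Jacobians = B @ A
J = B @ A

# Finite-difference check of one column of J
x, eps = np.array([1.0, -1.0]), 1e-6
col0 = (net(x + np.array([eps, 0.0])) - net(x)) / eps
print(J[:, 0], col0)   # match
```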

Hessians — curvature

Take the derivative of the gradient and you get the Hessian, a square matrix of second partial derivatives. It describes curvature: whether you're in a bowl (positive definite Hessian, local minimum), at a saddle (indefinite Hessian), or on a dome (negative definite, local maximum). Second-order optimization methods like Newton's method use the Hessian directly. First-order methods like plain SGD don't, which is why they sometimes get stuck on saddle points that a second-order method would walk straight off of.

You do not need to compute a trillion-by-trillion Hessian by hand. You need to know what one is so you can read a paper that says "the loss landscape has poor conditioning near the minimum" without flinching.

▸ 2-D gradient field — pick a surface (x² + y², x² − y², Rosenbrock, sin(x)·cos(y)) and click anywhere. Arrows show the gradient direction at each sampled point; Step takes one gradient-descent step from the current point.

Deep Dive — partials, gradients, Jacobians, Hessians, Lagrange multipliers →  ·  Backpropagation →

Linear Algebra // the language of states and transformations

Quantum mechanics, structural engineering, and modern ML papers are all written in linear algebra. Get comfortable reading it and the notation stops being a wall.

Linear algebra begins from a simple observation: most of the operations you want to do on lists of numbers — rotations, projections, weighted sums, changes of coordinates — can be written using two primitive operations (adding vectors and multiplying by a scalar) combined into a third (multiplying by a matrix). Once you accept that framing, every "transformer attention" diagram, every "embedding lookup," every "projection into residual stream" becomes a well-defined piece of linear algebra instead of a vague metaphor.

Vectors — lists with geometry

A vector is an ordered list of numbers. It represents a point in n-dimensional space, or equivalently an arrow from the origin to that point. The two things you can do to vectors are: add them (tip-to-tail), and scale them (stretch or shrink). Vectors with the same number of entries live in a common vector space, and everything else in linear algebra is a consequence of those two operations.

The most important operation on two vectors is the dot product: multiply corresponding entries and sum. The dot product has a geometric interpretation that is secretly the whole reason linear algebra is useful in ML: a · b = ∥a∥ ∥b∥ cos(θ), where θ is the angle between the two vectors. If two vectors point in the same direction, their dot product is big and positive. If they point perpendicular, it's zero. If they point opposite, it's big and negative. Every attention score, every similarity metric, every "how much does this query match this key" is a dot product.
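
A minimal sketch of the dot product as a similarity score (the vectors are invented for the example):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """cos(theta) = a.b / (|a||b|): +1 same direction, 0 perpendicular, -1 opposite."""
    norm = lambda v: math.sqrt(dot(v, v))
    return dot(a, b) / (norm(a) * norm(b))

print(cosine([1, 0], [2, 0]))    # 1.0  (same direction)
print(cosine([1, 0], [0, 5]))    # 0.0  (perpendicular)
print(cosine([1, 0], [-3, 0]))   # -1.0 (opposite)
```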

Matrices — functions in disguise

A matrix is a grid of numbers, but the useful thing about it is that it represents a linear function from one vector space to another: it takes a vector in and spits out another vector. Matrix multiplication is just function composition: if M turns a vector into another vector, and N does the same, then NM is "first do M, then do N." This is why matrix multiplication is not commutative (first hat, then pants ≠ first pants, then hat), and why the shapes have to line up.
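
The hat-and-pants point is checkable in a few lines (the rotation and shear here are arbitrary picks):

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],   # rotate 45 degrees
              [np.sin(theta),  np.cos(theta)]])
S = np.array([[1.0, 1.0],                        # shear along x
              [0.0, 1.0]])

v = np.array([1.0, 0.0])
print(S @ (R @ v))               # first rotate, then shear
print(R @ (S @ v))               # first shear, then rotate: different!
print(np.allclose(S @ R, R @ S)) # False: composition order matters
```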

Every layer of a neural network is, before the activation function, a matrix multiplication. The weights of the layer are the matrix. The activations going in are the input vector. The pre-activations coming out are the output vector. A modern transformer is, at the level of linear algebra, "a few dozen of these stacked, with some nonlinearities and attention sprinkled in between."

Eigenvalues — directions the transform leaves alone

Most vectors, when you hit them with a matrix, get rotated and rescaled into some new direction. But some special vectors get only rescaled — they come out pointing the same way, just stretched or shrunk. Those are eigenvectors, and the scale factor is the eigenvalue. "Eigen" is German for "own," as in "its own direction." If you can find a full set of eigenvectors, you can diagonalize the matrix and its iterated behavior becomes trivial to understand. PageRank is an eigenvector. The stationary distribution of a Markov chain is an eigenvector. The principal components of a dataset are eigenvectors of the covariance matrix. It keeps coming up.

SVD and PCA — the everything-decomposition

The Singular Value Decomposition generalizes eigenvalues to non-square matrices and is the most powerful tool in all of applied linear algebra. Every matrix M, no matter how weird, factors as M = U Σ Vᵀ, where U and V are orthogonal (rotation-like) matrices and Σ is a diagonal matrix of singular values — non-negative numbers that measure how much M stretches each direction. Truncate Σ to the top k singular values and you get the best rank-k approximation of M, which is the mathematical engine inside PCA, latent semantic analysis, and every low-rank adaptation scheme (LoRA, for example). If you understand SVD, you understand compression, dimensionality reduction, and half of "solving Ax = b when A is badly conditioned."
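
The truncation move is a few lines of NumPy (the random matrix and k = 2 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Best rank-2 approximation: keep only the top 2 singular values
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.linalg.matrix_rank(M_k))            # 2
# Eckart-Young: the spectral-norm error equals the first dropped singular value
print(np.linalg.norm(M - M_k, ord=2), s[k])  # equal
```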

▸ 2×2 matrix transform — edit the entries of M or pick a preset (rotate 45°, scale 2×, shear, reflect) to see where M sends the unit square. det(M) is the signed area ratio; negative means M flipped orientation.

Linear algebra is also the tool you'll reach for when you finally stop being afraid of the word "manifold" and start reading about Embeddings → and Self-Attention →.

Deep Dive — vectors, matrices, eigenvalues, SVD, PCA, full code →

Probability // from dice to Bayes

Probability is how you do math on uncertainty. Statistical mechanics, quantum measurement, clinical trials, and machine-learning loss functions all speak it.

Start with the simplest object: a random variable, which is just "a number whose value depends on chance." A fair die roll is a random variable with values 1..6. A user's click-through is a random variable with values 0 or 1. A model's predicted logit for "cat" is a random variable because the input image might be noisy. Probability assigns weights to the possible values — the distribution — and everything else follows.

Random variables, distributions, expectation

Distributions split into discrete (finitely many or countably many values) and continuous (a whole interval of values). Discrete ones you describe with a probability mass function that assigns a probability to each value. Continuous ones you describe with a probability density, which is not a probability — it's a number you integrate over a region to get a probability. This distinction trips everyone up on first meeting and is the reason log p(x) in a diffusion-model paper can be positive: densities can be bigger than 1.

The most important summary of a distribution is its expected value, written E[X] — the average value you'd see if you drew from the distribution many times. The expectation is the probabilistic version of "center of mass," and most loss functions in ML are expectations of something.

Bayes' theorem — updating belief

The single most important formula in probability, written in one line: P(A | B) = P(B | A) · P(A) / P(B). Read it out loud: the probability of A given that B happened is the probability of B given A, scaled by how likely A was to begin with, normalized by how likely B was overall. Bayes' theorem is how you flip a conditional probability. It's also the foundation of every spam filter, medical-test interpretation, and Bayesian neural network. Most people's intuition fails at Bayesian reasoning on the first try — the classic "a 99% accurate medical test comes back positive; what's the chance you actually have the disease?" is almost always answered incorrectly by people who have never seen the calculation.

P(A | B)   =   P(B | A) · P(A)   /   P(B)

Bayes' theorem

P(A | B)
The posterior. The probability of A after we've learned that B happened. This is usually what we want to know.
P(B | A)
The likelihood. If A is really the case, how likely was it that we'd have seen B?
P(A)
The prior. How likely we thought A was before we saw any evidence.
P(B)
The evidence or marginal. How likely B is overall, averaging over whether A is true or not. Often computed as P(B|A)P(A) + P(B|¬A)P(¬A).

Analogy. Imagine 10,000 people. One percent (100) have the disease. A test with 99% sensitivity and 99% specificity flags 99 of the sick and 99 of the healthy. If your test is positive, you're one of 198 people — and only 99 of them are sick. Your posterior is 50%, not 99%.
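
The same arithmetic as the analogy, as code:

```python
def posterior(prior, sensitivity, specificity):
    """P(sick | positive test) via Bayes' theorem."""
    p_pos_given_sick = sensitivity            # likelihood
    p_pos_given_well = 1 - specificity        # false-positive rate
    evidence = p_pos_given_sick * prior + p_pos_given_well * (1 - prior)
    return p_pos_given_sick * prior / evidence

# 1% prevalence, 99% sensitivity, 99% specificity
print(posterior(0.01, 0.99, 0.99))   # 0.5: a positive test is a coin flip
```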

Entropy, cross-entropy, KL

Entropy, H(p) = −Σ p(x) log p(x), measures how uncertain a distribution is. A fair coin has maximum entropy (one bit). A loaded coin has less. Cross-entropy between two distributions — the actual loss function used to train every language model ever — measures how surprised you'd be if you used distribution q to encode samples that really came from p. KL divergence, D_KL(p || q) = Σ p log(p/q), is cross-entropy minus entropy and measures "how far apart" two distributions are. It shows up in variational inference, in diffusion model losses, and in the RL objective PPO minimizes. All three are the same idea in different clothes.
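
The "cross-entropy = entropy + KL" bookkeeping is easy to verify directly (the two coin distributions are invented for the example):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # fair coin: maximum entropy, 1 bit
q = [0.9, 0.1]   # a loaded model of the same coin

print(entropy(p))                          # 1.0
print(cross_entropy(p, q))                 # > 1: q is a bad code for p
print(cross_entropy(p, q) - entropy(p))    # equals kl(p, q)
```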

▸ Bayes updater — a population of 10,000 dots (blue sick, grey well) with presets for COVID-like, rare, and common diseases. Step through the Bayes calculation one term at a time to see who tested positive and what P(sick | +) comes out to.

Deep Dive — axioms, random variables, Bayes, all the distributions →  ·  Diffusion Math →

Statistics // from data to claims

Probability asks "given this model, what data might we see?" Statistics asks the backwards question: "given this data, which model should we believe?"

Statistics is the art of turning a pile of observations into a defensible claim. There are two philosophical camps. The frequentist camp treats parameters as fixed unknown constants and asks questions like "over many repetitions of the experiment, how often would my procedure get the wrong answer?" The Bayesian camp treats parameters as random variables with priors and uses Bayes' theorem to update those priors given data. The two camps mostly agree on what to do when the data is large, and mostly disagree when it's small. Modern ML borrows tools from both.

Estimation and MLE

Suppose you have a bag of coin flips and you want to estimate the coin's bias. The maximum likelihood estimator (MLE) picks the bias value that makes the observed flips most probable. For a coin, the MLE is just "fraction of heads," which is a relief. But the recipe generalizes: write down the likelihood of your data as a function of the parameters, differentiate with respect to the parameters, set to zero, solve. Nearly every loss function in ML is, under the hood, a negative log likelihood. Training a model by gradient descent is running MLE, one tiny step at a time.
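
You can watch "fraction of heads" win a grid search over candidate biases (the flips are invented):

```python
import math

flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 7 heads in 10 flips

def log_likelihood(theta, flips):
    """log P(flips | bias theta) for independent Bernoulli flips."""
    return sum(math.log(theta) if f else math.log(1 - theta) for f in flips)

# Scan candidate biases; the maximizer is the MLE
candidates = [i / 100 for i in range(1, 100)]
mle = max(candidates, key=lambda t: log_likelihood(t, flips))
print(mle, sum(flips) / len(flips))   # both 0.7
```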

Hypothesis testing and confidence

A hypothesis test asks: "is the pattern I see in the data real, or did I just get unlucky?" The classical recipe — pick a null, compute a p-value, compare to 0.05 — is widely misused but it exists for a reason: it's a disciplined way to say "I can't rule out chance." Confidence intervals are the same idea rotated: instead of yes/no, they give you a range of parameter values the data is consistent with. If your model's accuracy is 72% ± 5%, that interval is a confidence interval, and it tells you whether to trust that 72%.

Linear regression and its children

Linear regression — fitting a straight line to data — is the oldest statistical model and still one of the most useful. It has a closed-form solution that falls out of linear algebra ((XᵀX)⁻¹ Xᵀ y, the "normal equations"), and it is the mathematical ancestor of almost every ML model. Logistic regression adds a sigmoid to the output. A neural network is a stack of nonlinear regressions. XGBoost is hundreds of tiny regressions ensembled. If you understand linear regression, bias-variance, and the geometry of "projecting onto the column space of X," you have a real foothold.
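
The normal equations are short enough to check against NumPy's least-squares solver (the toy data is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=50)   # a line plus noise

X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept + slope

# Normal equations: beta = (X^T X)^{-1} X^T y (solve, don't invert)
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
# Library least-squares solver (more numerically stable in general)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_ne, beta_ls)   # same answer: intercept near 1.0, slope near 2.5
```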

▸ Gaussian MLE — adjust μ and σ to maximize the log-likelihood of a fixed list of 8 samples. Peak density under the samples is the MLE; a deliberately bad fit is provided for comparison.

Deep Dive — sampling, MLE, hypothesis tests, regression, Bayesian vs frequentist →

Discrete Mathematics // the grammar of structure

Sets, logic, proofs, counting, graphs. The subject that teaches you to reason about structure, whether it's a molecule, a supply chain, or an algorithm.

Continuous math — calculus, probability, linear algebra — is about quantities that can slide smoothly. Discrete math is about things you can count: sets, permutations, logical statements, graph vertices. It's where computer science gets its backbone. Every proof of algorithm correctness, every induction argument, every complexity analysis, and every crypto protocol lives in discrete-math territory.

Logic and proof by induction

Start with propositional logic — AND, OR, NOT, IMPLIES — and the observation that "truth tables" mechanically verify whether a compound statement is a tautology. Add quantifiers (∀ "for all" and ∃ "there exists") and you have predicate logic, the language every math paper is secretly written in. The most useful proof technique is mathematical induction: prove a statement for n = 0 or 1, then prove that "if it holds for n, it holds for n+1." Bingo, it holds forever. Induction is how you prove that quicksort sorts, that a recursive function terminates, and that every tree with n nodes has n−1 edges.

Combinatorics — counting without listing

Combinatorics is the art of counting configurations without enumerating them. The core formulas — permutations n!, combinations C(n, k) = n! / (k!(n−k)!), the inclusion-exclusion principle, pigeonhole — look like children's math, but they underlie every probability calculation, every hash-collision analysis, and every "how many ways can we partition..." question. If you're calibrating a load-balancer or computing an expected number of collisions, you're doing combinatorics whether you call it that or not.
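
The hash-collision remark, in code: the classic birthday calculation, plus one C(n, k) example (the numbers are the textbook ones):

```python
import math

def p_no_collision(n_items, n_slots):
    """Probability that n_items hashed uniformly into n_slots all land apart."""
    p = 1.0
    for i in range(n_items):
        p *= (n_slots - i) / n_slots
    return p

# Classic birthday numbers: 23 people, 365 days
print(1 - p_no_collision(23, 365))   # ~0.507: a shared birthday is already likely

# Counting without listing: C(52, 5) five-card poker hands
print(math.comb(52, 5))              # 2598960
```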

Graph theory — dots and lines

A graph is a set of vertices connected by edges. Graphs are how you model social networks, road maps, dependency trees, circuit diagrams, Markov chains, web links, and neural networks. The algorithms you meet first — BFS, DFS, shortest path, minimum spanning tree, topological sort — are the bread and butter of CS, and their correctness proofs use exactly the induction and counting arguments from above. Graph theory also sneaks into ML: attention heads define an implicit graph over tokens, graph neural networks are a thing, and the connectivity of a neural network's computation graph determines how backprop proceeds.
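
BFS itself is about ten lines (the graph below is a made-up example):

```python
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def bfs(graph, start):
    """Visit vertices in breadth-first order: nearest first."""
    visited = [start]
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()   # queue: FIFO (a stack here would give DFS)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                frontier.append(neighbor)
    return visited

print(bfs(graph, "A"))   # ['A', 'B', 'C', 'D', 'E']
```

Swapping the queue for a stack (pop from the same end you push) turns this into DFS, which is exactly the toggle in the demo below.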

Finally there's a thin slice of number theory — modular arithmetic, primes, Fermat's little theorem, the extended Euclidean algorithm — that was niche until public-key cryptography made it essential. If you ever need to understand why RSA works, you need number theory.

▸ BFS vs DFS — step through a small graph from start node A, watching the frontier (blue) and visited (grey) sets evolve. The queue (BFS) or stack (DFS) is shown beneath the graph.

Deep Dive — sets, logic, proofs, combinatorics, graphs, number theory →  ·  CS: Algorithms →

Numerical Analysis // when exactness breaks

Real computers don't do real math. They do floating-point math, which is close enough most of the time — until it isn't.

Every other subject on this page assumes you can compute exact real numbers. Real computers can't. They use floating-point numbers, which store a sign, a mantissa, and an exponent. The result is a finite approximation to the reals that is extremely good most of the time and catastrophically wrong some of the time. Numerical analysis is the subject that teaches you which is which.

Floating point and catastrophic cancellation

A 64-bit double gives you about 15–16 decimal digits of precision. That sounds like a lot, but two disasters are waiting. First, catastrophic cancellation: if you subtract two numbers that are almost equal, nearly all the significant digits cancel and you're left with garbage. The classic example is computing sin(x) − sin(x + ε) directly versus using a trig identity — the identity version keeps its precision, while the direct subtraction can lose nearly every significant digit. Second, overflow and underflow: multiply too many small probabilities together and you get zero; exponentiate one big logit and you get inf. Both are why ML libraries compute log_softmax with a stabilization trick and why losses in mixed precision sometimes NaN out of nowhere.
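
Both disasters fit in a few lines; the max-subtraction stabilization shown for softmax is the standard trick, and the numbers are chosen to break the naive versions:

```python
import math

# Underflow: multiplying many small probabilities
probs = [1e-5] * 100
naive = 1.0
for p in probs:
    naive *= p
print(naive)                              # 0.0: silently underflowed
print(sum(math.log(p) for p in probs))    # fine in log space: -1151.29...

# Overflow: math.exp(1000.0) raises OverflowError in Python,
# so a stable softmax subtracts the max logit first
logits = [1000.0, 1001.0, 1002.0]
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
softmax = [e / total for e in exps]
print(softmax)   # well-behaved, sums to 1
```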

Iterative solvers and Newton's method

If you can't solve an equation exactly, you solve it approximately by iteration. Newton's method is the cleanest example: to find a root of f(x) = 0, start with a guess x₀, pretend f is linear (use its tangent line — see derivatives), solve the linear approximation exactly, and call that your next guess. Repeat. When it works, Newton's method converges quadratically — the number of correct digits doubles each step. When it doesn't, it diverges wildly. Knowing the difference is what "numerical" in "numerical analysis" really means.
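
A sketch of Newton's method for √a, assuming a sane starting guess (the iteration count is arbitrary):

```python
def newton_sqrt(a, x0, steps=8):
    """Find sqrt(a) as the root of f(x) = x^2 - a."""
    x = x0
    for _ in range(steps):
        x = x - (x * x - a) / (2 * x)   # x - f(x)/f'(x)
    return x

print(newton_sqrt(2.0, 3.0))   # 1.41421356...
# Quadratic convergence: the number of correct digits roughly doubles per step.
```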

Quadrature and conditioning

Quadrature is the numerical side of integration: approximate ∫ f(x) dx by evaluating f at a few cleverly-chosen points and summing with weights. The trapezoid rule and Simpson's rule are the names you'll see first; Gaussian quadrature is the fancy one that nails smooth integrands to near machine precision with a handful of evaluations. Conditioning, finally, is a measure of how sensitive a problem is to small input errors. A well-conditioned matrix can be inverted reliably; an ill-conditioned one will amplify the tiniest floating-point error into a useless answer. Every "why is my training unstable" story eventually involves conditioning.
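
Trapezoid vs Simpson on an integral with a known answer, ∫₀^π sin(x) dx = 2 (the grid size is an arbitrary choice):

```python
import math

def trapezoid(f, a, b, n):
    h = (b - a) / n
    return h * (f(a) / 2 + sum(f(a + i * h) for i in range(1, n)) + f(b) / 2)

def simpson(f, a, b, n):   # n must be even
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4 * sum(f(a + i * h) for i in range(1, n, 2))   # odd points
    s += 2 * sum(f(a + i * h) for i in range(2, n, 2))   # even points
    return s * h / 3

exact = 2.0   # integral of sin from 0 to pi
print(abs(trapezoid(math.sin, 0, math.pi, 20) - exact))   # ~4e-3
print(abs(simpson(math.sin, 0, math.pi, 20) - exact))     # ~7e-6: same work, far better
```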

▸ Newton's method — computing √a via f(x) = x² − a, with presets for √2 from 3, √9 from 4, and a deliberately bad start.

xₙ₊₁ = xₙ − f(xₙ) / f'(xₙ)   =   xₙ − (xₙ² − a) / (2xₙ)   =   ½ ( xₙ + a / xₙ )

That last form is the ancient "Babylonian method" for square roots — Newton's method rediscovers it.
Trap

Never test floating-point numbers for equality. a == b is a bug waiting to fire. Use |a − b| < ε with a tolerance appropriate to your scale. This is the single most common numerical mistake in production code.
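
In Python the standard fix is math.isclose (the explicit tolerances below are illustrative):

```python
import math

a = 0.1 + 0.2
b = 0.3

print(a == b)              # False! a is actually 0.30000000000000004
print(math.isclose(a, b))  # True: default relative tolerance of 1e-9
print(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12))  # explicit tolerances
```

Note that a pure relative tolerance misbehaves when comparing against 0, which is why abs_tol exists.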

Deep Dive — floating point, error analysis, root-finding, quadrature, ODEs →

Where to go next // exits

If you read this page top to bottom, you now have a working map of the math under modern AI, ML, and CS. Each subject has a deep-dive page waiting, each with worked examples, proofs, and code. From there the natural next moves are:

  • Into AI/ML. The AI/ML landscape — the track that this math supports. Start with Backpropagation → if you want to see the calculus pay off.
  • Into CS. The CS track — algorithms, complexity, optimization. CS: Algorithms → builds directly on the discrete math section above.
  • Sideways into a specific subject. Pick the deep-dive link at the end of any section and spend an hour on it. Come back and tackle another when it's settled.

Math rewards patience. Nothing on this page has to click on the first read. The goal is that when you meet these words in the wild, they no longer look like a foreign language.