Probability
Probability is the mathematics of uncertainty. It gives you a language for "how much should I believe this?" and a ruleset for updating that belief when new evidence arrives. By the end of this page you should be able to read any modern ML paper that talks about distributions, losses, or sampling without flinching.
This is a long page. It is structured so you can stop at any section and still have something usable. Sections 1–5 are the absolute core (sample spaces, axioms, conditional probability, Bayes). Sections 6–9 add the vocabulary of random variables. Sections 10–14 are what you need to read ML papers. The rest is worked examples and code.
1. Why probability
You already reason probabilistically all day. When you check the weather, decide whether to bring an umbrella, estimate how long a commute will take, or guess why your build is failing — you are doing informal inference under uncertainty. Probability is the formal version of that reasoning. It is the language in which you can say "how sure am I?" without waving your hands.
Three things probability is especially good at:
- Describing uncertainty. Not every quantity you care about has a single true value you can read off. Measurements are noisy, data is incomplete, and the future hasn't happened yet. Probability gives you a way to hold a whole range of possibilities at once and weigh them by plausibility.
- Making decisions. Once you can quantify "how likely" and "how costly," you can rank actions by expected value. Every practical decision — from medical treatment to portfolio allocation to whether to ship a build — is, secretly, an expected-value calculation.
- Learning from data. The entire machinery of modern machine learning is built on probability. Every loss function you've ever minimized is, in disguise, the negative log-likelihood of some probabilistic model. Cross-entropy, mean squared error, KL divergence — all probability.
2. Sample spaces and events
To talk about probability you need to pin down what "something could happen" even means. The setup has three pieces.
The sample space, usually written $\Omega$ (capital omega), is the set of every possible outcome of the random process you care about. For one coin flip, $\Omega = \{\text{H}, \text{T}\}$. For one die roll, $\Omega = \{1, 2, 3, 4, 5, 6\}$. For the temperature at noon tomorrow, $\Omega$ is the real numbers (or at least some reasonable subset). The only rule is that exactly one element of $\Omega$ happens per run of the experiment.
An outcome, written $\omega$ (lowercase omega), is a single element of $\Omega$ — one specific thing that happened. A single coin flip that came up heads is $\omega = \text{H}$.
An event, usually a capital letter like $A$ or $B$, is a set of outcomes — a question you can ask about the experiment that has a yes/no answer. "The die came up even" is the event $A = \{2, 4, 6\}$. "At least one head in two coin flips" is $A = \{\text{HH}, \text{HT}, \text{TH}\}$. Events are subsets of $\Omega$, and the three set operations you already know translate cleanly to English:
- $A \cup B$ — the union, "A or B (or both) happened."
- $A \cap B$ — the intersection, "both A and B happened."
- $A^c$ (or $\bar{A}$) — the complement, "A did not happen."
- $\emptyset$ — the empty set, an impossible event.
- $\Omega$ itself — the certain event, something in $\Omega$ must happen.
Two events are disjoint (or mutually exclusive) if $A \cap B = \emptyset$ — they can't both happen. "The die came up 2" and "the die came up 5" are disjoint. "The die is even" and "the die is prime" are not, because 2 is in both.
Mental model. The sample space is the board. Outcomes are the squares. Events are regions you can draw on the board. Probabilities are the "amount of paint" you spread over each region. The axioms in the next section are just the rules for how the paint is allowed to be spread.
3. Kolmogorov's axioms
In 1933 Kolmogorov wrote down the whole of probability theory in three axioms. A probability measure $P$ is a function that assigns a real number $P(A)$ to each event $A$ in your sample space, satisfying:
Kolmogorov's three axioms: (1) $P(A) \ge 0$; (2) $P(\Omega) = 1$; (3) for pairwise disjoint events $A_1, A_2, \ldots$, $P\big(\bigcup_{i=1}^\infty A_i\big) = \sum_{i=1}^\infty P(A_i)$.
- $P$
- The probability measure — a function that takes an event (a set of outcomes) and returns a real number in $[0, 1]$.
- $A$, $A_i$
- Events — subsets of the sample space $\Omega$.
- $P(A) \ge 0$
- Non-negativity. Probabilities can't be negative. There is no such thing as "minus twenty percent chance of rain."
- $P(\Omega) = 1$
- Normalization. The probability that something in the sample space happens is exactly 1. Your paint has to cover the whole board.
- $\bigcup_{i=1}^\infty A_i$
- The union of infinitely many events — "at least one of them happens."
- $\sum_{i=1}^\infty P(A_i)$
- The sum of their individual probabilities.
- disjoint
- "Non-overlapping." No two of the $A_i$ can both happen at the same time.
- axiom (3)
- Countable additivity. For non-overlapping events, the probability of "any of them" is the sum of their individual probabilities. This is the only non-obvious axiom and it's what lets you add up paint without double counting.
Why so minimal? From these three rules you can derive every classical result in probability — $P(A^c) = 1 - P(A)$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, monotonicity, the whole thing. Kolmogorov's trick was to stop arguing about what probability "really is" (frequency? belief? a limit?) and just say: here are the rules any reasonable probability must obey. Everything else is a consequence.
A few consequences worth naming, because you'll use them constantly:
- Complement rule. $P(A^c) = 1 - P(A)$. If there's a 30% chance of rain, there's a 70% chance of no rain.
- Inclusion–exclusion. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. Add, then subtract the overlap you double-counted.
- Monotonicity. If $A \subseteq B$ then $P(A) \le P(B)$. A smaller set can't have more probability than a bigger one that contains it.
- Upper bound. $P(A) \le 1$ for every event. This falls out of normalization plus non-negativity of the complement.
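These consequences are easy to sanity-check by brute-force enumeration. A quick sketch in Python (the `P` helper and the event names are mine, not from any library), using a fair die as the sample space:

```python
from fractions import Fraction

# Fair die: every outcome in the sample space gets mass 1/6.
omega = {1, 2, 3, 4, 5, 6}
P = lambda event: Fraction(len(event & omega), len(omega))

even = {2, 4, 6}
prime = {2, 3, 5}

# Complement rule: P(not even) = 1 - P(even)
assert P(omega - even) == 1 - P(even)

# Inclusion-exclusion: P(even or prime) = P(even) + P(prime) - P(even and prime)
assert P(even | prime) == P(even) + P(prime) - P(even & prime)

# Monotonicity: {2} is a subset of even, so P({2}) <= P(even)
assert P({2}) <= P(even)

print(P(even | prime))  # 5/6 -- every face except 1
```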
4. Conditional probability
So far every event stands alone. But the interesting questions almost always have the form "given that X happened, how likely is Y?" That's conditional probability, written $P(A \mid B)$ and read "the probability of A given B."
Conditional probability: $P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$, defined whenever $P(B) > 0$.
- $P(A \mid B)$
- The probability that event $A$ happens, given that we already know $B$ happened. The vertical bar $\mid$ is read "given."
- $P(A \cap B)$
- The probability that both $A$ and $B$ happen — the overlap.
- $P(B)$
- The probability of the event we're conditioning on.
- $P(B) > 0$
- A technicality: you can't condition on something that was impossible to begin with, because you'd be dividing by zero.
Paint intuition. Conditioning on $B$ is "zooming in" on the region $B$ on the board and asking what fraction of that region is also in $A$. You throw away everything outside $B$ and renormalize so that $B$ is the new "certain event." The ratio $P(A \cap B) / P(B)$ is literally "how much of B's paint is also A's."
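The zoom-in picture is literally counting, which you can verify directly. A small sketch (the helper names are mine) with two fair dice, conditioning on "the sum is at least 10" and asking for the probability that the first die shows a 6:

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: 36 equally likely ordered pairs.
omega = set(product(range(1, 7), repeat=2))
P = lambda event: Fraction(len(event), len(omega))

B = {w for w in omega if sum(w) >= 10}   # conditioning event: sum >= 10
A = {w for w in omega if w[0] == 6}      # first die shows 6

# Definition of conditional probability: P(A|B) = P(A and B) / P(B)
cond = P(A & B) / P(B)
print(cond)  # 1/2 -- 3 of the 6 outcomes in B have a 6 first
```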
Rearranging the definition gives the multiplication rule:
Multiplication rule: $P(A \cap B) = P(A \mid B)\, P(B)$
- $P(A \cap B)$
- Joint probability — both events occurring.
- $P(A \mid B)\, P(B)$
- First $B$ happens (with probability $P(B)$), then $A$ happens given $B$ (with probability $P(A \mid B)$). Their product is the joint.
Why care. This is the single identity from which Bayes' theorem falls out in two lines. Memorize this one.
Independence is the special case where conditioning on $B$ tells you nothing new about $A$:
Independence
- $A \perp B$
- Shorthand for "$A$ is independent of $B$." Some texts write $A \perp\!\!\!\perp B$ (a doubled $\perp$) instead.
- $P(A \mid B) = P(A)$
- Knowing $B$ didn't change your belief about $A$.
- $P(A \cap B) = P(A)\, P(B)$
- Equivalent statement: for independent events, the joint is the product of the marginals.
Warning. "Independent" is a mathematical statement, not a physical one. Two events can be physically caused by the same thing and still be statistically independent; two events can be physically unrelated and still be statistically dependent through a common ancestor in your sample space. The math only cares about $P$, not about the story you tell about it.
5. Bayes' theorem
This is the single most important formula on the page. It's what lets you turn evidence into updated beliefs, and it underlies every probabilistic inference technique you will ever use.
The derivation is one line. Start from the multiplication rule above, which says the joint $P(A \cap B)$ can be written two ways: $P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$. Divide both sides by $P(B)$ and you get Bayes' theorem:
Bayes' theorem: $P(A \mid B) = \dfrac{P(B \mid A)\, P(A)}{P(B)}$
- $P(A)$
- Prior. What you believed about $A$ before you saw any evidence.
- $P(B \mid A)$
- Likelihood. How probable the evidence $B$ would be if $A$ were true. This is a function of $A$ for fixed $B$, not a probability in $A$.
- $P(B)$
- Evidence (or marginal likelihood). The total probability of observing $B$ under any hypothesis, computed as $P(B) = P(B \mid A)P(A) + P(B \mid A^c)P(A^c)$. It's the normalizing constant that makes the whole thing add up to 1.
- $P(A \mid B)$
- Posterior. Your updated belief about $A$ after seeing the evidence $B$. This is what you actually care about.
The slogan. Posterior is proportional to likelihood times prior. In symbols, $P(A \mid B) \propto P(B \mid A)\, P(A)$. The evidence $P(B)$ is just the constant that rescales everything to sum to 1. Most of Bayesian inference in practice is computing the top of the fraction and worrying about the bottom later.
The disease-test example
This is the classic Bayes exercise. The numbers are engineered to shock people on first read, and they should.
A disease affects 1 in 1000 people. There is a test that is 99% accurate in both directions: if you have the disease it comes back positive 99% of the time, and if you don't have the disease it comes back negative 99% of the time. You take the test. It's positive. What's the probability you actually have the disease?
Naive intuition: "the test is 99% accurate, so I'm about 99% sure I have it." Let's see what Bayes says. Let $D$ = "you have the disease" and $+$ = "the test is positive."
Disease-test Bayes
- $P(D) = 0.001$
- Prior: the base rate of the disease (1 in 1000).
- $P(D^c) = 0.999$
- Prior that you don't have it.
- $P(+ \mid D) = 0.99$
- Likelihood: test is 99% sensitive — true positive rate.
- $P(+ \mid D^c) = 0.01$
- The false positive rate: 1% of healthy people also test positive.
- $P(+)$
- The total probability of a positive test — this is the sum in the denominator, computed using the law of total probability.
Why this formula has two terms in the denominator. You need $P(+)$, but $+$ can happen two ways: either you have the disease and the test caught it, or you don't and the test gave a false alarm. You add those two mutually exclusive paths to get the total.
Plug in the numbers: $P(D \mid +) = \dfrac{P(+ \mid D)\, P(D)}{P(+)} = \dfrac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \dfrac{0.00099}{0.01098} \approx 0.090$.
The answer
- $0.00099$
- The probability of "truly sick AND tested positive."
- $0.01098$
- The total probability of any positive test — sick-and-positive plus healthy-and-false-positive.
- $0.090$
- About 9%. Despite the test being 99% accurate, a positive result means you have only a 9% chance of actually being sick.
Why this is counterintuitive. The rare disease is rare. In a population of 100,000 people, about 100 have it (roughly 99 of whom test positive), but 1% of the remaining 99,900 healthy people, about 999, also test positive. So out of roughly 1098 positive tests, only about 99 are real. Most positives are false alarms, not because the test is bad, but because the prior is tiny. This is called base-rate neglect, and it is a notoriously common error when interpreting screening tests. Every time you see a Bayesian update, remember: the prior has teeth.
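The whole computation fits in a few lines. A sketch in Python (the `posterior` function and its parameter names are my own, not a library call):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Bayes' theorem for a binary hypothesis, given a positive test."""
    true_pos = sensitivity * prior                    # sick AND positive
    false_pos = false_positive_rate * (1 - prior)     # healthy AND positive
    # Law of total probability in the denominator: the two disjoint paths.
    return true_pos / (true_pos + false_pos)

p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(round(p, 3))  # 0.09
```

Play with the prior: raise it to 0.1 (you have symptoms, say) and the same test suddenly implies you are very probably sick. The likelihoods didn't change; the prior did.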
6. Random variables
Events are great for yes/no questions, but most of what you care about in practice is a number: the value of a measurement, the number of requests per second, the height of a person. Enter the random variable.
Example. Flip a coin twice. $\Omega = \{\text{HH}, \text{HT}, \text{TH}, \text{TT}\}$. Let $X$ be the number of heads. Then $X(\text{HH}) = 2$, $X(\text{HT}) = X(\text{TH}) = 1$, $X(\text{TT}) = 0$. The random variable $X$ has turned the messy outcome space into the clean set $\{0, 1, 2\}$.
Random variables come in two flavors.
Discrete random variables and the PMF
A discrete random variable takes values in a countable set — finite or countably infinite. Its distribution is described by a probability mass function (PMF):
Probability mass function
- $p_X(k)$
- The probability that the random variable $X$ takes the specific value $k$.
- $P(X = k)$
- Shorthand for $P(\{\omega : X(\omega) = k\})$ — the probability of the event "$X$ equals $k$."
- rule
- Must satisfy $p_X(k) \ge 0$ and $\sum_k p_X(k) = 1$. The masses are non-negative and add up to 1.
Paint intuition. You have one unit of paint. You drop a lump of size $p_X(k)$ on each value $k$. The lumps sum to 1.
Continuous random variables and the PDF
A continuous random variable can take any value in an interval or in all of $\mathbb{R}$ — uncountably many options. That breaks the PMF approach, because for a continuous $X$ the probability of hitting any exact value is zero. Instead you have a probability density function (PDF) $f_X(x)$, and probabilities come from integrating it:
Probability density function
- $f_X(x)$
- The density of probability at the point $x$ — not a probability itself. It can be bigger than 1. The units are "probability per unit x."
- $\int_a^b f_X(x)\, dx$
- The area under the density curve between $a$ and $b$. See calculus.html if the integral notation is rusty — it's the continuous cousin of summing.
- total mass
- Must satisfy $f_X(x) \ge 0$ and $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.
- $P(X = x) = 0$
- For continuous $X$, the probability of hitting any single point exactly is zero. Probability lives on intervals, not on points.
Density, not probability. Height doesn't tell you probability; area does. A PDF can spike to 10 at $x = 0$ and that's fine, because "how much paint is near $x=0$" is $f(0) \cdot dx$, a product with a tiny $dx$. The only rule is that the total area under the curve is 1.
The CDF — the universal description
The cumulative distribution function works for any random variable, discrete or continuous:
Cumulative distribution function
- $F_X(x)$
- The probability that $X$ is at most $x$ — "how much paint is to the left of $x$?"
- monotone
- Non-decreasing: as $x$ increases, $F_X(x)$ can only go up or stay flat.
- limits
- $F_X(-\infty) = 0$ and $F_X(+\infty) = 1$. At the far left you've collected no paint; at the far right you've collected all of it.
- relation to PDF
- For continuous $X$, $f_X(x) = F_X'(x)$ — the density is the derivative of the CDF. For discrete $X$, the CDF is a staircase that jumps by $p_X(k)$ at each value $k$.
Why CDFs matter. They let you talk about discrete and continuous distributions with the same language. Also, sampling from an arbitrary distribution is often done by inverting the CDF (the "inverse CDF trick").
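The inverse-CDF trick in action: to sample an Exponential($\lambda$), whose CDF is $F(x) = 1 - e^{-\lambda x}$, invert the CDF and feed it uniforms. A sketch using only the standard library (the function names are mine):

```python
import math, random

def sample_exponential(lam, rng):
    # Inverse-CDF trick: F(x) = 1 - exp(-lam*x), so F^{-1}(u) = -ln(1-u)/lam.
    u = rng.random()            # Uniform(0, 1) -- the universal raw material
    return -math.log(1.0 - u) / lam

rng = random.Random(0)
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))  # should be close to 1/lam = 0.5
```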
7. Common discrete distributions
Four distributions cover 90% of the discrete cases you'll meet in practice. Memorize their PMFs, means, and variances — or at least learn to recognize them on sight.
Bernoulli — the coin flip
Bernoulli distribution
- $X \sim \text{Bernoulli}(p)$
- $X$ is distributed as a Bernoulli with success probability $p$. The tilde $\sim$ is read "is distributed as."
- $p \in [0, 1]$
- The probability of "success" ($X = 1$). "Success" and "failure" are just labels; they don't have to correspond to anything good.
- $k \in \{0, 1\}$
- The only two possible outcomes.
- $\mathbb{E}[X] = p$
- The mean.
- $\operatorname{Var}(X) = p(1-p)$
- The variance. Maximized at $p = 1/2$, where the flip is maximally uncertain.
Where you'll see this. Every binary classification problem. Every dropout mask. Every "did this click happen?" A/B test. The humblest distribution, used constantly.
Binomial — how many successes in n trials
Binomial distribution
- $n$
- The number of independent Bernoulli($p$) trials.
- $k$
- The number of successes observed.
- $\binom{n}{k}$
- The binomial coefficient "$n$ choose $k$" — the number of ways to pick which $k$ of the $n$ trials were the successful ones. Equals $n! / (k!(n-k)!)$.
- $p^k (1-p)^{n-k}$
- The probability of any one specific sequence with $k$ successes and $n-k$ failures.
- $\mathbb{E}[X] = np$
- Mean number of successes. Sum of means of $n$ Bernoullis.
- $\operatorname{Var}(X) = np(1-p)$
- Variance.
The sum of Bernoullis. If $X_1, \ldots, X_n$ are i.i.d. Bernoulli($p$), then $\sum_i X_i \sim \text{Binomial}(n, p)$. "Binomial" is just "total successes in $n$ flips."
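You can watch this identity hold numerically. A quick simulation sketch (seeded for reproducibility; all the names are mine):

```python
import random

rng = random.Random(42)
n, p, trials = 20, 0.3, 50_000

# Each draw: the sum of n independent Bernoulli(p) variables.
draws = [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]

mean = sum(draws) / trials
var = sum((d - mean) ** 2 for d in draws) / trials
print(round(mean, 1), round(var, 1))  # theory: np = 6.0, np(1-p) = 4.2
```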
Geometric — how many trials until the first success
Geometric distribution
- $k$
- The trial number on which the first success occurred.
- $(1-p)^{k-1}$
- Probability that the first $k-1$ trials all failed.
- $p$
- Probability that trial $k$ finally succeeded.
- $\mathbb{E}[X] = 1/p$
- Mean wait time. If the chance per trial is 1/6, it takes 6 trials on average.
- $\operatorname{Var}(X) = (1-p)/p^2$
- Variance.
Memoryless. The geometric has a famous property: no matter how long you've waited, the probability of needing more than $m$ additional trials is the same as if you'd just started. The distribution has no "I'm due for a win" logic.
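Memorylessness is an exact algebraic fact, not just an intuition: $P(X > k) = (1-p)^k$ (all of the first $k$ trials failed), so the conditional tail $P(X > s+t \mid X > s)$ collapses to $P(X > t)$. A two-line check (names mine):

```python
p = 1 / 6  # chance of success per trial, e.g. rolling a six

def tail(k):
    # P(X > k): the first k trials all failed.
    return (1 - p) ** k

# Memorylessness: P(X > s + t | X > s) = P(X > t)
s, t = 10, 4
conditional = tail(s + t) / tail(s)
print(abs(conditional - tail(t)) < 1e-12)  # True
```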
Poisson — rare events over time or space
Poisson distribution
- $\lambda$
- The rate parameter — the expected number of events per unit interval. Lambda is strictly positive.
- $k$
- The number of events observed in that interval.
- $e$
- Euler's number $\approx 2.71828$. It appears because the Poisson is the $n \to \infty$, $p \to 0$ limit of Binomial($n, p$) with $np = \lambda$ fixed.
- $k!$
- Factorial: $k! = k(k-1)(k-2)\cdots 1$.
- $\mathbb{E}[X] = \lambda$
- Mean equals the rate.
- $\operatorname{Var}(X) = \lambda$
- Variance also equals the rate — a signature of the Poisson.
Typical uses. Customer arrivals per hour, typos per page, radioactive decays per second, requests per second at a server. Anything where lots of independent things could happen but each with tiny probability, and you're counting how many actually did.
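You can see the binomial-to-Poisson limit directly by comparing PMFs as $n$ grows with $np = \lambda$ held fixed. A sketch using only the standard library (function names are mine):

```python
import math

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

lam, k = 3.0, 2
# Keep np = lam fixed while n grows; the binomial PMF converges to the Poisson.
for n in (10, 100, 10_000):
    print(n, round(binomial_pmf(n, lam / n, k), 4))
print("poisson", round(poisson_pmf(lam, k), 4))
```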
8. Common continuous distributions
Uniform on an interval
Uniform distribution
- $a, b$
- The left and right endpoints of the interval, with $a < b$.
- $1/(b-a)$
- The constant density. Every point in $[a, b]$ is equally likely (in the density sense).
- $\mathbb{E}[X] = (a + b)/2$
- The midpoint.
- $\operatorname{Var}(X) = (b - a)^2 / 12$
- Grows with the square of the interval width.
The starting point. Every other continuous distribution can be simulated from uniform samples via the inverse-CDF trick. When you call random() in a programming language, you're sampling from $\text{Uniform}(0, 1)$.
Exponential — time between Poisson events
Exponential distribution
- $\lambda$
- Rate parameter, same meaning as the Poisson rate.
- $\lambda e^{-\lambda x}$
- Density that peaks at $x = 0$ and decays exponentially. Most waits are short; long waits get rarer fast.
- $\mathbb{E}[X] = 1/\lambda$
- Mean wait time. If events happen at rate 3 per minute, you wait 1/3 minute on average.
- $\operatorname{Var}(X) = 1/\lambda^2$
- Variance.
Memoryless (again). Like the geometric, the exponential has no memory: $P(X > s + t \mid X > s) = P(X > t)$. The wait until the next event doesn't depend on how long you've already waited.
Gaussian / Normal — the most important distribution in the universe
Gaussian / Normal distribution
- $\mu$
- The mean — center of the bell curve. $\mathbb{E}[X] = \mu$.
- $\sigma$
- The standard deviation — width of the bell curve. Larger $\sigma$ means a wider, flatter bell.
- $\sigma^2$
- The variance, $\operatorname{Var}(X) = \sigma^2$.
- $\mathcal{N}$
- The calligraphic N that universally denotes Normal.
- $\exp$
- The exponential function $e^x$. The argument is a negative squared term, so the density is a bell that decays on both sides.
- $1 / (\sigma\sqrt{2\pi})$
- The normalization constant that makes the whole integral equal to 1.
Why it's everywhere. The Central Limit Theorem (§13) says that sums of independent things — no matter what distribution they started from — look Gaussian in the limit. That's why measurement noise, heights, test scores, and a thousand other real-world quantities are roughly Gaussian. And it's why half of ML starts with "assume the residual is Normal."
68–95–99.7 rule. About 68% of the mass is within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$. Internalize these. They let you eyeball whether a number is "within normal" without computing anything.
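These three numbers come straight from the Gaussian CDF: $P(|X - \mu| \le k\sigma) = \operatorname{erf}(k/\sqrt{2})$ for any Gaussian, which the standard library can evaluate. A quick check (the helper name is mine):

```python
import math

def within_k_sigma(k):
    # P(mu - k*sigma <= X <= mu + k*sigma) for ANY Gaussian = erf(k / sqrt(2)).
    # Standardizing removes mu and sigma, so the answer depends only on k.
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))  # 0.6827, 0.9545, 0.9973
```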
9. Interactive: the Gaussian sandbox
Drag the sliders to change the mean $\mu$ and standard deviation $\sigma$ of a Gaussian and watch the PDF and CDF update together. The dashed lines mark the 68% interval $\mu \pm \sigma$. Notice how the PDF's height changes inversely with $\sigma$ — a wider bell is necessarily a shorter one, so the total area stays 1.
PDF (cyan) and CDF (violet) of $\mathcal{N}(\mu, \sigma^2)$.
10. Expectation and variance
Two numbers summarize a distribution well enough for most purposes: the center and the spread. The center is the expectation; the spread is the variance. These are the first two moments of the distribution.
Expectation
Expectation
- $\mathbb{E}[X]$
- The expected value or mean of the random variable $X$. The "blackboard bold" $\mathbb{E}$ is universal notation.
- $\sum_k k\, p_X(k)$
- For a discrete RV: sum each value times its probability. The "weighted average of outcomes, weighted by probability."
- $\int x\, f_X(x)\, dx$
- The continuous analog — the integral of $x$ weighted by the density. See calculus.html if the integral feels opaque.
Center of mass. If you printed the PMF or PDF on cardboard and tried to balance it on a knife edge, the balance point would be $\mathbb{E}[X]$. It is not "the most likely value" (that's the mode) or "the median" — it's the balance point. For symmetric distributions like the Gaussian, all three coincide. For skewed distributions, they don't.
Linearity of expectation
The single most useful property of $\mathbb{E}$:
Linearity of expectation
- $X, Y$
- Any two random variables on the same sample space. They do not have to be independent.
- $a, b$
- Any real constants.
Why this is a superpower. Independence is a strong assumption. Linearity doesn't need it. You can compute the expected number of fixed points in a random permutation, or the expected number of empty bins in a hash table, by decomposing the quantity as a sum of indicator variables and adding their means. The variables are usually not independent, and it doesn't matter.
Proof sketch. For discrete variables, $\mathbb{E}[X + Y] = \sum_{\omega} (X(\omega) + Y(\omega)) P(\{\omega\}) = \sum_\omega X(\omega) P(\{\omega\}) + \sum_\omega Y(\omega) P(\{\omega\}) = \mathbb{E}[X] + \mathbb{E}[Y]$. Sums factor. The continuous proof is the same with integrals.
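The fixed-points example from above, as a simulation: position $i$ is a fixed point with probability $1/n$, and there are $n$ positions, so linearity predicts an average of exactly $n \cdot (1/n) = 1$, even though the indicator variables are dependent. A sketch (names mine):

```python
import random

rng = random.Random(7)

def fixed_points(n, rng):
    # Count positions i where a uniformly random permutation maps i to itself.
    perm = list(range(n))
    rng.shuffle(perm)
    return sum(i == perm[i] for i in range(n))

# Linearity of expectation: E[# fixed points] = n * (1/n) = 1,
# with no independence assumption needed anywhere.
trials = 20_000
avg = sum(fixed_points(50, rng) for _ in range(trials)) / trials
print(round(avg, 1))  # close to 1
```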
Variance and standard deviation
Variance
- $\operatorname{Var}(X)$
- Variance — the expected squared deviation from the mean. A measure of how spread out $X$ is.
- $(X - \mathbb{E}[X])^2$
- The squared gap between the random variable and its mean. Squared so negatives don't cancel positives.
- $\mathbb{E}[X^2] - \mathbb{E}[X]^2$
- An equivalent, often more convenient form. "Expectation of the square minus the square of the expectation." Remember this identity; you'll use it all the time.
- standard deviation
- $\sigma(X) = \sqrt{\operatorname{Var}(X)}$. Same units as $X$, which is why you usually report $\sigma$ rather than $\operatorname{Var}$ in practice.
Not linear. $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$ and $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\operatorname{Cov}(X, Y)$. The variance of a sum is the sum of variances only when $X$ and $Y$ are uncorrelated.
The law of the unconscious statistician
You often want the expectation not of $X$ itself but of some function $g(X)$. The "LOTUS" rule says you don't need to compute the distribution of $g(X)$; you just weight $g(x)$ by $X$'s original density:
Law of the unconscious statistician
- $g$
- Any function you want to apply to the random variable — e.g., $g(x) = x^2$, $g(x) = \log x$.
- $\mathbb{E}[g(X)]$
- The expected value of the transformed variable.
- $g(k)\, p_X(k)$
- You evaluate $g$ at each outcome $k$ and weight by $k$'s original probability — no need to find the PMF of $g(X)$.
Why "unconscious"? The name is a joke: it's the formula beginners use without realizing they're implicitly using a theorem. The theorem is that this shortcut agrees with the "proper" calculation of first finding $g(X)$'s distribution and then computing its mean. It always does.
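Here are the shortcut and the "proper" calculation side by side for $g(x) = x^2$ on a fair die; they agree, which is exactly what the theorem promises. A sketch (names mine):

```python
from fractions import Fraction

pmf = {k: Fraction(1, 6) for k in range(1, 7)}  # fair die

# LOTUS shortcut: E[X^2] = sum over k of k^2 * p(k).
E_X2 = sum(k * k * p for k, p in pmf.items())

# The "proper" route: build the PMF of Y = X^2 first, then take its mean.
pmf_Y = {}
for k, p in pmf.items():
    pmf_Y[k * k] = pmf_Y.get(k * k, 0) + p
E_Y = sum(y * p for y, p in pmf_Y.items())

print(E_X2, E_X2 == E_Y)  # 91/6 True
```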
11. Joint, marginal, and conditional distributions
Once you have more than one random variable — say $X$ and $Y$ — three new objects appear, and it pays to keep them straight.
Joint. The joint distribution $p_{X,Y}(x, y)$ (or $f_{X,Y}$ in the continuous case) tells you the probability of every combination of values:
Joint distribution
- $X, Y$
- Two random variables on the same sample space.
- $p_{X,Y}(x, y)$
- The joint PMF (or density, in the continuous case) — how much probability mass sits at the point $(x, y)$.
- $(X = x, Y = y)$
- The event "$X$ equals $x$ and $Y$ equals $y$ simultaneously."
Picture. The joint is a heatmap over the plane; integrating/summing it gives 1.
Marginal. If you have the joint and you only care about $X$, you sum (or integrate) out $Y$:
Marginalization
- $p_X(x)$
- The marginal distribution of $X$ — what you get if you ignore $Y$ entirely.
- $\sum_y$ / $\int dy$
- Summing or integrating over all possible values of the variable you want to eliminate.
The word "marginal" is literal. Put the joint in a table. Sum each row; write the total in the right margin. Those numbers are the marginal distribution of the row variable. Same for columns.
Conditional. This is just the conditional probability rule applied to joint densities:
Conditional distribution
- $p_{X \mid Y}(x \mid y)$
- The distribution of $X$ given that you've observed $Y = y$.
- $p_{X,Y}(x, y)$
- The joint.
- $p_Y(y)$
- The marginal of $Y$ — the normalizer that rescales the slice so it sums to 1.
The whole game. Joint, marginal, conditional. Bayes' theorem, expectation, and every Bayesian inference algorithm you'll ever see is built out of just these three moves plus some bookkeeping.
Covariance and correlation
Covariance
- $\operatorname{Cov}(X, Y)$
- A measure of how $X$ and $Y$ vary together. Positive when they tend to be large/small at the same time, negative when one being large predicts the other being small.
- $(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])$
- Product of the deviations from each mean.
- units
- Covariance has the units of $X$ times the units of $Y$, which is awkward. That's why people prefer the dimensionless correlation.
Independence implies zero covariance. The converse is not true: you can have $\operatorname{Cov}(X, Y) = 0$ without $X$ and $Y$ being independent. Covariance only catches linear dependence.
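A minimal example of zero covariance without independence, worked exactly (names mine): let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$, so $Y$ is a deterministic function of $X$.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2. Y is completely determined by X --
# as dependent as it gets -- yet their covariance is exactly zero.
xs = [-1, 0, 1]
p = Fraction(1, 3)

E_X = sum(x * p for x in xs)             # 0 by symmetry
E_Y = sum(x * x * p for x in xs)         # 2/3
E_XY = sum(x * (x * x) * p for x in xs)  # E[X^3] = 0 by symmetry

cov = E_XY - E_X * E_Y
print(cov)  # 0

# Dependence check: P(Y = 1 | X = 1) = 1, but P(Y = 1) = 2/3 --
# conditioning on X changes the belief about Y, so they are not independent.
```

The relationship here is perfectly predictable but purely quadratic, and covariance only sees the linear part.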
Correlation
- $\rho$
- Pearson correlation coefficient. Bounded in $[-1, 1]$.
- $\sigma(X), \sigma(Y)$
- Standard deviations of $X$ and $Y$.
- $\rho = 1$
- Perfect positive linear relationship.
- $\rho = -1$
- Perfect negative linear relationship.
- $\rho = 0$
- Uncorrelated — not the same as independent.
12. Multivariate Gaussian
The single most important distribution in modern ML. Once you stack several Gaussian variables into a vector and allow them to be correlated, you get the multivariate normal:
Multivariate Gaussian
- $\mathbf{x} \in \mathbb{R}^d$
- A $d$-dimensional random vector. Bold lowercase for vectors.
- $\boldsymbol{\mu} \in \mathbb{R}^d$
- The mean vector — the center of the distribution in $d$ dimensions.
- $\boldsymbol{\Sigma}$
- The $d \times d$ covariance matrix. Entry $\Sigma_{ij} = \operatorname{Cov}(x_i, x_j)$. The diagonal holds the individual variances; the off-diagonals hold the pairwise covariances. See linear-algebra.html for matrices and quadratic forms.
- $|\boldsymbol{\Sigma}|$
- The determinant of the covariance matrix — a scalar measure of its "volume." Shows up in the normalizer.
- $\boldsymbol{\Sigma}^{-1}$
- The precision matrix — inverse of the covariance. It measures concentration, not spread: a large precision in a direction means the distribution is tight in that direction.
- $(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$
- The Mahalanobis distance squared — a generalized "how far from the mean" that accounts for correlations between dimensions.
- $(2\pi)^d$
- Normalization factor that generalizes the $\sqrt{2\pi}$ from the 1D case.
Why so central. The multivariate Gaussian is closed under linear transformations, marginalization, and conditioning — meaning every projection, slice, or linear combination of a Gaussian is still Gaussian. Nothing else is this well-behaved. Half of classical ML (PCA, Kalman filters, Gaussian processes, LDA, variational inference) is built on top of this single fact.
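One payoff of closure under linear maps is a standard sampling recipe: factor $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$ (a Cholesky factorization) and push independent standard normals through $\mathbf{x} = \boldsymbol{\mu} + \mathbf{L}\mathbf{z}$. A 2-D sketch in plain Python (the names and the example $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ are mine):

```python
import random

rng = random.Random(0)
mu = (1.0, -2.0)
Sigma = ((2.0, 0.8),
         (0.8, 1.0))

# Cholesky factor of a 2x2 covariance: Sigma = L L^T with L lower-triangular.
l11 = Sigma[0][0] ** 0.5
l21 = Sigma[1][0] / l11
l22 = (Sigma[1][1] - l21 ** 2) ** 0.5

def sample():
    # A linear map of independent standard normals is Gaussian
    # with mean mu and covariance L L^T = Sigma.
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return (mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2)

pts = [sample() for _ in range(200_000)]
mean_x = sum(p[0] for p in pts) / len(pts)
cov_xy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in pts) / len(pts)
print(round(mean_x, 1), round(cov_xy, 1))  # near 1.0 and 0.8
```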
13. Law of large numbers and the Central Limit Theorem
Two results that tie probability to reality and are the reason Gaussians show up everywhere.
The Law of Large Numbers (LLN)
If $X_1, X_2, \ldots$ are independent random variables drawn from the same distribution with mean $\mu$, and you take the sample mean $\bar{X}_n = \tfrac{1}{n} \sum_{i=1}^n X_i$, then:
Law of large numbers
- $X_i$
- Independent and identically distributed (i.i.d.) draws from the same distribution.
- $\bar{X}_n$
- Sample mean of the first $n$ draws.
- $\mu = \mathbb{E}[X_i]$
- True population mean.
- convergence
- The sample average converges to the true mean as $n$ grows. There are two precise versions (weak LLN and strong LLN) that differ in what kind of convergence they guarantee.
What this is saying. Averaging many independent measurements gets you closer to the truth. The reason Monte Carlo simulation works. The reason frequentist statistics makes sense. The reason "more data" is almost always better.
The Central Limit Theorem (CLT)
The LLN tells you where $\bar{X}_n$ ends up. The CLT tells you how it gets there — specifically, what the distribution of the residual error looks like.
Central Limit Theorem
- $\bar{X}_n$
- Sample mean of $n$ i.i.d. draws.
- $\mu, \sigma$
- Population mean and standard deviation (assumed finite).
- $\sqrt{n}\, (\bar{X}_n - \mu)/\sigma$
- The standardized error: gap from the truth, rescaled by the standard error $\sigma/\sqrt{n}$.
- $\mathcal{N}(0, 1)$
- Standard normal — mean zero, variance one.
- convergence
- "Converges in distribution to" — the CDF of the left side approaches the CDF of the right side everywhere.
Why this is magic. The $X_i$ can be from any distribution with finite variance. Uniform, exponential, Bernoulli, something weirder — doesn't matter. Their sample mean still looks Gaussian in the limit. This is why, even though the real world is full of non-Gaussian distributions, every averaged quantity from polls to thermometer readings is Gaussian in practice. The Gaussian is an attractor.
The size of the error. The standard error shrinks like $\sigma / \sqrt{n}$, not $1/n$. To cut your uncertainty in half, you need four times as much data. That $\sqrt{n}$ rate is the single most important scaling law in statistics.
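You can watch the $\sqrt{n}$ rate empirically: quadruple the sample size and the spread of the sample mean should roughly halve. A sketch (names mine; seeded for reproducibility):

```python
import random, statistics

rng = random.Random(1)

def std_error_of_mean(n, reps=2000):
    # Empirical spread of the sample mean of n Uniform(0,1) draws.
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(reps)]
    return statistics.pstdev(means)

se_100 = std_error_of_mean(100)
se_400 = std_error_of_mean(400)
# Theory: se scales like sigma/sqrt(n), so 4x the data halves the error.
print(round(se_100 / se_400, 1))  # ~2
```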
14. Entropy, cross-entropy, KL divergence
Information theory is the bridge between probability and machine learning. Three quantities are essential.
Entropy
Entropy
- $H(p)$
- The entropy of the distribution $p$ — a measure of its uncertainty or "spread" in bits (if $\log$ is base 2) or nats (if $\log$ is natural).
- $p(x)$
- The probability the distribution assigns to outcome $x$.
- $-\log p(x)$
- The surprise of outcome $x$. Rare outcomes are surprising; certain outcomes are not. Surprise is never negative because $p(x) \le 1$ makes $\log p(x) \le 0$.
- summation
- Weighted average of surprise over the distribution — "how surprised do you expect to be?"
Information content. A fair coin has entropy $\log 2 = 1$ bit — one yes/no question per flip. A biased coin has less. A deterministic outcome has entropy 0 because nothing is left to learn. Entropy measures how much information is needed to describe a sample from $p$.
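The coin examples above can be checked directly. This is a small sketch (the helper name `entropy_bits` is ours, not a library function) using base-2 logs so the answer comes out in bits:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # drop zero-probability outcomes
    return -np.sum(nz * np.log2(nz))

print(f"fair coin      H = {entropy_bits([0.5, 0.5]):.4f} bits")   # 1.0000
print(f"biased coin    H = {entropy_bits([0.9, 0.1]):.4f} bits")   # 0.4690
print(f"deterministic  H = {entropy_bits([1.0, 0.0]):.4f} bits")   # 0.0000
```

The biased coin needs less than half a bit per flip on average: most flips are the unsurprising heads.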
Cross-entropy
Cross-entropy
- $p$
- The true distribution of the data.
- $q$
- Your model's distribution — the one you're using to "encode" outcomes from $p$.
- $-\log q(x)$
- The surprise your model feels at seeing $x$.
- $\sum_x p(x)(-\log q(x))$
- Your model's average surprise when outcomes are actually drawn from $p$.
- $H(p, q) \ge H(p)$
- Always, with equality only when $p = q$. Your model is never less surprised than the best possible model.
Why every classifier uses it. When you train a classifier with the "cross-entropy loss," you are minimizing $H(p, q)$ where $p$ is the true one-hot label distribution and $q$ is your softmax output. Cross-entropy is a strictly proper scoring rule: minimizing it in expectation drives your model's predicted distribution toward the true conditional distribution given the input.
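A minimal sketch of that loss (the logits and the `softmax` helper here are hypothetical, not from any framework): with a one-hot $p$, the sum $H(p, q)$ collapses to a single term, the negative log-probability of the true class.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # hypothetical class scores
q = softmax(logits)                  # model distribution over 3 classes
label = 0                            # true class; one-hot p puts all mass here

# H(p, q) with one-hot p is just the surprise at the true class:
loss = -np.log(q[label])
print(f"q = {np.round(q, 4)}, cross-entropy loss = {loss:.4f}")
```

Raising the true class's logit lowers the loss; spreading probability onto the wrong classes raises it. That single scalar is the entire training signal for a classifier.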
KL divergence
Kullback–Leibler divergence
- $D_{\text{KL}}(p \,\|\, q)$
- The "distance" from $p$ to $q$ — how many extra nats of surprise you incur by using $q$ when the truth is $p$.
- $\log (p(x) / q(x))$
- Log-ratio of the true and model probabilities. Positive where $p > q$, negative where $p < q$.
- $H(p, q) - H(p)$
- KL equals cross-entropy minus entropy. Since $H(p)$ doesn't depend on $q$, minimizing cross-entropy over $q$ is the same as minimizing KL.
- $D_{\text{KL}} \ge 0$
- Always non-negative, zero only when $p = q$ almost everywhere. This is Gibbs' inequality.
- not a metric
- $D_{\text{KL}}$ is asymmetric: $D_{\text{KL}}(p \| q) \ne D_{\text{KL}}(q \| p)$ in general. It's not a distance in the usual sense, despite the name.
Where you'll meet it. Variational inference minimizes KL between an approximate posterior and the true one. VAEs have a KL regularizer pulling the latent code toward a prior. DPO and RLHF use a KL term to keep the fine-tuned policy from drifting too far from the base model. Diffusion models train by matching a Gaussian reverse process to a KL-minimizing target. It is everywhere in modern ML.
15. Worked example: a Bayesian coin-flip
Let's do a full Bayesian update from start to finish. You have a coin. You suspect it might be biased. Before flipping, you believe the bias parameter $\theta$ (the probability of heads) is uniformly distributed on $[0, 1]$ — every value equally plausible.
Prior. $f(\theta) = 1$ for $\theta \in [0, 1]$.
You flip the coin 10 times and observe 7 heads. What should you believe about $\theta$ now?
Likelihood. Given $\theta$, the probability of seeing 7 heads in 10 flips is a Binomial probability. Viewed as a function of $\theta$ with the data held fixed, it becomes the likelihood:
Coin-flip likelihood
- $\theta$
- The unknown bias parameter we're trying to learn.
- $\binom{10}{7}$
- The number of ways to arrange 7 heads in 10 slots. It's a constant in $\theta$ and will be absorbed into the normalizer.
- $\theta^7 (1 - \theta)^3$
- The "shape" of the likelihood — the part that actually depends on $\theta$.
Posterior. By Bayes,
Coin-flip posterior
- $f(\theta \mid \text{data})$
- The posterior — your updated density over $\theta$ after seeing the 10 flips.
- $\propto$
- "Proportional to." We're dropping the constants because they'll be set by requiring the density to integrate to 1.
- $\theta^7 (1 - \theta)^3$
- The unnormalized posterior — the shape of what you believe now.
It's a Beta distribution. $\theta^{a-1}(1-\theta)^{b-1}$ is the unnormalized form of Beta($a$, $b$). So your posterior here is Beta($8$, $4$). Its mean is $8/(8+4) = 2/3 \approx 0.667$. So after 7 heads in 10 flips, your best estimate of $\theta$ is 0.667, not 0.7 — because your uniform prior is still pulling the estimate slightly toward 0.5. The prior has teeth even here.
This is conjugacy. When the prior and posterior come from the same family (uniform is Beta(1,1); posterior is Beta($k+1$, $n-k+1$)), we say the prior is conjugate to the likelihood. This is the Bayesian freebie: the update is closed-form. For coin flips the Beta is conjugate to the Bernoulli/Binomial. Most real problems don't have conjugate priors, which is why people invent MCMC and variational methods.
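Conjugacy also makes the update incremental: you can process the flips one at a time, feeding each posterior in as the next prior, and land on the same Beta(8, 4) as the batch update. A sketch (the particular flip sequence is made up; any ordering with 7 heads and 3 tails works):

```python
# One hypothetical sequence with 7 heads (1) and 3 tails (0),
# processed one flip at a time.
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

a, b = 1, 1                      # Beta(1, 1) = Uniform prior
for x in flips:
    a, b = a + x, b + (1 - x)    # heads bump a, tails bump b

print(f"posterior = Beta({a}, {b}), mean = {a / (a + b):.4f}")
# Same Beta(8, 4) as the batch update -- the order of the data doesn't matter.
```

That order-independence is a general feature of i.i.d. data under a conjugate prior: the sufficient statistics (here, counts of heads and tails) are all the posterior ever sees.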
16. Code — NumPy and SciPy
Three tabs: sampling from common distributions and summarizing them, a Bayesian update from the worked example, and computing KL divergence numerically.
import numpy as np
from scipy import stats
rng = np.random.default_rng(42)
# Draw 10,000 samples from each of three common distributions
bern = rng.binomial(n=1, p=0.3, size=10000) # Bernoulli(0.3)
pois = rng.poisson(lam=4.0, size=10000) # Poisson(lambda=4)
norm = rng.normal(loc=0.0, scale=1.0, size=10000) # N(0, 1)
for name, s, true_mean, true_var in [
    ("Bern(0.3)", bern, 0.3, 0.3 * 0.7),
    ("Pois(4.0)", pois, 4.0, 4.0),
    ("N(0,1)", norm, 0.0, 1.0),
]:
    print(f"{name:10s} sample mean = {s.mean():+.3f} (true {true_mean:+.3f})"
          f"  sample var = {s.var():.3f} (true {true_var:.3f})")
# Use scipy.stats for PDFs / PMFs / CDFs directly
x = np.linspace(-4, 4, 200)
pdf = stats.norm.pdf(x, loc=0, scale=1) # Gaussian density
cdf = stats.norm.cdf(x, loc=0, scale=1) # Gaussian CDF
print(f"P(|X| < 1) for N(0,1) ≈ {stats.norm.cdf(1) - stats.norm.cdf(-1):.4f}")
# → 0.6827, the 68% rule.
import numpy as np
from scipy import stats
# Bayesian coin-flip update — §15 worked example.
# Prior: Uniform(0, 1) = Beta(1, 1).
# Data: 10 flips, 7 heads.
# Posterior: Beta(1 + heads, 1 + tails) = Beta(8, 4).
heads, tails = 7, 3
a_prior, b_prior = 1, 1
a_post, b_post = a_prior + heads, b_prior + tails
post = stats.beta(a_post, b_post)
print(f"posterior mean = {post.mean():.4f}") # 0.6667
print(f"posterior mode = {(a_post-1)/(a_post+b_post-2):.4f}") # 0.7000
print(f"95% credible iv = [{post.ppf(0.025):.3f}, {post.ppf(0.975):.3f}]")
# Disease-test Bayes — §5 worked example.
def disease_posterior(prior, sensitivity, specificity):
    p_pos_given_d = sensitivity
    p_pos_given_nd = 1 - specificity
    p_pos = p_pos_given_d * prior + p_pos_given_nd * (1 - prior)
    return p_pos_given_d * prior / p_pos
p = disease_posterior(prior=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive) = {p:.4f}") # 0.0902
import numpy as np
from scipy import stats
# KL divergence between two discrete distributions.
def kl_discrete(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print(f"KL(p || q) = {kl_discrete(p, q):.4f}")
print(f"KL(q || p) = {kl_discrete(q, p):.4f}") # different — KL is asymmetric
# Entropy and cross-entropy
def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(np.asarray(p) * np.log(np.asarray(q) + eps))
print(f"H(p) = {entropy(p):.4f}")
print(f"H(p, q) = {cross_entropy(p, q):.4f}")
print(f"KL = H(p,q) - H(p) = {cross_entropy(p, q) - entropy(p):.4f}")
# KL between two Gaussians has a closed form:
# KL( N(mu1, s1^2) || N(mu2, s2^2) )
# = log(s2/s1) + (s1^2 + (mu1-mu2)^2) / (2 s2^2) - 1/2
def kl_gaussian(mu1, s1, mu2, s2):
    return (np.log(s2 / s1)
            + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2)
            - 0.5)
print(f"KL(N(0,1) || N(1,1)) = {kl_gaussian(0, 1, 1, 1):.4f}")
17. Connections to machine learning
Everything in this page shows up, by name, in modern ML. A partial tour:
- Cross-entropy loss. Every classifier you've trained in the last decade minimizes $H(p_{\text{true}}, p_{\text{model}})$ — cross-entropy between the one-hot label and the softmax output. The softmax is literally just "turn logits into a categorical probability distribution," and cross-entropy is the natural loss for matching it to the target.
- Negative log-likelihood. The loss used to train language models (see foundation models) is $-\log p_\theta(x_t \mid x_{<t})$, summed over token positions — the negative log-likelihood of each next token given its context. Minimizing it is maximum-likelihood estimation, one token at a time.
- KL regularization in RLHF and DPO. Preference optimization adds a KL term $\beta \cdot D_{\text{KL}}(\pi_{\text{new}} \,\|\, \pi_{\text{ref}})$ that keeps the fine-tuned policy close to the base model. Without it, RLHF would happily collapse into gibberish that maximizes the reward model.
- VAEs and diffusion. Diffusion models and variational autoencoders both optimize an evidence lower bound that's a sum of a reconstruction term and a KL divergence between an approximate posterior and a prior.
- Dropout. Each neuron is "kept" with probability $p$. That's a Bernoulli mask applied element-wise to the hidden state — literally sampling from Bern($p$) during training.
- Softmax = categorical distribution. The softmax output layer is just a Categorical distribution parameterized by logits. Sampling from it is "sampling from a softmax."
- Reparameterization trick. VAEs sample $z = \mu + \sigma \cdot \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ so the gradient can flow through. That's the Gaussian and nothing more.
- Bayes in optimization. Bayesian optimization uses a Gaussian process prior over the objective function and a Bayes update after each evaluation to decide where to sample next.
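Three of the bullets above fit in a few lines each. These are illustrative sketches, not framework code — the shapes, logits, and seed are all made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Dropout: an elementwise Bernoulli(keep_prob) mask on a hidden state,
# with the standard "inverted dropout" rescaling.
h = rng.normal(size=8)
keep_prob = 0.8
mask = rng.binomial(n=1, p=keep_prob, size=h.shape)
h_dropped = h * mask / keep_prob
print("kept units:", mask)

# Softmax = categorical: turn logits into probabilities, then sample a class.
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
sample = rng.choice(len(probs), p=probs)
print("sampled class:", sample)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
# so gradients can flow through mu and sigma.
mu, sigma = 1.5, 0.5
eps = rng.normal(size=1000)
z = mu + sigma * eps
print(f"z mean ≈ {z.mean():.3f}, z std ≈ {z.std():.3f}")
```

The last print should land near (1.5, 0.5): the sampled `z` really is distributed as $\mathcal{N}(\mu, \sigma^2)$, but written so that $\mu$ and $\sigma$ are ordinary differentiable inputs.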
If you only remember one thing from this page, let it be this. Every time you take a negative log, you are computing a surprise. Every time you sum surprises weighted by the truth, you are computing cross-entropy. Every time you train a neural network on a classification or language task, you are doing Bayesian inference in disguise — with a frequentist accent, but Bayesian machinery. The whole field rests on §5 and §14 of this page.
Further reading
- Sheldon Ross — A First Course in Probability. The standard undergraduate text. Clear, classical, lots of worked problems.
- Grimmett & Stirzaker — Probability and Random Processes. A step up in sophistication. Where to go when you want the measure-theoretic underpinnings without a full analysis course.
- David MacKay — Information Theory, Inference, and Learning Algorithms. The bridge from probability to machine learning. Free PDF on his website. Opinionated, Bayesian, wonderful.
- E. T. Jaynes — Probability Theory: The Logic of Science. The manifesto for probability as extended logic. Not a first book, but transformative as a second or third.
- Blitzstein & Hwang — Introduction to Probability. Modern, friendly, with a matching online course (Harvard Stat 110). Excellent intuition-building.