Statistics

Statistics is the art of working backwards from data to the unknown process that produced it. Probability tells you what data to expect from a known coin; statistics tells you what coin you've probably got after watching it flip. This page takes you from the mean of a small dataset all the way to why cross-entropy loss in a neural network is really just maximum likelihood wearing a funny hat.

Prereq: probability, calculus · Read time: ~35 min · Interactive figures: 1 · Code: NumPy / SciPy

1. Why statistics — the bridge from data to claims

If you have read the probability page, you saw the forward story. You start with a known model — a fair coin, a Gaussian with known mean and variance, a roll of a twenty-sided die — and you compute what data it would produce. That direction is called generative. It's the direction physics and simulations love.

Real life almost never hands you the model. Real life hands you data. You get a spreadsheet, a log file, a clinical trial, a dataset of user clicks, or a batch of token-level losses from your latest language-model training run. Something generated that data, but you don't know what. Statistics is the inverse problem:

the fork in the road
Probability. Given the process, predict the data. (Model → data.)
Statistics. Given the data, infer the process. (Data → model.)

That inversion is surprisingly hard because it's under-determined. Many different generating processes can produce the same-looking data, and many different datasets can be drawn from the same process. So statistics cannot ever hand you a single right answer. It hands you a best guess plus an honest description of how uncertain the guess is. That pairing — estimate and uncertainty — is the product. Everything on this page is a tool for producing one or the other.

You will use statistics any time you need to answer questions like these:

Is variant B's click-through rate really higher than variant A's, or did we just get lucky this week?
Does the drug outperform the placebo, or is the difference within the noise?
Is the new model checkpoint actually better, or would a rerun with a different seed erase the gap?
How many users do we need to measure before the average is trustworthy?

Every one of those is the same question wearing different clothes: given a finite sample, what can you honestly say about the underlying population? This is the question statistics answers, and the answer is always probabilistic.

A useful habit: whenever you see a number reported (accuracy, click-through, revenue, loss), silently ask two follow-up questions. "Compared to what?" and "How sure are you?" Statistics is the machinery that makes the second question answerable. The first question is just good taste.

2. Vocabulary cheat-sheet

Here's the minimal symbol table you'll see on this page. Keep this tab open; it will save you flipping back.

$X_1, X_2, \dots, X_n$
Random variables representing the outcomes of $n$ independent measurements. Before the data is collected they are uncertain; after, you get specific numbers $x_1, \dots, x_n$.
$n$
Sample size — how many data points you have.
$\bar{X}$
Sample mean: the average of $X_1, \dots, X_n$.
$\mu$
Population mean: the true, usually unknown, average over the whole population.
$\sigma$
Population standard deviation: the true spread.
$s$
Sample standard deviation: the spread estimated from the sample.
$\theta$
A generic parameter of a model — something you'd like to know.
$\hat{\theta}$
An estimator of $\theta$: a function of the data that approximates $\theta$. The hat means "estimated from data".
$\mathbb{E}[\cdot]$
Expected value. Average over the underlying probability distribution.
$\text{Var}[\cdot]$
Variance. Expected squared deviation from the mean.
$p$-value
Probability of seeing data at least as extreme as yours, assuming the null hypothesis is true. More on this later.

A small nomenclatural warning. Anywhere you see a Greek letter ($\mu$, $\sigma$, $\theta$), it is almost always a population parameter — unknown, theoretical, something you'd like to pin down. Anywhere you see a Latin letter with a hat or a bar ($\bar{X}$, $\hat{\theta}$, $s$), it is a function of your actual data — known, computable, and used to estimate its Greek counterpart. This Greek-vs-Latin convention is not universal but it is nearly universal, and keeping it straight will save you a lot of confusion.

One more notational quirk you'll see. Capital letters like $X$ are random variables (not-yet-observed), lower-case $x$ are specific realized values (already-observed). So when we write $\mathbb{E}[\bar{X}] = \mu$, we're making a statement about the sample mean as a random variable, before we see the data. After you see the data, $\bar{X}$ becomes a specific number $\bar{x}$, and the statement stops being a probability claim. It's a subtle but useful distinction — especially when reading older texts where the convention is enforced rigorously.

3. Descriptive statistics — summarizing what you have

Before you infer anything, you describe. Descriptive statistics is the honest first pass: compress a dataset of a thousand numbers into a handful of summaries that capture "where the data sits" and "how spread out it is". No probability, no model, no claim about the world — just an accurate summary of the numbers in front of you.

3.1 Measures of center

Given a dataset $x_1, x_2, \dots, x_n$, the three classic measures of "typical value":

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Sample mean (average)

$\bar{x}$
The sample mean. Sometimes written $\bar{X}$ when we mean it as a random variable and $\bar{x}$ when we mean the specific number we computed.
$n$
Sample size: how many numbers you have.
$x_i$
The $i$-th observation, for $i$ running from 1 to $n$.
$\sum_{i=1}^{n}$
Sum over all $n$ observations.
$\frac{1}{n}$
Divide by the count to get an average rather than a total.

Intuition. If you piled all the observations on a seesaw placed along the number line, the mean is the pivot point where it balances. It's the unique value that makes the total "signed distance" from itself equal to zero.

The median is the middle value once you sort the data. Half the points are below it, half above. For an even $n$, average the two middle values. The mode is the most common value — useful mostly for categorical data.

mean vs median trap
The mean gets pulled around by outliers. A single billionaire in a room of janitors raises the mean income by a fortune but leaves the median almost untouched. Whenever someone reports "average income", "average response time", or "average click-through rate", your first question should be "do you mean the mean or the median?" — they can be wildly different.
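The billionaire effect is easy to see with a few lines of NumPy (the incomes below are made up for illustration):

```python
import numpy as np

# Hypothetical incomes in thousands: nine janitors and one billionaire.
incomes = np.array([32, 35, 38, 40, 41, 43, 45, 48, 50, 1_000_000])

mean = incomes.mean()        # dragged far to the right by the one outlier
median = np.median(incomes)  # barely notices it

print(f"mean   = {mean:,.1f}")    # 100,037.2
print(f"median = {median:,.1f}")  # 42.0
```

One extreme point moved the mean by five orders of magnitude while the median stayed with the crowd.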

3.2 Measures of spread

Knowing the center isn't enough. Two datasets can share the same mean yet look nothing alike — one tightly clustered, one wildly scattered. You need a spread measure. The canonical one is variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

Sample variance

$s^2$
The sample variance — average squared distance from the mean.
$(x_i - \bar{x})^2$
Squared deviation of observation $i$ from the mean. Squaring removes sign, so overshoots and undershoots don't cancel.
$n - 1$
Divisor. Not $n$! See the callout below for the reason.
$s$
The sample standard deviation: just $\sqrt{s^2}$. Same units as the original data, which is why you usually report it instead of variance.

Why square? If you just averaged $(x_i - \bar{x})$ you'd always get zero, because the positive and negative deviations cancel exactly (that's what being the mean means). Squaring keeps all deviations positive and also punishes large deviations more than small ones, which gives variance a clean connection to Gaussian distributions and least-squares optimization.

why n − 1, not n?

This is the famous "Bessel's correction". The story: you don't know the true population mean $\mu$, so you use the sample mean $\bar{x}$ as a stand-in. But $\bar{x}$ is pulled toward the data — by construction it sits right in the middle of your sample — so the deviations $(x_i - \bar{x})$ are a little smaller than the deviations from the true $\mu$ would be.

If you divided by $n$, you'd systematically under-estimate the variance. Dividing by $n-1$ exactly undoes the bias. The intuition: you used one "degree of freedom" on the data to compute $\bar{x}$, so only $n-1$ independent deviations remain.

For large $n$ it barely matters. For small $n$ it matters a lot. NumPy's np.var defaults to $n$; use ddof=1 to get the unbiased $n-1$ version.
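You can check the two conventions directly (the data here is an arbitrary toy example):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

biased = np.var(data)            # divides by n     (NumPy default, ddof=0)
unbiased = np.var(data, ddof=1)  # divides by n - 1 (Bessel's correction)

# The two differ by exactly the factor n / (n - 1).
n = len(data)
print(biased, unbiased, biased * n / (n - 1))
```

With only 8 points the gap is noticeable (4.0 vs ≈4.57); with 8000 points it would be invisible.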

3.3 Quantiles and the IQR

For skewed or heavy-tailed data, mean and variance can be misleading. Quantiles are robust alternatives. The $q$-th quantile is the value below which a fraction $q$ of the data lies. The 0.5 quantile is the median. The 0.25 and 0.75 quantiles are the first and third quartiles, $Q_1$ and $Q_3$. The interquartile range is:

$$\text{IQR} = Q_3 - Q_1$$

Interquartile range

$Q_1$
First quartile: the value such that 25% of the data is below it.
$Q_3$
Third quartile: the value such that 75% of the data is below it.
$\text{IQR}$
The middle 50% of the data, measured as a width. Robust to outliers — a single extreme point can't move $Q_1$ or $Q_3$ much, whereas it can move the mean by a lot.

Why you care. IQR is the foundation of the boxplot and the usual "outlier rule" of thumb: anything more than $1.5 \times \text{IQR}$ outside the box is flagged as unusual. If your data is long-tailed — response times, file sizes, income, token counts — report medians and IQRs, not means and standard deviations.
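Here's the quartile/IQR outlier rule in NumPy, on a made-up long-tailed sample of response times:

```python
import numpy as np

# Hypothetical response times (ms) with one long-tail straggler.
times = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 250])

q1, q3 = np.percentile(times, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual boxplot fences

outliers = times[(times < lo) | (times > hi)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lo}, {hi}), outliers={outliers}")
```

Note that the 250 ms straggler is flagged, yet it barely moved $Q_1$ and $Q_3$ — exactly the robustness the text promises.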

4. Sample vs population — the fundamental distinction

Here is the single most important idea in statistics. Get this right and most of the rest clicks into place.

Population. The complete set of things you care about — every possible user, every possible email, every possible experimental run. Usually you cannot measure all of it.

Sample. The finite subset you actually observed. This is your data.

Parameter. A number that describes the population. Usually Greek. Usually unknown. Example: $\mu$, the true average height of adults in a country.

Estimator. A function of your sample that approximates a parameter. Usually Latin with a hat. Usually computable. Example: $\bar{X}$, the average height of the 500 people you actually measured.

An estimator is itself a random variable — not a fixed number. Why? Because if you collected a different sample from the same population, you'd get a different value. Every time you sample, $\bar{X}$ jumps around. Across all possible samples of size $n$, $\bar{X}$ has its own distribution. We'll meet that distribution in the next section; it's the heart of almost everything.

4.1 Bias of an estimator

An estimator is unbiased if, on average across all possible samples, it hits the right answer:

$$\text{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$$

Bias of an estimator

$\hat{\theta}$
Your estimator — a function of the data that spits out a number. Example: $\bar{X}$ estimating $\mu$.
$\theta$
The true parameter you'd like to know. Unknown but fixed.
$\mathbb{E}[\hat{\theta}]$
The expected value of the estimator, averaged across all possible samples of the same size drawn from the same population.
$\text{Bias}$
The systematic error. Zero means the estimator is centered on the truth; positive means it systematically overshoots; negative means it undershoots.

Dartboard analogy. Imagine throwing many darts at a target. The bias is the offset from the bullseye of where your throws average out. You can have high bias with low variance (all darts clumped but to the left of the bullseye) or low bias with high variance (darts scattered everywhere but averaging to the bullseye). Statisticians want both low.

The sample mean $\bar{X}$ is an unbiased estimator of $\mu$. This is not an accident — it's the reason we prefer it. The sample variance with the $n-1$ divisor is also unbiased for $\sigma^2$. The sample variance with $n$ is biased (too small by a factor $(n-1)/n$), which is exactly why people correct it.

Unbiasedness is nice but it isn't everything. An unbiased estimator can still have enormous variance — it can be correct on average but wildly off on any single sample. You also want a consistent estimator: one whose value converges to the truth as $n$ grows. Formally, $\hat{\theta}_n \to \theta$ in probability as $n \to \infty$. The sample mean is both unbiased and consistent. In general, you care most about the estimator's mean squared error:

$$\text{MSE}(\hat{\theta}) = \mathbb{E}\!\left[(\hat{\theta} - \theta)^2\right] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta})$$

MSE of an estimator

$\text{MSE}(\hat{\theta})$
Expected squared distance between your estimator and the true parameter. The quantity you'd most like to make small.
$\text{Bias}(\hat{\theta})^2$
Squared systematic offset — zero for unbiased estimators, positive otherwise.
$\text{Var}(\hat{\theta})$
Sample-to-sample wobble of the estimator. Shrinks as $n$ grows (for a well-designed estimator).

Why this matters. An unbiased estimator with huge variance can be worse, in terms of MSE, than a slightly biased estimator with much smaller variance. This is the seed of the bias-variance tradeoff we'll see in section 12, and it's why "unbiasedness at any cost" is a bad policy.
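A small simulation makes the tradeoff concrete. For Gaussian data (population variance assumed to be 4 here, sample size 5), the biased $1/n$ variance estimator actually has a *lower* MSE than the unbiased $1/(n-1)$ one — its smaller variance more than pays for its bias:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0          # sigma^2 of the (hypothetical) population
n, trials = 5, 100_000  # small samples make the effect visible

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
ssd = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

est_unbiased = ssd / (n - 1)   # Bessel-corrected
est_biased = ssd / n           # MLE / plain average

for name, est in [("n-1", est_unbiased), ("n", est_biased)]:
    bias = est.mean() - true_var
    mse = ((est - true_var) ** 2).mean()
    print(f"divisor {name}: bias ≈ {bias:+.3f}, MSE ≈ {mse:.3f}")
```

The $n-1$ version is centered on the truth but wobbles more; the $n$ version undershoots systematically yet lands closer on average.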

5. Sampling distributions and the CLT

We just said $\bar{X}$ is a random variable — it takes different values across different samples. What distribution does it have?

This is called the sampling distribution of the mean. It tells you, over the space of all possible samples of size $n$, how $\bar{X}$ is distributed around the true $\mu$. Two facts about it are so useful they're worth tattooing on the inside of your eyelids.

5.1 The mean and variance of $\bar{X}$

No matter what distribution the data is drawn from, as long as $X_1, \dots, X_n$ are independent and identically distributed with mean $\mu$ and variance $\sigma^2$:

$$\mathbb{E}[\bar{X}] = \mu, \qquad \text{Var}[\bar{X}] = \frac{\sigma^2}{n}$$

Mean and variance of the sample mean

$\mathbb{E}[\bar{X}] = \mu$
On average across samples, $\bar{X}$ equals the true mean. This is the unbiasedness statement.
$\text{Var}[\bar{X}] = \sigma^2/n$
The variance of $\bar{X}$ shrinks as the sample gets bigger. Four times as many data points means one-fourth as much variance.
$\sigma^2$
Variance of a single observation. Fixed, doesn't depend on $n$.
$n$
Sample size. The $1/n$ shrinking factor is the reason bigger samples give more precise answers.

Why. Independent noise averages out. If you roll one die the result swings between 1 and 6. Average ten dice, the result is usually between 3 and 4. Average a thousand, it's pinned near 3.5. Statistics is almost entirely powered by this shrinking variance.
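The dice story, simulated. A single fair die has variance $35/12 \approx 2.92$; averaging $n$ of them should divide that by $n$:

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 50_000

for n in [1, 10, 100, 1000]:
    rolls = rng.integers(1, 7, size=(trials, n))   # fair six-sided dice
    means = rolls.mean(axis=1)
    print(f"n={n:5d}: Var(mean) ≈ {means.var():.4f}  (theory {35/12/n:.4f})")
```

Each tenfold increase in $n$ cuts the variance of the mean by a factor of ten, exactly as $\sigma^2/n$ predicts.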

5.2 Standard error vs standard deviation

The standard deviation of the sample mean has its own name:

$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

Standard error of the mean

$\text{SE}(\bar{X})$
Standard error: the standard deviation of the sampling distribution of $\bar{X}$.
$\sigma$
Population standard deviation (the spread of a single observation).
$\sqrt{n}$
Square root of sample size — the famous "inverse square root" shrinkage.

Standard deviation vs standard error. Don't confuse them! Standard deviation describes how spread out individual data points are. Standard error describes how spread out the mean estimate is. They're related by a $\sqrt{n}$, but they answer different questions. SD is about the data; SE is about how well you know the mean.

Why the $\sqrt{n}$, not $n$. Variance shrinks as $1/n$. Standard deviation is the square root of variance. So SE shrinks as $1/\sqrt{n}$. This is why doubling your sample size only improves your precision by $\sqrt{2} \approx 1.41\times$ — not by 2×. Want to halve the uncertainty? You need four times the data.

In practice you don't know $\sigma$. You estimate it from the sample and use $s/\sqrt{n}$. This is what NumPy, SciPy, and every statistics textbook actually compute.

5.3 The Central Limit Theorem

Here is the result that makes statistics work at all. Take any distribution with finite mean $\mu$ and finite variance $\sigma^2$ — any. Doesn't matter how weird it is. Skewed, discrete, bimodal, whatever. Draw $n$ independent samples from it. Compute the mean. Do that many times. Then:

$$\bar{X}_n \xrightarrow{d} \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{as} \quad n \to \infty$$

Central Limit Theorem (CLT)

$\bar{X}_n$
Sample mean of $n$ i.i.d. observations.
$\xrightarrow{d}$
"Converges in distribution". Means the distribution of $\bar{X}_n$ gets arbitrarily close to the Gaussian as $n$ grows, even though individual samples don't.
$\mathcal{N}(\mu, \sigma^2/n)$
Normal (Gaussian) distribution with mean $\mu$ and variance $\sigma^2/n$.
$n \to \infty$
Formally, the result is a limit. In practice, for many distributions, "$n \ge 30$" is plenty for the Gaussian approximation to be excellent.

Why this is astonishing. The underlying distribution could be the number of Twitter followers people have (massively skewed), or the time until a web request returns (heavy-tailed), or whether a coin comes up heads (discrete). The individual samples look nothing like a bell curve. But the mean of enough of them always does. Statistics gets to use Gaussian tools even when the data is not Gaussian, because the CLT guarantees that averages eventually are.

The CLT is why "standard error $= \sigma/\sqrt{n}$" is such a loadbearing formula. It's the width of the Gaussian that your estimate is drawn from. Whenever someone hands you a confidence interval or a p-value for a mean, they are almost always using a CLT approximation under the hood.

6. Interactive figure — sampling distribution narrowing

Here's the CLT made visible. The underlying population is a deliberately non-Gaussian mixture — a little skewed, a little lumpy, so you can't mistake it for a bell curve. We draw many samples of size $n$ from it and plot the histogram of the resulting sample means. As you crank $n$ up, two things happen:

  1. The distribution of $\bar{X}$ narrows — the standard error shrinks by $1/\sqrt{n}$.
  2. The distribution of $\bar{X}$ becomes Gaussian — even though the population is not.

Drag the slider. The orange bar histogram is the raw population (fixed). The blue histogram is the sampling distribution of the mean for the chosen $n$. The dashed curve is the Gaussian $\mathcal{N}(\mu, \sigma^2/n)$ that the CLT predicts.


Orange: raw population. Blue: sampling distribution of the mean. Dashed: CLT-predicted Gaussian.

Notice how even at $n = 5$ the blue histogram already looks mostly bell-shaped, and the dashed Gaussian is a decent fit. By $n = 30$ it's indistinguishable. By $n = 200$ the bell is tiny — nearly all the mass is within a tight window around the true $\mu$. This is what "$1/\sqrt{n}$ convergence" looks like. It's slow (square-root slow) but it's inevitable, for absolutely any reasonable underlying distribution.
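If you want to reproduce the effect numerically rather than visually, here's a minimal sketch using an exponential population (heavily right-skewed, mean and SD both 1) instead of the figure's mixture:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 20_000

# Exponential(1): decidedly non-Gaussian, mean = 1, sd = 1.
for n in [1, 5, 30, 200]:
    means = rng.exponential(1.0, size=(trials, n)).mean(axis=1)
    z = (means - means.mean()) / means.std()
    # Skewness of the sampling distribution should shrink toward 0 (Gaussian).
    print(f"n={n:3d}: SE ≈ {means.std():.3f} (theory {1/np.sqrt(n):.3f}), "
          f"skewness ≈ {(z**3).mean():+.2f}")
```

Two things happen at once, matching the two bullets above: the spread tracks $1/\sqrt{n}$, and the skewness melts away as the distribution turns Gaussian.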

7. Estimation — point and interval

So you have a sample and want to say something about a parameter. There are two basic flavors:

Point estimate. A single best-guess number, like reporting $\bar{x}$ as your estimate of $\mu$. Compact, but silent about uncertainty.

Interval estimate. A range of plausible values, like a 95% confidence interval. Less tidy than a single number, but it carries the uncertainty with it.

7.1 Building a confidence interval

The classical recipe for a 95% confidence interval for the mean, using the CLT:

$$\bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}}$$

95% confidence interval for the mean

$\bar{x}$
Your computed sample mean — the center of the interval.
$s$
Sample standard deviation of the data (unbiased version, $n-1$ divisor).
$s/\sqrt{n}$
Estimated standard error of the mean.
$1.96$
The 97.5th percentile of the standard normal. Chosen so that 95% of the Gaussian's mass is within $\pm 1.96$ standard deviations of zero.
"95%"
The coverage probability — not the probability that $\mu$ is in this specific interval. See below!

Where 1.96 comes from. Under the CLT, $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ is approximately standard normal. Ninety-five percent of a standard normal's probability lies within $\pm 1.96$. Invert this, and you get a two-sided interval around $\bar{x}$ that, over repeated sampling, captures the true $\mu$ about 95% of the time.

For small samples, where you don't trust the Gaussian approximation, you replace 1.96 with a slightly larger number from the Student's t-distribution with $n-1$ degrees of freedom. SciPy does this for you. For $n$ bigger than about 30, the difference is small.
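The whole recipe in SciPy, on synthetic data with a known true mean (so you can see the interval doing its job):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=2.0, size=25)   # toy sample, true mu = 10

xbar = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))        # estimated standard error

# t-based 95% CI; with n = 25, the t multiplier (~2.06) is slightly
# wider than the Gaussian 1.96.
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=xbar, scale=se)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

Swap the t quantile for 1.96 and the interval barely changes at this $n$; at $n = 5$ the difference would be substantial.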

The t-distribution is named after William Sealy Gosset, who worked at the Guinness brewery in Dublin in the early 1900s and needed a way to reason about the quality of small samples of hops. Guinness wouldn't let him publish under his own name so he used the pseudonym "Student". A hundred years later every undergraduate statistics class teaches "Student's t-test" and almost nobody remembers why it isn't called Gosset's. The moral of the story, to the extent there is one, is that a lot of statistics was invented by practical people trying to solve real problems on small budgets.

7.2 What a confidence interval really means (and doesn't)

famous confusion

A 95% confidence interval does not mean "there is a 95% probability that $\mu$ is in this interval." That's a Bayesian-flavored statement, and in the frequentist framework it doesn't even make sense — $\mu$ is a fixed number, not a random variable, so either it's in your interval or it isn't.

What "95% confidence" actually means: if you repeated the whole experiment many times, drawing a fresh sample and constructing a fresh interval each time, then 95% of those intervals would cover the true $\mu$. It is a property of the procedure, not of any particular interval you computed.

This sounds like hair-splitting. It isn't. It's the difference between "I'm 95% sure it's in there" (Bayesian credibility) and "this is the output of a procedure that is right 95% of the time across hypothetical repetitions" (frequentist coverage). Almost everyone — including most scientists using the tool — conflates them. Usually it doesn't hurt. Occasionally it does.

8. Maximum Likelihood Estimation

Confidence intervals are great when you're estimating a mean. What about more complicated parameters? The Maximum Likelihood Estimator (MLE) is the workhorse answer. It's a general recipe: given a model with parameter $\theta$, pick the $\theta$ that makes your observed data the most plausible.

8.1 The likelihood function

Suppose your data $x_1, \dots, x_n$ is drawn independently from a distribution $p(x \mid \theta)$. The likelihood is the same function with the roles of data and parameter swapped:

$$\mathcal{L}(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

Likelihood function

$\mathcal{L}(\theta)$
Likelihood. Treats the data as fixed (you already saw it) and asks: as a function of $\theta$, how plausible is each possible value?
$p(x_i \mid \theta)$
The probability (or density) the model assigns to observation $i$ when the parameter is $\theta$.
$\prod_{i=1}^{n}$
Product across all observations. Because the data is independent, the joint probability is the product of individual probabilities.

Crucial distinction. A probability is a function of data with $\theta$ fixed. A likelihood is a function of $\theta$ with data fixed. Same formula, different meaning. The MLE procedure picks the $\theta$ value that scores highest on this latter reading.

8.2 Log-likelihood

Products are awful to differentiate. Taking logs turns them into sums:

$$\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$

Log-likelihood

$\ell(\theta)$
Log-likelihood. Since $\log$ is monotone, maximizing $\ell$ is equivalent to maximizing $\mathcal{L}$.
$\sum_{i=1}^{n}$
Sum over observations. This is cheaper and numerically better behaved than multiplying $n$ small probabilities together.
$\log p(x_i \mid \theta)$
Log-probability of a single observation under the model with parameter $\theta$.

Why logs. If you multiply a thousand probabilities each around $10^{-4}$, you get $10^{-4000}$, which underflows any floating-point number. Logs turn that into $-9210$ or so, which is fine. Plus, derivatives of sums are easier than derivatives of products. Every practical ML loss function is secretly a negative log-likelihood.
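The underflow claim is easy to verify:

```python
import numpy as np

# A thousand probabilities around 1e-4: their product underflows to 0.0,
# but the sum of their logs is a perfectly ordinary float.
probs = np.full(1000, 1e-4)

print(np.prod(probs))          # 0.0  (true value 1e-4000, below float range)
print(np.sum(np.log(probs)))   # ≈ -9210.34
```

Once the product hits 0.0, all information about $\theta$ is gone; the log-likelihood keeps every bit of it.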

8.3 The MLE recipe

  1. Write down the log-likelihood $\ell(\theta)$ for your model.
  2. Take $\frac{d\ell}{d\theta}$.
  3. Set it to zero and solve for $\theta$. That's your $\hat{\theta}_{\text{MLE}}$.
  4. (Optional) Check the second derivative is negative so you found a max, not a min.

8.4 Worked example — Gaussian MLE

Suppose $x_1, \dots, x_n \sim \mathcal{N}(\mu, \sigma^2)$ and you want MLEs for $\mu$ and $\sigma^2$. The Gaussian density is:

$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Gaussian density

$p(x \mid \mu, \sigma^2)$
Probability density at $x$ given mean $\mu$ and variance $\sigma^2$.
$\mu$
Mean (center of the bell).
$\sigma^2$
Variance (spread of the bell).
$\frac{1}{\sqrt{2\pi\sigma^2}}$
Normalizing constant that makes the density integrate to 1.
$\exp(-(x-\mu)^2/(2\sigma^2))$
The bell shape itself — drops off exponentially fast as you move away from $\mu$.

The log-likelihood, after summing over $n$ observations, simplifies to:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Gaussian log-likelihood

$-\frac{n}{2}\log(2\pi\sigma^2)$
The piece coming from the normalizing constant $1/\sqrt{2\pi\sigma^2}$. It depends on $\sigma^2$ but not on $\mu$.
$-\frac{1}{2\sigma^2}\sum (x_i - \mu)^2$
The "sum of squared deviations" piece from the Gaussian exponent. Notice it looks exactly like a least-squares loss — that's not a coincidence.

Why this is beautiful. Taking $\frac{\partial \ell}{\partial \mu} = 0$ gives you $\hat{\mu} = \bar{x}$, the sample mean. Taking $\frac{\partial \ell}{\partial \sigma^2} = 0$ gives you $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ — the sample variance with the $n$ divisor, not $n-1$. So the MLE for variance is actually slightly biased. You knew MLE was a powerful trick; here it also reveals that the "natural" quantities we've been computing all along are maximum-likelihood estimates under a Gaussian assumption.
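A numerical cross-check of the derivation, using a generic optimizer on synthetic Gaussian data (the parametrization via $\log\sigma$ is just a convenience to keep $\sigma > 0$):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=200)   # toy data: true mu = 5, sigma = 2

def neg_log_lik(params):
    mu, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    # Negative of the Gaussian log-likelihood derived above.
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) \
           + ((x - mu) ** 2).sum() / (2 * sigma2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])

print(mu_hat, x.mean())            # numerical optimum matches the sample mean
print(sigma2_hat, x.var(ddof=0))   # and the n-divisor variance, not n-1
```

The optimizer lands on exactly the closed-form answers: $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2$ with the $n$ divisor.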

9. Hypothesis testing

Estimation asks "what is $\theta$?". Testing asks "is $\theta$ equal to some specific value, or not?". This is the framework behind A/B testing, drug trials, and "is the model improvement statistically significant?" debates.

9.1 The basic setup

You pick two hypotheses about the world:

The null hypothesis $H_0$. The boring default: no effect, no difference, the coin is fair, the two groups share a mean.

The alternative hypothesis $H_1$. The interesting claim: there is an effect, the difference is real.

You then compute a test statistic from your data — a number that would be small if the null were true and large if the alternative were. Then you ask: how likely is a test statistic this extreme, if $H_0$ were the actual truth?

$$p\text{-value} = \Pr\!\big(\,T(X) \ge T(x_{\text{obs}}) \,\big|\, H_0\,\big)$$

p-value

$T(X)$
The test statistic applied to a hypothetical new dataset $X$.
$T(x_{\text{obs}})$
The test statistic applied to the data you actually collected. A fixed number once you've run the experiment.
$\Pr(\cdot \mid H_0)$
Probability computed under the assumption that the null hypothesis is true.
p-value
The probability, assuming $H_0$, of getting a test statistic at least as extreme as the one you observed. Small p-values mean "my data would be surprising if the null were true".

Courtroom analogy. The null is "the defendant is innocent." The p-value is "if they were innocent, how often would we expect evidence this damning?" A tiny p-value says "evidence this bad almost never happens to innocent defendants — so we doubt they are innocent." It does not say "the probability they're innocent is 5%". That's not what's being computed.

You pre-commit to a significance level $\alpha$, usually $0.05$, and reject $H_0$ if $p < \alpha$. The $\alpha$ is the rate at which you're willing to wrongly reject a true null — more on that in a second.

9.2 t-test worked example

The most famous hypothesis test: are two sample means different? Say group A has $n_A$ observations with mean $\bar{x}_A$ and variance $s_A^2$, and similarly for group B. The two-sample t-statistic is:

$$t = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$$

Two-sample t-statistic (Welch's form)

$\bar{x}_A, \bar{x}_B$
Sample means of the two groups.
$s_A^2, s_B^2$
Sample variances.
$n_A, n_B$
Sample sizes.
denominator
Estimated standard error of the difference of the two means. It combines the two groups' standard errors in quadrature because variances add under independence.
$t$
Number of standard errors separating the two sample means. Under $H_0$ (the two populations have the same mean), $t$ is approximately distributed as Student's t with some number of degrees of freedom, and large $|t|$ makes $H_0$ look unlikely.

What's happening. You're dividing the observed difference in means by the scale at which you'd expect the difference to randomly wobble. A $t$ of 0.5 means the effect is half a standard error — totally consistent with noise. A $t$ of 5 means the effect is five standard errors — possible under noise but spectacularly unlikely.
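In SciPy, Welch's test is one call (the groups below are synthetic, with a true mean gap of 4 units baked in):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
a = rng.normal(100.0, 10.0, size=80)   # group A
b = rng.normal(104.0, 12.0, size=90)   # group B: true mean 4 units higher

# equal_var=False requests Welch's form — the statistic shown above,
# with no assumption that the two groups share a variance.
t, p = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```

You can reproduce $t$ by hand from the formula — difference of means divided by the quadrature-combined standard errors — and it matches SciPy's output exactly.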

9.3 Type I and Type II errors, and power

Testing is a binary decision based on uncertain data, so two kinds of error are possible:

Type I error (false positive). Rejecting $H_0$ when it is actually true. You pre-committed to tolerating these at rate $\alpha$.

Type II error (false negative). Failing to reject $H_0$ when a real effect exists. Its probability is written $\beta$.

The power of a test is $1 - \beta$ — the probability of correctly detecting a real effect. It depends on the true size of the effect, the sample size, the noise level, and your choice of $\alpha$.

pre-register your sample size

Power analysis is what tells you how many samples you need before running the experiment. Rule of thumb: for a two-sample t-test with $\alpha = 0.05$ and 80% power to detect an effect of size 0.5 (half a standard deviation), you need about 64 per group. For small effects (0.1), you need ~1600 per group.

If your experiment is underpowered and you detect nothing, that's not evidence the effect is zero — it's evidence your experiment couldn't have spotted it either way.
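The rule of thumb above can be computed from first principles with the standard normal-approximation formula (a sketch, not a full power library; the exact t-based answer is one or two samples larger):

```python
import numpy as np
from scipy import stats

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided
    two-sample t-test. effect_size is Cohen's d: the mean difference
    measured in units of the standard deviation."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = stats.norm.ppf(power)            # 0.84 for 80% power
    return int(np.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2))

print(n_per_group(0.5))   # 63 — the t correction nudges this to ~64
print(n_per_group(0.1))   # 1570 — small effects are expensive
```

Note the $1/d^2$ scaling: halving the effect size you want to detect quadruples the sample you need.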

10. The p-value debate

P-values get abused more than probably any other concept in statistics. A brief list of things a p-value is not:

common p-value mistakes
  • It is not the probability that $H_0$ is true. Frequentist p-values don't assign probabilities to hypotheses at all. $H_0$ is either true or not; no probability involved.
  • It is not the probability you are wrong to reject $H_0$. That would conflate the question "given the data, what's $\Pr(H_0)$?" with the question "given $H_0$, what's $\Pr(\text{data})$?" — and those are different by Bayes' rule.
  • $p < 0.05$ does not mean the effect is large. With enough data, trivially small effects trigger tiny p-values. Always report the effect size, not just significance.
  • $p > 0.05$ does not mean no effect. It means your experiment didn't have enough power to detect the effect. Absence of evidence is not evidence of absence.
  • Comparing two p-values is meaningless. "$p = 0.04$ in group A and $p = 0.06$ in group B" does not mean A beats B. They're on different noise scales.

Since about 2005 a loose coalition of statisticians (Ioannidis' "Why Most Published Research Findings Are False", the ASA's 2016 statement on p-values, the replication crisis in psychology and biomedicine) has been pushing for researchers to de-emphasize p-values in favor of effect sizes, confidence intervals, and pre-registration. The p-value isn't broken, but treating $p < 0.05$ as a magic passing grade is. The recommendation: use p-values as a single piece of evidence among many, never as the whole story, and always report effect sizes and uncertainty intervals alongside.

11. Linear regression

Almost every interesting statistical question — effect of a drug dose on blood pressure, relationship between study hours and grades, the slope of the Chinchilla loss curve — can be framed as "fit a line". Linear regression is the simplest and oldest version of that, and absolutely fundamental.

11.1 The model

You have paired data $(x_1, y_1), \dots, (x_n, y_n)$. You posit the model:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)$$

Simple linear regression model

$y_i$
The $i$-th response (dependent variable) — the thing you're trying to predict.
$x_i$
The $i$-th predictor (independent variable, feature).
$\beta_0$
Intercept — the value of $y$ when $x = 0$.
$\beta_1$
Slope — how much $y$ changes per unit of $x$.
$\varepsilon_i$
Error / noise term. Assumed Gaussian, mean zero, variance $\sigma^2$, independent across observations.
$\mathcal{N}(0, \sigma^2)$
Normal distribution with mean zero and variance $\sigma^2$. This encodes the idealized assumption that the noise is Gaussian and homoscedastic (same variance everywhere).

The story. Nature puts a line through the world with intercept $\beta_0$ and slope $\beta_1$, and then jitters every observation off the line by a Gaussian noise amount $\varepsilon_i$. Your job is to find the line, given that you only see the noisy $(x_i, y_i)$ pairs.

11.2 Least-squares loss

Least squares chooses the line that minimizes the sum of squared residuals:

$$\text{SSE}(\beta_0, \beta_1) = \sum_{i=1}^{n}\big(y_i - \beta_0 - \beta_1 x_i\big)^2$$

Sum of squared errors

$\text{SSE}$
Sum of squared errors — the total squared gap between the predicted line and the data.
$y_i - \beta_0 - \beta_1 x_i$
Residual for observation $i$ — how much the data point misses the line by.
$(\cdot)^2$
Squared so overshoots and undershoots don't cancel, and so large residuals are penalized disproportionately.

Taking derivatives with respect to $\beta_0$ and $\beta_1$ and setting them to zero gives the normal equations, whose solution is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Least-squares solution (closed form)

$\hat{\beta}_1$
Estimated slope. The numerator is the sample covariance of $x$ and $y$; the denominator is the sample variance of $x$ (up to a factor of $n-1$ that cancels).
$\hat{\beta}_0$
Estimated intercept. Once the slope is pinned down, the intercept is chosen so that the fitted line passes through $(\bar{x}, \bar{y})$.
$(x_i - \bar{x})(y_i - \bar{y})$
Centered product — positive when $x$ and $y$ are both above or both below their means, negative when they're on opposite sides. The sum is the sample covariance (times $n-1$).

Intuition. The slope is "average co-movement of $x$ and $y$, divided by how much $x$ varies on its own." If $x$ and $y$ swing together, the numerator is big and the slope is big. If $x$ swings but $y$ doesn't, the slope is small.

11.3 Least squares is MLE under Gaussian noise

Here's a beautiful connection. Under the Gaussian noise model above, the log-likelihood of the data is (up to constants not depending on $\beta$):

$$\ell(\beta_0, \beta_1) = -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 + \text{const}$$

Linear regression log-likelihood

$\ell$
Log-likelihood as a function of the slope and intercept.
$-\frac{1}{2\sigma^2}\sum(y_i - \beta_0 - \beta_1 x_i)^2$
Up to a minus sign and a constant, this is exactly the SSE. Maximizing this log-likelihood is the same as minimizing SSE.

So. Least-squares regression is not an arbitrary choice. It's the maximum-likelihood estimator when you assume the noise is Gaussian with constant variance. If the noise were Laplace-distributed, MLE would give you sum-of-absolute-errors (L1 regression). Whenever you see a loss function, ask "what noise model does this correspond to?" and the answer is usually enlightening.
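To see the loss-function-as-noise-model correspondence in action, here's a sketch contrasting the closed-form L2 fit with an L1 fit on data containing a few gross outliers. The data and outlier placement are invented, and since L1 regression has no closed form, it's minimized numerically with scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.shape)
y[-5:] += 25.0  # a cluster of gross outliers at the high end

# L2 fit: closed form -- the MLE under Gaussian noise.
b1_l2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_l2 = y.mean() - b1_l2 * x.mean()

# L1 fit: minimize the sum of absolute residuals -- the MLE under
# Laplace noise. No closed form, so iterate.
def sae(beta):
    return np.sum(np.abs(y - beta[0] - beta[1] * x))

res = minimize(sae, x0=[0.0, 0.0], method="Nelder-Mead",
               options={"xatol": 1e-6, "fatol": 1e-6, "maxiter": 2000})
b0_l1, b1_l1 = res.x

print(f"true slope 2.0 | L2 slope = {b1_l2:.2f} (dragged by outliers) "
      f"| L1 slope = {b1_l1:.2f} (robust)")
```

Squaring amplifies large residuals, so the L2 fit chases the outliers; absolute error doesn't, which is exactly the heavy-tailed-noise assumption Laplace encodes.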

12. Bias-variance tradeoff

Fitting a line to data is a model with two parameters. What if your data actually curves? You could fit a parabola. What if it wiggles? A 10th-degree polynomial. What if it really wiggles? A 100th-degree polynomial. Each increase in complexity reduces training error — but past some point, test error starts getting worse. This is the bias-variance tradeoff, the most important idea in machine learning after "gradient descent".

Start with the mean-squared error of an estimator $\hat{f}(x)$ of a true function $f(x)$, averaged over the noise and over the randomness in the training data. A bit of algebra gives the famous decomposition:

$$\mathbb{E}\!\left[(y - \hat{f}(x))^2\right] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$

Bias-variance decomposition

$\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]$
Expected squared prediction error at a new test point $x$, averaged over everything random (noise, training set).
$f(x)$
The true underlying function. What you wish you knew.
$\hat{f}(x)$
Your fitted model's prediction at $x$. A random variable because it depends on the training sample.
$\mathbb{E}[\hat{f}(x)] - f(x)$
Bias: the systematic error. How much your model's average prediction misses the truth by. Too-simple models (a straight line for a parabola) have large bias.
$\text{Var}[\hat{f}(x)]$
Variance: how much $\hat{f}(x)$ jumps around as the training set changes. Too-complex models (a 100-degree polynomial) have huge variance.
$\sigma^2$
Irreducible noise in $y$. No model can beat this floor.

Intuition. A model that is too rigid doesn't have room to fit the truth — bias is high. A model that is too flexible fits not just the truth but the noise — variance is high, because a different training sample would pick up different noise and give a wildly different fit. Generalization error is the sum of both. You want to sit at the sweet spot where they balance.

"Overfitting" in machine learning is this decomposition talking. A neural network with billions of parameters fit to a small dataset has near-zero bias (it can interpolate anything) but astronomical variance (it memorizes noise). Regularization, dropout, data augmentation, early stopping, weight decay — all of these are techniques that push you back down toward a higher-bias-but-lower-variance regime, trading some expressiveness for generalization.

A curiosity from the deep-learning era: the classical U-shape of test error versus model complexity (high bias on the left, high variance on the right, sweet spot in the middle) has a second descent for modern overparameterized networks. Fit a small network to MNIST and you see the textbook shape. Fit a very large one and the test error drops again past the interpolation threshold. This "double descent" phenomenon (Belkin et al., 2019) is an active research area and a reminder that bias-variance intuition is load-bearing but not the whole story. In the strongly overparameterized regime, implicit regularization from the optimizer itself does some of the work.
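The decomposition can be checked by simulation. This sketch (a toy setup: true function $\sin x$, polynomial fits of degrees 1 and 7, many re-drawn training sets) measures the bias and variance of the fitted value at a single test point:

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                 # the "true function" for this toy experiment
sigma = 0.3                # noise standard deviation
x_train = np.linspace(0, np.pi, 15)
x_test = 1.0               # measure bias and variance at this single point
n_trials = 2000

results = {}
for degree in [1, 7]:
    preds = np.empty(n_trials)
    for i in range(n_trials):
        # Redraw the training set: same x grid, fresh noise.
        y_train = f(x_train) + rng.normal(0, sigma, size=x_train.shape)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        preds[i] = np.polyval(coeffs, x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2
    variance = preds.var()
    results[degree] = (bias2, variance)
    print(f"degree {degree}: bias^2 = {bias2:.4f}   variance = {variance:.4f}")

# Degree 1 is too rigid: high bias, low variance. Degree 7 is flexible
# enough to capture sin but pays for it in variance.
```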

13. Bayesian vs frequentist — the two schools

Everything above is the frequentist school. It treats parameters as fixed-but-unknown and talks about probabilities of data given parameters. The Bayesian school goes the other way: parameters are random variables with their own distributions, and you compute probabilities of parameters given data via Bayes' rule:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

Bayes' rule for inference

$p(\theta \mid x)$
Posterior distribution — your updated belief about $\theta$ after seeing the data.
$p(x \mid \theta)$
Likelihood — the same likelihood function from the MLE section.
$p(\theta)$
Prior — your belief about $\theta$ before seeing any data. This is the part frequentists object to, because where does it come from?
$p(x)$
Marginal likelihood / evidence. A normalizing constant. Often intractable in practice, which is why Bayesian computation is hard.

Upshot. Bayesian statistics is philosophically clean — you get actual probabilities on parameters, credible intervals that mean exactly what you'd expect ("95% probability $\theta$ is in here"), and a principled way to incorporate prior knowledge. The cost is computational complexity and the ambiguity of picking a prior.

Informally: frequentist methods are sharper, cheaper, and good defaults when data is abundant and you'd rather not commit to a prior. Bayesian methods shine when data is scarce, when you genuinely do have prior knowledge to inject, and when you want to marginalize uncertainty through a complex model. Modern ML is a curious mix — stochastic gradient descent is frequentist in spirit, but ensembling, dropout, and Bayesian neural networks import Bayesian ideas whenever they help. Pick the tool that gets you an honest answer, not the tribe.

One last fact worth knowing. In the limit of lots of data, Bayesian and frequentist answers converge. The posterior concentrates around the maximum-likelihood estimate, and credible intervals become numerically indistinguishable from confidence intervals. The two schools argue loudest precisely where it matters most — in the small-data regime, where your prior (or lack of it) is doing real work. "Big data" mostly makes the philosophy moot; "small data" brings it roaring back.

a useful frame Frequentist statistics asks: "Over hypothetical repetitions of this experiment, how would my procedure behave?" Bayesian statistics asks: "Given what I've already seen, what should I believe?" Both are legitimate questions. Neither is the "true" meaning of probability. If anyone tries to sell you a monopoly on rigor, they're selling you the tribe, not the math.
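Here's a small worked Bayesian example — a sketch using the Beta-Binomial conjugate pair, which isn't covered in detail above. Estimating a coin's bias with a Beta prior keeps the posterior in closed form, and you can watch it concentrate around the MLE as data accumulates, exactly the large-data convergence just described:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
theta_true = 0.7            # the coin's true heads probability
a0, b0 = 2.0, 2.0           # Beta(2, 2) prior: a mild belief in fairness

for n in [10, 100, 10_000]:
    heads = rng.binomial(n, theta_true)
    # Conjugate update: posterior is Beta(a0 + heads, b0 + tails).
    post = stats.beta(a0 + heads, b0 + n - heads)
    mle = heads / n
    lo, hi = post.ppf([0.025, 0.975])   # 95% credible interval
    print(f"n = {n:>6}: MLE = {mle:.3f}   posterior mean = {post.mean():.3f}"
          f"   95% credible = [{lo:.3f}, {hi:.3f}]")

# As n grows the posterior mean approaches the MLE and the credible
# interval narrows: the prior's influence washes out.
```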

14. Code — NumPy and SciPy

Here's a working tour of the page's main operations: summary statistics, a line fit by hand and with numpy.polyfit, a confidence interval, a t-test, and a bootstrap.

statistics · NumPy / SciPy
import numpy as np

# Dataset: response times in ms from a web service.
rt = np.array([112, 98, 131, 104, 120, 89, 140, 115, 102, 155])

# Measures of center
mean   = np.mean(rt)
median = np.median(rt)

# Measures of spread. Use ddof=1 for the unbiased (n-1) estimator.
var_n1 = np.var(rt, ddof=1)
std_n1 = np.std(rt, ddof=1)

# Quantiles
q1, q3 = np.percentile(rt, [25, 75])
iqr    = q3 - q1

print(f"n        = {len(rt)}")
print(f"mean     = {mean:.2f} ms")
print(f"median   = {median:.2f} ms")
print(f"std (s)  = {std_n1:.2f} ms")
print(f"IQR      = {iqr:.2f} ms")

# Standard error of the mean
se = std_n1 / np.sqrt(len(rt))
print(f"SE(mean) = {se:.2f} ms")
import numpy as np

# Fake data: y = 2.3 x + 1.5 + noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.3 * x + 1.5 + rng.normal(0, 1.2, size=x.shape)

# Method 1: normal equations by hand
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar
print(f"hand:    slope={beta1:.3f}  intercept={beta0:.3f}")

# Method 2: numpy.polyfit (same answer, up to rounding)
b1, b0 = np.polyfit(x, y, deg=1)
print(f"polyfit: slope={b1:.3f}  intercept={b0:.3f}")

# Predict and report R^2
y_hat = beta0 + beta1 * x
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y_bar) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups with different means (treatment vs control)
control   = rng.normal(loc=100, scale=15, size=50)
treatment = rng.normal(loc=107, scale=15, size=50)

# 95% CI for the control mean, using the t-distribution.
m  = control.mean()
s  = control.std(ddof=1)
n  = len(control)
se = s / np.sqrt(n)
# Two-sided 95% t critical value with n-1 dof:
tcrit = stats.t.ppf(0.975, df=n - 1)
ci_low  = m - tcrit * se
ci_high = m + tcrit * se
print(f"control mean = {m:.2f}  95% CI = [{ci_low:.2f}, {ci_high:.2f}]")

# Welch's two-sample t-test: are the means different?
t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")
if p_val < 0.05:
    print("reject H0 at alpha=0.05 — groups differ")
else:
    print("fail to reject H0 — no detected difference")
import numpy as np

# Bootstrap: empirical sampling distribution of any statistic,
# by resampling the data with replacement.
rng = np.random.default_rng(7)
data = rng.gamma(shape=2.0, scale=3.0, size=200)  # skewed!

def bootstrap_ci(data, stat_fn, B=5000, alpha=0.05):
    n = len(data)
    boots = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)
        boots[b] = stat_fn(data[idx])
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return stat_fn(data), lo, hi

est, lo, hi = bootstrap_ci(data, np.median)
print(f"median = {est:.3f}  95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")

# Bootstrap is magic: it works for any statistic, including ones
# with no closed-form sampling distribution (median, quantiles,
# correlation, ratio of two means). When you're unsure, bootstrap.

15. Why this all matters for ML

Machine learning is applied statistics with extra compute. Every standard loss function in deep learning is a disguised maximum-likelihood estimator under some noise model: squared error is the Gaussian log-likelihood from section 11.3, and cross-entropy is the log-likelihood of a categorical distribution over classes (or tokens).
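Two of those disguises, verified numerically (a standalone NumPy sketch, no deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Regression: MSE is the Gaussian negative log-likelihood, up to an
#    affine transform (here with sigma = 1).
y = rng.normal(size=100)
y_hat = y + rng.normal(0, 0.1, size=100)   # pretend model predictions
mse = np.mean((y - y_hat) ** 2)
gauss_nll = np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - y_hat) ** 2)
assert np.isclose(gauss_nll, 0.5 * mse + 0.5 * np.log(2 * np.pi))

# 2) Classification: cross-entropy IS the categorical negative
#    log-likelihood of the true labels, exactly.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])   # model's predicted class probabilities
labels = np.array([0, 1])             # true classes
cross_entropy = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
print(f"cross-entropy = {cross_entropy:.4f}")   # -(ln 0.7 + ln 0.8) / 2
```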

"Overfitting" is also a statistics word in disguise. A neural network that perfectly fits 10,000 random-label MNIST examples has zero training error but test accuracy at chance. In the bias-variance language, you've driven bias to zero at the cost of colossal variance. Regularization, data augmentation, and early stopping are all techniques that deliberately add bias to reduce variance, exactly the tradeoff the decomposition suggested.

When you read about "held-out validation sets", "k-fold cross-validation", "bootstrap ensembling" — you are reading about techniques for estimating the variance of an estimator without being able to re-sample from the true population. You only have one dataset, so you simulate "drawing different samples" by resampling or splitting what you have. This is the frequentist imagination applied to practical ML: pretend you could rerun the experiment, estimate the answer's wobble from the rerun, and report that wobble honestly. Every "error bar" you've ever seen on a benchmark chart is an application of this idea.

One more connection worth calling out. When you hear the phrase "statistical significance" used in an ML paper — "Model A beats Model B, $p < 0.01$" — that is a t-test (or more often a bootstrap test) over random seeds or evaluation splits. Treating each seed as an independent draw and comparing the distribution of scores gives you exactly the hypothesis-testing framework from section 9. A lot of ML papers don't do this; they just report one number per model and hope. As datasets plateau and gains get smaller, that's becoming untenable, and the field is finally learning to report uncertainty.
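Concretely, a seed-level comparison looks like this (a sketch — the accuracy numbers are invented, and in practice each entry would be one training run per seed):

```python
import numpy as np
from scipy import stats

# Invented benchmark accuracies: one run per random seed, 8 seeds per model.
model_a = np.array([71.2, 70.8, 71.9, 70.5, 71.4, 71.1, 70.9, 71.6])
model_b = np.array([70.9, 70.4, 71.0, 70.2, 70.8, 70.6, 70.5, 71.1])

# Paired t-test: each seed yields one score per model, so pair them.
t, p = stats.ttest_rel(model_a, model_b)
print(f"mean A = {model_a.mean():.2f}, mean B = {model_b.mean():.2f}")
print(f"paired t = {t:.2f}, p = {p:.4f}")
# Report the effect size (the mean gap) alongside the p-value,
# not just "significant".
```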

Once you see this, the foundation models page starts to read differently: the Kaplan and Chinchilla scaling laws are statements about how training loss (a log-likelihood average) improves with sample size $D$ and parameter count $N$. The irreducible $L_\infty$ floor is the entropy of the data itself — the $\sigma^2$ of the bias-variance decomposition. Every scaling curve is a picture of "how fast does the variance of my MLE shrink relative to the bias I'm leaving on the table?"

See also — math

foundations

Probability — the forward direction, model → data. The calculus of randomness.

Calculus — derivatives, which you need for maximizing log-likelihoods.

Linear algebra — multivariate regression, covariance matrices, and least squares live here.

Numerical analysis — when closed forms fail, you iterate. QR, SVD, and conditioning matter in practice.

See also — AI/ML

applications

Foundation models — scaling laws as sampling-distribution narrowing writ large.

AI safety — why statistical reasoning about evals and power analyses is essential for capability testing.

See also — CS

algorithms & optimization

Optimization — gradient descent, the workhorse for finding MLEs when closed forms don't exist.

16. Summary

  • Statistics inverts probability: given data, infer the generating process — and always report an estimate together with its uncertainty.
  • Least squares is maximum likelihood under Gaussian noise; every loss function encodes a noise model.
  • Generalization error decomposes into $\text{bias}^2$, variance, and irreducible noise; regularization trades expressiveness for generalization.
  • A p-value is the probability of data at least this extreme given $H_0$ — not the probability that $H_0$ is true. Report effect sizes alongside.
  • Frequentists treat parameters as fixed and data as random; Bayesians put distributions on parameters. With enough data, the two converge.

Further reading

  • Wasserman, Larry (2004) — All of Statistics: A Concise Course in Statistical Inference. The fastest path through the frequentist toolkit for mathematically comfortable readers.
  • Casella & Berger (2002) — Statistical Inference (2nd ed.). The classical graduate-level text. Thorough, rigorous, and a reference you'll keep on the shelf.
  • Gelman, Carlin, Stern, Dunson, Vehtari & Rubin (2013) — Bayesian Data Analysis (3rd ed., "BDA3"). The definitive Bayesian treatment with tons of worked examples.
  • Efron & Hastie (2016) — Computer Age Statistical Inference. Brings the classical ideas forward to the era of resampling, boosting, and neural nets.
  • Ioannidis, J. P. A. (2005) — Why Most Published Research Findings Are False. The article that kicked off the p-value / replication debate.
  • ASA Statement on p-values (Wasserstein & Lazar, 2016) — the American Statistical Association's official guidance on interpreting p-values correctly.
NEXT UP
→ Foundation Models

You just learned that every ML loss function is a disguised maximum likelihood estimator. Now see what happens when you crank that MLE up to a hundred billion parameters and a trillion tokens: the scaling laws of modern foundation models.