Calculus

Two ancient questions — "how steep is this curve right here?" and "what's the area under it?" — turn out to be the same question in disguise. Calculus is the machinery that exposes the trick. Once you see it, derivatives and integrals stop feeling like random rules and start feeling inevitable.

Prereq: algebra (functions, exponents, a little trig) · Read time: ~35 min · Interactive figures: 1 · Code: NumPy, Python

1. Why calculus exists

Imagine you're driving a car. The speedometer reads 37 mph. What does that number actually mean? You weren't going 37 mph for an hour — maybe you were at 37 mph for a single instant. But "speed during an instant" sounds nonsensical. In an instant, you don't move anywhere. Distance travelled divided by zero time is 0/0, which is undefined.

Or: imagine a plot of land shaped like a pond. You want to know its area. Straight-sided fields are easy — length times width. But the pond's edge curves in and out. Every formula you know assumes straight edges. How do you handle the curve?

Both of these are calculus problems. The first is the tangent problem (how fast is something changing at one precise moment?). The second is the area problem (how much stuff is under a wiggly curve?). For two thousand years mathematicians attacked them piecemeal — Archimedes nearly cracked the area problem with his method of exhaustion around 250 BCE — but nobody had a general technique.

Then in the 1660s and 1670s, two people independently figured out the whole thing. Isaac Newton in Cambridge was trying to describe the motion of planets. Gottfried Wilhelm Leibniz in Germany was trying to formalize how tiny quantities combine. They arrived at the same idea from different directions, using different notation. Newton called it "the method of fluxions." Leibniz called it calculus differentialis and calculus integralis. An ugly priority dispute followed (we'll use Leibniz's notation because it's better), but the mathematics was the same. They had discovered that the tangent problem and the area problem are secretly the same problem, and there's a mechanical procedure for solving either one.

THE PUNCHLINE

Differentiation finds instantaneous rates of change. Integration finds accumulated totals. The Fundamental Theorem of Calculus says these two operations are inverses of each other, the way addition and subtraction are inverses. Once you believe that, the whole subject falls into place.

Why should you, a person building software in 2026, care? Three reasons, in order of how directly you'll touch them: gradient-based machine learning (backpropagation is the chain rule, applied recursively), numerical methods (Newton's method, integration routines, anything with "solver" in its name), and simply being able to read technical material that talks about rates of change.

This page will teach you calculus from zero. You do not need to remember anything from high school. You do need to be comfortable with functions (the rule "f takes a number x in and gives a number out"), exponents ($x^2$, $x^3$), and the idea that a graph plots $y = f(x)$ in the plane. That's it.

2. Vocabulary cheat sheet

These are the symbols you'll see repeatedly. Glance at the list now; the sections below define each one properly.

$f(x)$ ("f of x"): A function $f$ applied to input $x$; gives a number out.
$\lim_{x \to a}$ ("limit as x approaches a"): What value the expression gets arbitrarily close to as $x$ gets close to $a$.
$f'(x)$ ("f prime of x"): The derivative — instantaneous rate of change of $f$ at $x$. A new function.
$\dfrac{df}{dx}$ ("dee f dee x"): Same as $f'(x)$. Leibniz notation. The ratio of a tiny change in $f$ to the tiny change in $x$ that caused it.
$\Delta x$ ("delta x"): A finite (not infinitesimal) change in $x$. Capital $\Delta$ is Greek for "difference."
$dx$ ("dee x"): An infinitesimal change in $x$ — a stand-in for "a change so small we're about to take a limit."
$\int f(x)\, dx$ ("integral of f of x dee x"): The area under the curve $y = f(x)$ — or, equivalently, an antiderivative of $f$.
$\sum_{i=1}^n$ ("sum from i equals 1 to n"): Add up the terms that follow, plugging in $i = 1, 2, \dots, n$.
$\epsilon, \delta$ ("epsilon, delta"): Greek letters standing for "a tiny positive number." Used in the formal definition of limits.
$f \circ g$ ("f composed with g"): The function $x \mapsto f(g(x))$ — first do $g$, then do $f$ on the result.

One warning before we start: notation in calculus is historically messy. You will see $f'(x)$, $\dfrac{df}{dx}$, $\dot{f}$, and $Df$ all used interchangeably to mean exactly the same thing. Don't let the four different spellings throw you — they're all saying "the derivative of $f$."

3. Limits — the one idea everything else is built on

A limit is a way of asking: "if I plug numbers into this expression that are closer and closer to $a$, what value does the expression get closer and closer to?" It's a sneaky question because you're not asking what happens when $x$ equals $a$. You're asking what happens on the approach.

Limit (informal). We write $\lim_{x \to a} f(x) = L$ to mean: as $x$ takes values arbitrarily close to $a$ (but not equal to $a$), the value $f(x)$ gets arbitrarily close to $L$. The value $f(a)$ itself is allowed to be anything, or undefined; the limit doesn't care.

Why would you ever ask that weird question? Because some of the most important quantities in mathematics come out as $0/0$ or $\infty - \infty$ when you plug in the value you want. Limits are a workaround: instead of plugging in directly, you sneak up on the answer.

Here's the canonical example. Consider the function

$$f(x) = \frac{x^2 - 1}{x - 1}.$$

A function with a hole in it

$f(x)$
The output of our function when you feed in $x$.
$x^2$
"$x$ squared" — multiply $x$ by itself.
$x - 1$
Just the input minus one.
$\dfrac{\cdot}{\cdot}$
Division. The horizontal bar means divide the top by the bottom.

The problem. If you plug in $x = 1$, you get $\frac{0}{0}$, which is undefined — division by zero. The function has no output at $x = 1$. But for every $x$ close to 1 (like 0.99 or 1.01), the output is well-defined. What's it heading toward?

Try plugging in $x$ values that sneak up on 1 from below and above:

As $x$ gets closer to 1, $f(x)$ gets closer to 2. It never reaches 2 exactly — the function isn't defined at $x=1$ at all — but the trend is unmistakable. We say $\lim_{x \to 1} f(x) = 2$.
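The sneak-up is easy to reproduce in code. Here's a quick Python check (the specific sample values are my choice, not part of the page's figure):

```python
# Evaluate f(x) = (x^2 - 1) / (x - 1) at inputs approaching 1 from
# both sides. f is undefined AT x = 1, but the trend toward 2 is clear.
def f(x):
    return (x**2 - 1) / (x - 1)

for x in [0.9, 0.99, 0.999, 1.001, 1.01, 1.1]:
    print(f"f({x}) = {f(x):.6f}")
```

Every printed value hugs 2 more tightly as $x$ hugs 1, from either side, even though `f(1)` itself would raise a `ZeroDivisionError`.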

You can also see it algebraically. The numerator $x^2 - 1$ factors as $(x-1)(x+1)$, so

$$\frac{x^2 - 1}{x - 1} = \frac{(x-1)(x+1)}{x-1} = x + 1 \quad (\text{when } x \ne 1).$$

Canceling a common factor

$(x-1)(x+1)$
A product. This is "$x$ minus one" times "$x$ plus one". Expand it and you get $x^2 - 1$ — that's called the difference-of-squares identity.
$\dfrac{(x-1)(x+1)}{x-1}$
The top and bottom share a factor of $(x-1)$. When $x \ne 1$ this factor is non-zero, so we can cancel it.
$x + 1$
The simplified form — identical to the original everywhere except at $x = 1$, where the original is undefined and this one equals 2.

The intuition. The function is a perfectly smooth line $y = x + 1$ with a single missing pixel at $x = 1$. The limit "fills in" what that pixel ought to be. Limits are a formal way to look through the hole.

$\epsilon$–$\delta$: the idea without the formalism

Textbook calculus, starting with Cauchy and Weierstrass in the 1800s, makes "arbitrarily close" precise using two Greek letters: $\epsilon$ (epsilon) and $\delta$ (delta). Both just mean "a tiny positive number." The full definition looks terrifying the first time you see it, so we'll state the idea in English first.

THE $\epsilon$–$\delta$ GAME

Someone challenges you: "I bet $f(x)$ can't get within $\epsilon = 0.001$ of the limit $L$." You reply: "Give me any $\epsilon$ you want, and I'll hand back a $\delta$. If you stay within $\delta$ of $a$ (but not equal to $a$), I promise $f(x)$ is within $\epsilon$ of $L$." If you can always win that game, no matter how small $\epsilon$ they pick, then the limit is $L$. That's it.

Stated formally: $\lim_{x \to a} f(x) = L$ means that for every $\epsilon > 0$ there exists a $\delta > 0$ such that whenever $0 < |x - a| < \delta$, we have $|f(x) - L| < \epsilon$. You do not need to be able to write $\epsilon$–$\delta$ proofs to do calculus — almost nobody who uses calculus in practice does. You do need the mental picture: limits are a promise that you can always get as close as someone demands.

One-sided limits, infinite limits, continuity

A few variations you'll run into:

One-sided limits. $\lim_{x \to a^-} f(x)$ lets $x$ approach $a$ from the left only; $\lim_{x \to a^+} f(x)$ from the right only. The ordinary two-sided limit exists exactly when both one-sided limits exist and agree.

Infinite limits. $\lim_{x \to a} f(x) = \infty$ means $f(x)$ grows without bound as $x$ nears $a$ (think $1/x^2$ near $0$). Flipping it around, $\lim_{x \to \infty} f(x) = L$ asks what value $f$ settles toward as $x$ grows arbitrarily large.

A function $f$ is continuous at $a$ if three things hold: $f(a)$ exists, $\lim_{x \to a} f(x)$ exists, and the two are equal. In pictures: no gaps, no jumps, no holes — you can draw the graph without lifting your pencil. Most functions you deal with in practice (polynomials, exponentials, sines and cosines, their sums and products) are continuous everywhere. Functions with pieces glued together sometimes aren't, and that's where the interesting behavior lives.

4. Derivatives — instantaneous rates of change

Back to the car. You want to know the speed "at an instant" — at a single moment $t$. Speed is "distance divided by time", but during an instant no time passes and no distance is covered. So you approximate: measure how far you went during a tiny interval of time, and divide.

Let $s(t)$ be your position at time $t$. Over the interval from $t$ to $t + \Delta t$, your average speed is

$$\text{average speed} = \frac{s(t + \Delta t) - s(t)}{\Delta t}.$$

Average rate of change (a.k.a. secant slope)

$s(t)$
Your position as a function of time. Plug in a time, get back a location (say, meters from your front door).
$t$
The starting time.
$\Delta t$
A small time interval — "how much time has passed." The capital Greek delta means "change in."
$s(t + \Delta t) - s(t)$
How far you've moved during that interval (final position minus starting position).
$\dfrac{\cdot}{\Delta t}$
Distance divided by time — that's speed. Standard units would be meters per second.

Picture. Plot $s$ against $t$. Pick two points on the curve at times $t$ and $t + \Delta t$. Draw the straight line through them. This expression is the slope of that line — rise over run. It's called a "secant" line because it slices through the curve at two points.

The secant slope is the average speed over an interval. Instantaneous speed is what happens as you shrink the interval to nothing. You take a limit:

$$s'(t) = \lim_{\Delta t \to 0} \frac{s(t + \Delta t) - s(t)}{\Delta t}.$$

The definition of the derivative

$s'(t)$
Read "s prime of t." A new function of $t$ whose value at each time is the instantaneous rate of change of $s$ at that time.
$\lim_{\Delta t \to 0}$
Take the limit as $\Delta t$ shrinks to zero. Remember: we never actually set $\Delta t$ equal to zero (that would be division by zero); we sneak up on it.
$\dfrac{s(t+\Delta t) - s(t)}{\Delta t}$
The average speed over the interval $[t, t+\Delta t]$ — the secant slope.

The geometric punchline. As $\Delta t$ shrinks, the two points you drew through get closer and closer. The secant line rotates and settles into a final position: the tangent line — the straight line that just grazes the curve at the single point $(t, s(t))$. The derivative is the slope of that tangent line. Instantaneous rate of change = slope of the tangent line. Two names for the same thing.

Replacing $s$ with a generic function $f$, you get the definition you'll see in every textbook:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$

Definition of the derivative (generic form)

$f$
Any function of one real variable.
$f'(x)$
The derivative of $f$, evaluated at $x$. This is itself a function — plug in any $x$, get back a number.
$h$
The small change in input, same role as $\Delta x$ or $\Delta t$ above. Using a single letter saves space.
$f(x + h) - f(x)$
How much the output changed when we bumped the input by $h$.
$\dfrac{f(x+h) - f(x)}{h}$
Average rate of change over the interval from $x$ to $x+h$. Geometrically: slope of the secant line between those two points on the graph.

What the derivative "is." It's a machine. Feed in a function, get back a new function. The new function tells you the slope of the original at every point. If the original rises, the derivative is positive. If it falls, negative. If it has a peak or valley, the derivative is zero there. This last fact is why optimization uses derivatives — zero slope means "can't locally go up or down" which means "extremum."
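The limit definition is directly computable: evaluate the difference quotient for smaller and smaller $h$ and watch it converge. A minimal sketch using $f(x) = x^2$ (whose exact derivative, by the rules coming up, is $2x$, so $f'(3) = 6$):

```python
# Approximate f'(3) for f(x) = x^2 via the definition:
# the difference quotient (f(x+h) - f(x)) / h for shrinking h.
def f(x):
    return x**2

for h in [0.1, 0.01, 0.001, 1e-6]:
    q = (f(3 + h) - f(3)) / h
    print(f"h = {h:<8g} quotient = {q:.8f}")
```

For this particular $f$ the quotient works out to exactly $6 + h$, so you can see the error shrink in lockstep with $h$.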

Notation, and why it's a mess

Four notations you will see, all meaning the same thing:

$$f'(x) \;=\; \frac{df}{dx} \;=\; \dot{f}(x) \;=\; Df(x).$$

Four ways to spell the same idea

$f'(x)$
Lagrange notation. A literal prime mark on the function name. Compact and standard.
$\dfrac{df}{dx}$
Leibniz notation. Looks like a fraction; isn't one, really, but behaves like one in many useful ways. The "$df$" and "$dx$" are meant to evoke infinitesimally small changes. Tells you what we're differentiating ($f$) with respect to what ($x$), which matters when more than one variable is around.
$\dot{f}(x)$
Newton's notation. The dot over the letter means "derivative with respect to time." Mostly used in physics.
$Df(x)$
Operator notation. $D$ is the "differentiation operator" — a function that takes a function and returns its derivative. Mostly used in advanced or computational settings.

Why the mess? Historical politics. Newton and Leibniz invented calculus in parallel, each with their own notation. British mathematicians stuck with Newton's dots for a century out of loyalty and fell behind the Continent as a result. Leibniz's $\frac{df}{dx}$ won on merit: it's more flexible, and it makes the chain rule look like a fraction cancellation. But everybody still uses $f'$ for brevity. So you will see all four, sometimes on the same page.

The basic rules (the ones you'll use constantly)

You rarely apply the limit definition in practice. Instead you memorize a handful of rules that cover most functions. Here are the essentials.

The power rule. For any number $n$,

$$\frac{d}{dx}\bigl(x^n\bigr) = n \, x^{n-1}.$$

Power rule

$\dfrac{d}{dx}$
"The derivative with respect to $x$ of …" — an operator applied to what follows.
$x^n$
$x$ to the power $n$. For positive integer $n$, this is $x$ multiplied by itself $n$ times. For non-integer $n$, it's defined through exponentials.
$n \, x^{n-1}$
The result: the exponent drops out front as a coefficient and the new exponent is one less.

How to remember. "Bring the exponent down, subtract one from the exponent." So $\frac{d}{dx}(x^3) = 3x^2$, $\frac{d}{dx}(x^7) = 7x^6$, and $\frac{d}{dx}(x) = 1 \cdot x^0 = 1$. A constant $c$ can be written $c \cdot x^0$, and the rule gives $0 \cdot c \cdot x^{-1} = 0$ — the derivative of a constant is zero, which makes sense because a constant function has slope zero everywhere.

Sum and constant-multiple rules. Derivatives distribute over addition, and constants pull out:

$$\frac{d}{dx}\bigl(f(x) + g(x)\bigr) = f'(x) + g'(x), \qquad \frac{d}{dx}\bigl(c \, f(x)\bigr) = c \, f'(x).$$

Linearity of the derivative

$f(x), g(x)$
Two functions of $x$.
$c$
A constant — a number that doesn't depend on $x$, like $3$ or $\pi$.
$f'(x) + g'(x)$
Differentiate each piece separately, then add.

Why it works. If two cars are moving, the rate of change of their combined position is the sum of their individual rates. Obvious when you phrase it that way. This linearity is why differentiation plays well with linear algebra — it's itself a linear operator.

Product rule. When you multiply two functions, you do not just multiply the derivatives. The correct rule is:

$$\frac{d}{dx}\bigl(f(x) \, g(x)\bigr) = f'(x) \, g(x) + f(x) \, g'(x).$$

Product rule

$f(x) \, g(x)$
The product of two functions. For example $x^2 \cdot \sin x$.
$f'(x) \, g(x)$
"First one's derivative times the second one, untouched."
$f(x) \, g'(x)$
"First one, untouched, times the second one's derivative."

Why this shape. Think of a rectangle of area $A = f \cdot g$, where both sides are changing. Bumping $f$ slightly grows the rectangle by a strip of width $g$ (area $\Delta f \cdot g$). Bumping $g$ grows it by a strip of width $f$ (area $f \cdot \Delta g$). There's also a tiny corner bit $\Delta f \cdot \Delta g$ — but in the limit, that corner is a second-order small quantity and vanishes. Only the two strips survive. That's the product rule.

Quotient rule. For division, apply the product rule to $f \cdot (1/g)$ and simplify — or just memorize:

$$\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}.$$

Quotient rule

$\dfrac{f(x)}{g(x)}$
One function divided by another — assuming $g(x) \ne 0$, since division by zero is undefined.
$f'(x)\,g(x) - f(x)\,g'(x)$
Numerator: "derivative of top times bottom, minus top times derivative of bottom." Note the minus sign — the order matters here, unlike the product rule.
$g(x)^2$
Denominator: the bottom, squared.

Mnemonic. "Low dee-high minus high dee-low, over the square of what's below." You will say this sentence in your head for the rest of your life.

Worked example

Let's differentiate $f(x) = 3x^4 - 5x^2 + 7$.

By the sum and constant-multiple rules, we can handle each term separately. By the power rule, the derivative of $3x^4$ is $3 \cdot 4x^3 = 12x^3$, the derivative of $-5x^2$ is $-5 \cdot 2x = -10x$, and the derivative of the constant $7$ is $0$.

So $f'(x) = 12 x^3 - 10 x$. That's the slope of the tangent line to $f$ at any $x$. At $x = 1$, the slope is $12 - 10 = 2$. At $x = 0$, the slope is $0$ (the curve is momentarily flat). At $x = -1$, the slope is $-12 + 10 = -2$ (the curve is descending).
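You can cross-check a hand-computed derivative numerically with a central difference, which approximates the slope from two nearby points. A quick sanity check of the worked example:

```python
# Verify f'(x) = 12x^3 - 10x for f(x) = 3x^4 - 5x^2 + 7 by comparing
# against a central difference (f(x+h) - f(x-h)) / (2h), which is an
# accurate slope estimate for small h.
f = lambda x: 3*x**4 - 5*x**2 + 7
fprime = lambda x: 12*x**3 - 10*x

h = 1e-5
for x in [-1.0, 0.0, 1.0, 2.0]:
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(f"x = {x:5.1f}  numeric = {numeric:12.6f}  formula = {fprime(x):12.6f}")
```

The two columns agree to several decimal places at every test point, including the slopes $2$, $0$, and $-2$ quoted above.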

5. The chain rule — differentiating compositions

The chain rule deserves its own section because it is the most important rule in calculus for anything computational, and because it's the one most students get wrong the first few times.

Suppose you have a composition — a function inside another function. Example: $h(x) = \sin(x^2)$. Here you first square $x$, then take the sine of the result. In function notation, $h = f \circ g$ where $g(x) = x^2$ (the inside) and $f(u) = \sin(u)$ (the outside).

The chain rule says:

$$\frac{d}{dx}\bigl(f(g(x))\bigr) = f'(g(x)) \cdot g'(x).$$

Chain rule (Lagrange form)

$f(g(x))$
A function of a function — first apply $g$ to $x$, then apply $f$ to the result.
$f'(g(x))$
The derivative of the outer function $f$, evaluated at the inner function's output $g(x)$. Read it as "$f$-prime, with the whole expression $g(x)$ plugged in where $x$ would normally go."
$g'(x)$
The derivative of the inner function at $x$.
$\cdot$
Multiplication. The chain rule is a product of two derivatives, not a sum.

Pictorial intuition. Imagine gears. The inner function $g$ turns the inside gear — a tiny bump in $x$ causes a bump of size $g'(x)$ in $u = g(x)$. That bump then drives the outer gear: a bump of $g'(x)$ in $u$ causes a bump of $f'(u) \cdot g'(x)$ in the final output. You multiply the gear ratios. That's the chain rule.

In Leibniz notation the chain rule looks even simpler — almost like fraction cancellation:

$$\frac{df}{dx} = \frac{df}{du} \cdot \frac{du}{dx}, \quad \text{where } u = g(x).$$

Chain rule (Leibniz form)

$u$
The intermediate variable — whatever the inner function returns. Here $u = g(x)$.
$\dfrac{df}{du}$
How fast $f$ changes per unit change in $u$ (the outer rate).
$\dfrac{du}{dx}$
How fast $u$ changes per unit change in $x$ (the inner rate).
$\dfrac{df}{dx}$
How fast $f$ changes per unit change in $x$, the quantity we want.

It's literally unit cancellation. Read the right side as "(change in $f$ per change in $u$) times (change in $u$ per change in $x$)." The "$du$"s visually cancel and you're left with "change in $f$ per change in $x$." Leibniz designed the notation exactly so that this would work.

Back to $h(x) = \sin(x^2)$. Let $u = x^2$ so $\frac{du}{dx} = 2x$. Then $h = \sin u$ so $\frac{dh}{du} = \cos u = \cos(x^2)$. Multiply:

$$h'(x) = \cos(x^2) \cdot 2x = 2x\cos(x^2).$$

Chain rule worked example

$\cos(x^2)$
The derivative of the outer function (sine) evaluated at the inner output. The derivative of sine is cosine.
$2x$
The derivative of the inner function ($x^2$), by the power rule.
$2x\cos(x^2)$
Their product. The answer.

Common bug. Beginners often stop at $\cos(x^2)$ and forget to multiply by the derivative of the inside. That missing factor is what separates someone who passes calculus from someone who doesn't. Always ask: "what's the derivative of what's inside?"
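A numeric check catches exactly this bug: the wrong answer $\cos(x^2)$ and the right answer $2x\cos(x^2)$ disagree with the measured slope in an obvious way. A small sketch:

```python
import math

# Compare the correct chain-rule derivative of h(x) = sin(x^2),
# namely 2x*cos(x^2), against a central-difference slope estimate.
h = lambda x: math.sin(x**2)
hprime = lambda x: 2 * x * math.cos(x**2)

eps = 1e-6
for x in [0.5, 1.3, 2.0]:
    numeric = (h(x + eps) - h(x - eps)) / (2 * eps)
    assert abs(numeric - hprime(x)) < 1e-5   # correct answer matches
    # The "forgot the inside" answer cos(x^2) would fail this check.
```

Whenever a hand derivative feels shaky, this three-line comparison settles it.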

Why this matters for neural networks

A neural network is a giant composition of functions: take input $x$, multiply by a weight matrix, apply a nonlinear activation, multiply by another matrix, activate again, and so on. The whole network is $f_L \circ f_{L-1} \circ \cdots \circ f_1(x)$ for some large number of layers $L$. Training the network means adjusting the weights to reduce a loss — which means computing the derivative of the loss with respect to each weight deep inside the composition.

The chain rule, applied recursively from the output backward to the input, is exactly how those derivatives are computed. This recursive application is called backpropagation, and it's the algorithm that made modern deep learning tractable. Every time you see a training curve descending, the chain rule is quietly running underneath, multiplying gradients layer by layer. When it's your job to debug exploding or vanishing gradients, you are debugging long chain rule products with terms that are too big or too small.

If you remember nothing else from this page, remember this: the chain rule is how local information (slopes of individual pieces) combines into global information (the slope of the whole composed thing). That idea is everywhere in computation.
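Here's that "multiply gradients layer by layer" idea in miniature. This is a toy illustration of my own, not any framework's API: a one-weight "network" $\text{loss}(w) = (\tanh(wx) - y)^2$, differentiated by multiplying chain-rule factors from the outside in, exactly as backpropagation does:

```python
import math

# Toy backward pass: loss(w) = (tanh(w*x) - y)^2 for fixed data x, y.
# (Names w, x, y and the values below are made up for illustration.)
x, y, w = 0.7, 0.2, 1.5

a = w * x                  # inner linear step
t = math.tanh(a)           # activation
loss = (t - y)**2          # squared-error loss

dloss_dt = 2 * (t - y)     # derivative of the outer square
dt_da = 1 - t**2           # derivative of tanh
da_dw = x                  # derivative of the linear step w.r.t. w

# Chain rule: multiply the local factors to get the global gradient.
dloss_dw = dloss_dt * dt_da * da_dw

# Sanity check against a finite difference.
eps = 1e-6
loss_eps = (math.tanh((w + eps) * x) - y)**2
assert abs((loss_eps - loss) / eps - dloss_dw) < 1e-4
```

A real network has millions of these factors; "vanishing gradients" is what happens when a long product of terms like `dt_da` (each below 1) collapses toward zero.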

6. Interactive: watch the derivative trace itself out

Here's the payoff. Below is a plot of a function $f(x) = 0.4x^3 - 1.2x$ — a cubic with a bump and a dip. The slider moves a point $x_0$ along the curve. At each position, you see the tangent line to $f$ at $x_0$ in the top panel, and the tangent's slope plotted as a point on the derivative curve in the bottom panel.

Drag the slider slowly from left to right. You're watching the derivative get built in real time from the tangent slopes of the original.


Top: the function $f(x) = 0.4x^3 - 1.2x$ and its tangent line at $x_0$. Bottom: the derivative $f'(x) = 1.2x^2 - 1.2$, with a marker at $x_0$ that traces the curve as you drag.

Things to notice as you drag: at the bump ($x_0 = -1$) and the dip ($x_0 = 1$), the tangent line goes horizontal and the derivative marker crosses zero; wherever $f$ is rising, the marker sits above the axis, and wherever $f$ is falling, it sits below; and the derivative bottoms out at $x_0 = 0$, exactly where $f$ descends most steeply.

7. Uses of derivatives

Now that you know what a derivative is and how to compute one, here are four things people actually do with them.

7.1 Optimization: finding minima and maxima

At a local maximum or minimum of a smooth function, the tangent line is horizontal — slope zero. So if you're hunting for extrema, solve $f'(x) = 0$. Each solution is a candidate; you then check whether it's a peak, a valley, or a flat inflection by looking at the second derivative or testing points nearby.

Example. Find the minimum of $f(x) = x^2 - 4x + 7$. The derivative is $f'(x) = 2x - 4$. Set to zero: $2x - 4 = 0 \Rightarrow x = 2$. Plug back in: $f(2) = 4 - 8 + 7 = 3$. The minimum is $3$ at $x = 2$. You just did unconstrained optimization with high-school algebra, which is the whole idea.

For real-world problems with millions of parameters (like training neural networks), you can't solve $f'(x) = 0$ symbolically. Instead you use gradient descent: start at a random point, compute the derivative, take a small step in the direction of steepest decrease, repeat. The derivative tells you which way is downhill.
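Here's gradient descent run on the small example above, where we can see it land on the answer we already computed by hand. The step size 0.1 and starting point 10 are arbitrary choices for the sketch:

```python
# Gradient descent on f(x) = x^2 - 4x + 7, whose derivative is 2x - 4.
# Repeatedly step opposite the slope; the iterates converge to the
# minimum at x = 2 found analytically above.
fprime = lambda x: 2 * x - 4

x = 10.0          # arbitrary starting guess
step = 0.1        # arbitrary (small) step size
for _ in range(200):
    x -= step * fprime(x)

print(x)  # essentially 2.0
```

With this quadratic, each update is $x \mapsto 0.8x + 0.4$, a contraction toward the fixed point $x = 2$ — which is why the loop converges so reliably here. Real loss surfaces are messier, but the move is the same.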

7.2 Linear approximation

Near a point $x_0$, any smooth function is approximately linear — it looks like its tangent line. That approximation is:

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0).$$

Linearization

$f(x_0)$
The value of the function at your chosen base point.
$f'(x_0)$
The slope of the tangent line there.
$(x - x_0)$
How far you've moved from the base point.
$\approx$
"Approximately equal to." The approximation is exact at $x_0$ and gets worse as you move away, but for small moves it's often excellent.

Why this matters. Every time you hear the phrase "first-order approximation" or "small perturbation," this is the formula being invoked. Physics builds entire simulations out of it. Numerical methods like Newton's method for root-finding (next) use it as their core step. When you see $f(x+h) \approx f(x) + h f'(x)$ in a paper, no new idea is being introduced — it's just linearization rearranged.
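To see how good the approximation is, here's the linearization of $\sqrt{x}$ around $x_0 = 100$ (my choice of example function and base point), where $f(x_0) = 10$ and $f'(x_0) = \tfrac{1}{2\sqrt{x_0}} = 0.05$:

```python
import math

# Linear approximation f(x) ≈ f(x0) + f'(x0)(x - x0) for f = sqrt,
# around the base point x0 = 100.
x0 = 100.0
f0 = math.sqrt(x0)            # 10.0
slope = 1 / (2 * math.sqrt(x0))  # 0.05

for x in [101, 104, 110]:
    approx = f0 + slope * (x - x0)
    exact = math.sqrt(x)
    print(f"sqrt({x}) ≈ {approx:.6f}   exact {exact:.6f}")
```

One step away from the base point the estimate is good to about four decimal places, and the error grows as you drift further — exactly the behavior the $\approx$ promises.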

7.3 Newton's method for root-finding

You want to solve $f(x) = 0$ but the equation is too nasty for algebra. Newton's method: guess a value $x_n$, draw the tangent line there, and see where the tangent crosses zero. Use that crossing as your next guess, $x_{n+1}$. Repeat.

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

Newton's iteration

$x_n$
Your current guess for the root. $n$ is the iteration counter — $x_0$ is the initial guess, $x_1$ the first refinement, and so on.
$f(x_n)$
The function value at your current guess. You want this to be zero.
$f'(x_n)$
The slope of the tangent at your current guess.
$\dfrac{f(x_n)}{f'(x_n)}$
A correction term, derived by asking: "where does the tangent line cross zero?" Rise over run means the run you need is $-f(x_n)/f'(x_n)$.

Why it converges fast. When you're close to a root, Newton's method typically doubles the number of correct digits each iteration — quadratic convergence. It's the workhorse behind square-root and reciprocal instructions in hardware, and behind iterative solvers in numerical analysis. The catch: it needs a decent starting guess and a non-zero derivative nearby, or it can bounce off into the weeds.
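Newton's iteration is a four-line loop. A sketch that solves $x^2 - 2 = 0$, i.e. computes $\sqrt{2}$ (starting guess 1.0 is an arbitrary but safe choice):

```python
# Newton's method for f(x) = x^2 - 2, f'(x) = 2x.
# Update: x_{n+1} = x_n - f(x_n)/f'(x_n) = x_n - (x_n^2 - 2)/(2 x_n).
x = 1.0  # initial guess
for _ in range(6):
    x = x - (x**2 - 2) / (2 * x)
    print(x)
```

Print the iterates and you can watch quadratic convergence in action: the count of correct digits roughly doubles each line, hitting machine precision within a handful of steps.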

7.4 Rates of change everywhere

Anywhere a quantity changes with respect to another, the derivative is the name of that rate. Velocity is the derivative of position with respect to time. Acceleration is the derivative of velocity. Marginal cost in economics is the derivative of total cost with respect to quantity produced. Power in physics is the derivative of energy with respect to time. Current is the derivative of charge. The derivative is a shape-shifter — the symbols stay the same and the real-world meaning comes from what function you're differentiating.

8. Integrals — adding up infinitely many pieces

Back to the pond with the wiggly edge. You want its area. Here's the plan: approximate the pond with a bunch of thin rectangles, add up their areas, then make the rectangles thinner and thinner and take a limit. That's integration.

Specifically, suppose you have a function $f(x) \geq 0$ on an interval $[a, b]$ and you want the area of the region trapped between the curve $y = f(x)$, the $x$-axis, and the vertical lines $x = a$ and $x = b$.

Slice the interval into $n$ equal subintervals of width $\Delta x = (b-a)/n$. On each subinterval, pick a sample $x$-value $x_i^*$ and build a rectangle of height $f(x_i^*)$ and width $\Delta x$. The rectangle approximates the sliver of area under the curve in that subinterval. Add up all $n$ rectangles:

$$S_n = \sum_{i=1}^n f(x_i^*) \, \Delta x.$$

Riemann sum

$\sum_{i=1}^n$
"Sum from $i = 1$ to $n$." The capital Greek sigma means "add up all these terms as the index $i$ runs from 1 to $n$." So $\sum_{i=1}^3 i = 1 + 2 + 3 = 6$.
$n$
The number of rectangles you're using. Bigger $n$ means thinner rectangles and a better approximation.
$\Delta x$
The width of each rectangle, equal to $(b-a)/n$.
$x_i^*$
The sample point inside the $i$-th subinterval. Typical choices: the left endpoint, the right endpoint, or the midpoint. For continuous $f$, it doesn't matter in the limit.
$f(x_i^*)$
The height of the $i$-th rectangle, which is the function's value at the sample point.
$f(x_i^*) \, \Delta x$
Height times width — the area of one rectangle.
$S_n$
The sum of those rectangle areas — an approximation to the true area under the curve.

Why this works. For modest $n$, you're missing some area (where the curve rises above a rectangle's top) and over-counting some (where the curve drops below). For huge $n$, those errors shrink faster than you can blink. The curve's wiggles get drowned by the sheer thinness of each rectangle.

Now take the limit as $n \to \infty$. If that limit exists and doesn't depend on how you chose the sample points, we call it the definite integral of $f$ from $a$ to $b$, written:

$$\int_a^b f(x)\, dx = \lim_{n \to \infty} \sum_{i=1}^n f(x_i^*) \, \Delta x.$$

The definite integral

$\displaystyle\int$
The integral sign — a stretched-out "S" for "sum." Leibniz picked this letter on purpose to remind you that integration is a kind of infinite sum.
$a, b$
The lower and upper limits of integration — the left and right endpoints of the interval over which you're adding up.
$f(x)$
The integrand — the function you're integrating.
$dx$
Think of this as "an infinitesimally thin slice of width $dx$." It's the limit version of $\Delta x$. The "$dx$" also tells you the variable of integration — in this case, $x$.

Picture in your head. $\int_a^b f(x)\, dx$ is the total area under the curve $y = f(x)$ between $x = a$ and $x = b$. If $f$ dips below the axis, the area there counts as negative — the integral is a signed area. If you integrate a velocity, you get a displacement. If you integrate a rate of flow, you get a total volume. Integration is the machinery for "adding up a rate over a duration to get a total."
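The Riemann-sum recipe translates directly into NumPy. A sketch for $f(x) = x^2$ on $[0, 3]$ using midpoint sample points (the exact area is $3^3/3 = 9$, a fact the Fundamental Theorem below will make routine):

```python
import numpy as np

# Riemann sum for f(x) = x^2 on [a, b] with n rectangles,
# sampling f at the midpoint of each subinterval.
def riemann(n, a=0.0, b=3.0):
    dx = (b - a) / n                        # width of each rectangle
    mids = a + (np.arange(n) + 0.5) * dx    # midpoint sample points x_i*
    return float(np.sum(mids**2 * dx))      # sum of height * width

for n in [10, 100, 1000]:
    print(n, riemann(n))
```

As $n$ grows the sums close in on 9, and swapping midpoints for left or right endpoints changes the approach route but not the destination — just as the definition promises for continuous $f$.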

Antiderivatives — integration as the reverse of differentiation

There is a second, completely different use of the word "integral" — the antiderivative. An antiderivative of $f$ is a function $F$ whose derivative is $f$. That is, $F'(x) = f(x)$. You might also hear this called the indefinite integral and write it

$$\int f(x)\, dx = F(x) + C.$$

Indefinite integral (antiderivative)

$F(x)$
Any function whose derivative is $f(x)$. For $f(x) = 2x$, one antiderivative is $F(x) = x^2$.
$C$
An arbitrary constant — the "constant of integration." Since the derivative of a constant is zero, $x^2 + 1$, $x^2 - 7$, and $x^2 + \pi$ are all antiderivatives of $2x$. We write the $+ C$ to indicate the whole family.
No $a, b$
The indefinite integral has no limits on the integral sign — that's how you tell it apart from a definite integral. It returns a function, not a number.

Easy to confuse, important to separate. Definite integral $\int_a^b f(x) dx$ is a number (an area). Indefinite integral $\int f(x) dx$ is a function (plus a constant). They are linked by the Fundamental Theorem of Calculus, which is the next section, and which explains why both things got named "integral." Until that theorem is on the table, the link is just asserted: "reversing differentiation" and "computing areas" happen to be the same operation. Nobody would've guessed.

9. The Fundamental Theorem of Calculus

Named "Fundamental" because it's the bridge. It comes in two parts.

FTC PART 1

If $f$ is continuous on $[a, b]$ and we define a new function $F$ by $F(x) = \int_a^x f(t)\, dt$ — "the area from $a$ up to $x$" — then $F$ is differentiable, and $F'(x) = f(x)$. In words: differentiating an integral undoes the integration.

$$\frac{d}{dx} \int_a^x f(t) \, dt = f(x).$$

FTC Part 1

$\int_a^x f(t)\, dt$
The area under $f$ from the fixed lower limit $a$ up to a variable upper limit $x$. As $x$ changes, so does the area; the result is a function of $x$.
$t$
A dummy variable of integration. It's just a placeholder for the input to $f$ as we add up. We use $t$ instead of $x$ here because $x$ is already being used as the upper limit — reusing the same letter would be confusing.
$\dfrac{d}{dx}$
Differentiate the whole expression with respect to $x$.
$f(x)$
The original integrand, evaluated at the upper limit.

Why it's true (sketch). Suppose you've already found the area from $a$ to $x$, and you nudge $x$ slightly to $x + h$. You add a new strip of area, approximately $f(x) \cdot h$ (height times thin width). Divide by $h$ to get the rate of change, and you get $f(x)$. Take the limit as $h \to 0$ and the approximation becomes exact. The rate at which the area is accumulating is whatever height the curve currently has.

FTC PART 2

If $F$ is any antiderivative of $f$ on $[a, b]$, then $\int_a^b f(x)\, dx = F(b) - F(a)$. In words: to compute a definite integral, find any antiderivative and subtract its values at the endpoints.

$$\int_a^b f(x)\, dx = F(b) - F(a), \quad \text{where } F'(x) = f(x).$$

FTC Part 2

$\int_a^b f(x)\, dx$
The definite integral — the exact, limit-based area under $f$ from $a$ to $b$.
$F$
Any antiderivative of $f$ — any function whose derivative is $f$. Any choice works; the $+C$ cancels when you subtract.
$F(b) - F(a)$
The difference of the antiderivative at the two endpoints. Sometimes written $F(x) \big|_a^b$ as shorthand.

Why this is a miracle. The left side is a horrifying infinite sum — slice into $n$ pieces, take a limit, etc. The right side is two function evaluations and a subtraction. The Fundamental Theorem says they're equal. Every time you compute an area in a homework problem by finding an antiderivative and subtracting, you're using it. Archimedes would have killed for this theorem.

Here's FTC Part 2 in action. We want the area under $f(x) = x^2$ from $0$ to $3$.

  1. Find an antiderivative. Since $\frac{d}{dx}\left(\frac{x^3}{3}\right) = x^2$, we can take $F(x) = \frac{x^3}{3}$.
  2. Evaluate at the endpoints: $F(3) = \frac{27}{3} = 9$ and $F(0) = 0$.
  3. Subtract: $\int_0^3 x^2 \, dx = 9 - 0 = 9$.

That's the exact area under the parabola from 0 to 3. No infinite sums, no limits — just the antiderivative at two points. It really is a miracle the first time you see it.
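To see just how much work the theorem saves, here's the left side of the equation done the hard way — a sketch using a plain left-endpoint Riemann sum (the helper name `riemann` is mine, not from the text), watching it crawl toward the answer FTC Part 2 produces in one line.

```python
def riemann(f, a, b, n):
    # Left-endpoint Riemann sum: n equal slices of width dx.
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(n)) * dx

f = lambda x: x ** 2
for n in (10, 100, 1000, 100_000):
    print(n, riemann(f, 0.0, 3.0, n))
# The sums creep toward 9 from below -- the value F(3) - F(0) = 9
# that FTC Part 2 hands over with two evaluations and a subtraction.
```

With $n = 100{,}000$ slices the brute-force sum still hasn't quite landed on 9; the antiderivative route is exact immediately.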

Taken together, the two parts say: differentiation and integration are inverse operations. Part 1 says differentiating an area function gives back the integrand. Part 2 says integrating a derivative gives back the original function (up to endpoints). One operation undoes the other.

10. Techniques of integration

Differentiation is mostly mechanical — apply rules, get an answer. Integration is harder. Many functions don't have antiderivatives expressible in elementary form at all ($e^{-x^2}$ is a famous example — its antiderivative is the error function, which has to be defined separately). For the cases that do work out, there are three tricks that cover most of the territory.
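The $e^{-x^2}$ example is worth seeing concretely. No elementary antiderivative exists, but definite integrals of it are still perfectly computable numerically, and Python's standard library even ships the named antiderivative: `math.erf`, defined as $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2} dt$. A quick cross-check (the helper `trap` is an ad-hoc trapezoidal rule, mirroring section 12):

```python
import math

def trap(f, a, b, n=10_000):
    # Plain trapezoidal rule, as in section 12.
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * dx)
    return dx * total

# erf(x) = (2 / sqrt(pi)) * integral of e^{-t^2} from 0 to x.
numeric = (2 / math.sqrt(math.pi)) * trap(lambda t: math.exp(-t * t), 0.0, 1.0)
print(numeric, math.erf(1.0))   # both ≈ 0.8427
```

"No elementary antiderivative" means no closed-form formula built from powers, exponentials, logs, and trig — not that the area doesn't exist or can't be computed.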

10.1 u-substitution (the chain rule, run backwards)

Any time you see a composition inside an integral and the derivative of the inner function also hanging around, u-substitution works. You rename the inner piece as $u$ and convert the whole integral to one in $u$.

Example. Compute $\int 2x \cos(x^2)\, dx$. Let $u = x^2$. Then $\frac{du}{dx} = 2x$, which we can rearrange as $du = 2x \, dx$. The integral becomes

$$\int \cos(u) \, du = \sin(u) + C = \sin(x^2) + C.$$

u-substitution

$u$
A new variable that stands for the inner function — in this example, $u = x^2$.
$du$
The differential of $u$, equal to $\frac{du}{dx}\, dx$. It bookkeeps the scaling factor that changing variables introduces.
$\sin(u) + C$
The antiderivative of $\cos(u)$, since $\frac{d}{du}(\sin u) = \cos u$.
$\sin(x^2) + C$
Substitute $x^2$ back in for $u$ at the end — you want your answer in the original variable.

What it really is. The chain rule says $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$. u-substitution is that equation read right-to-left: if you see $\cos(x^2) \cdot 2x$ in an integral, you know the antiderivative is $\sin(x^2)$. Same rule, different direction.
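You can check the substitution result the same way you'd check any antiderivative: differentiate it and confirm you get the integrand back. A sketch using a central difference (variable names here are mine):

```python
import math

# Claimed antiderivative from the substitution: sin(x^2).
F = lambda x: math.sin(x * x)
f = lambda x: 2 * x * math.cos(x * x)   # the original integrand

# If F is right, its numerical derivative should match f everywhere.
h = 1e-6
for x in (0.3, 1.0, 2.0):
    approx = (F(x + h) - F(x - h)) / (2 * h)
    print(x, approx, f(x))
```

At every sample point the two columns agree, which is exactly the chain-rule identity $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$ read forwards.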

10.2 Integration by parts (the product rule, run backwards)

The product rule for derivatives gives $(uv)' = u'v + uv'$. Integrate both sides and rearrange, and you get:

$$\int u \, dv = u v - \int v \, du.$$

Integration by parts

$u$ and $v$
Two functions of $x$. You choose which part of the integrand to call $u$ and which to call $dv$.
$dv$
The "differential of $v$" — whatever $v$ was, $dv = v'(x)\, dx$.
$du$
Similarly, $du = u'(x)\, dx$.
$uv - \int v\, du$
The boundary term $uv$ minus a new integral you hope is simpler than the one you started with.

The trade. You swap one integral ($\int u\, dv$) for another ($\int v\, du$). The trick is to pick $u$ and $dv$ so that the new integral is easier than the original. If you pick badly, you just go in circles. A common heuristic: let $u$ be whatever gets simpler when you differentiate (like $\ln x$ or $x^n$), and let $dv$ be whatever you can integrate easily (like $e^x\, dx$ or $\sin x\, dx$).

Example. $\int x e^x \, dx$. Let $u = x$ (so $du = dx$) and $dv = e^x\, dx$ (so $v = e^x$). Then

$$\int x e^x \, dx = x \cdot e^x - \int e^x \, dx = x e^x - e^x + C = (x-1) e^x + C.$$

Integration by parts worked out

$x e^x$
The boundary term $uv$. Comes for free.
$\int e^x \, dx$
The new integral — much easier than the original. An antiderivative of $e^x$ is just $e^x$, since $e^x$ is its own derivative (the unique function with that property that also satisfies $f(0) = 1$).
$(x - 1)e^x + C$
The final answer, after cleanup. You can check it by differentiating: $\frac{d}{dx}[(x-1)e^x] = e^x + (x-1)e^x = x e^x$. Correct.

The pattern. Integration by parts is your tool for integrals where the integrand is a product of two different kinds of functions — polynomial times exponential, polynomial times trig, log times polynomial. It cuts the degree of one factor per application. Sometimes you have to apply it twice.
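As a sanity check on the worked example, here's the definite version $\int_0^1 x e^x\, dx$ computed both ways: from the antiderivative $(x-1)e^x$ found by parts, and by brute-force numerical integration (the helper `midpoint` is an ad-hoc midpoint-rule sum, not from the text).

```python
import math

def midpoint(f, a, b, n=10_000):
    # Midpoint-rule integral: sample each slice at its center.
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# Antiderivative found by parts: F(x) = (x - 1) e^x.
F = lambda x: (x - 1) * math.exp(x)
exact = F(1.0) - F(0.0)                       # = 0 - (-1) = 1
numeric = midpoint(lambda x: x * math.exp(x), 0.0, 1.0)
print(exact, numeric)   # both ≈ 1
```

Both routes give 1: the boundary term at $x = 1$ vanishes because of the $(x - 1)$ factor, and the term at $x = 0$ contributes $-(-1) = 1$.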

10.3 Partial fractions (for rational functions)

To integrate a rational function like $\frac{1}{(x-1)(x+2)}$, split it into simpler pieces you already know how to integrate. Algebra gives $\frac{1}{(x-1)(x+2)} = \frac{1/3}{x-1} - \frac{1/3}{x+2}$, and each piece is a standard form: $\int \frac{1}{x - a}\, dx = \ln|x - a| + C$. So the integral splits into two logarithms. The technique is called partial fraction decomposition, and it's how you turn scary rational functions into sums of logarithms and inverse tangents. You won't need to master it to follow this site; just know the name and what it's for.
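The decomposition above can be verified numerically too. A sketch on the interval $[2, 5]$ (chosen to avoid the singularity at $x = 1$; the helper `trap` mirrors section 12 and is not from the text): the antiderivative assembled from the two logarithms should agree with direct numerical integration of the original rational function.

```python
import math

def trap(f, a, b, n=10_000):
    # Plain trapezoidal rule, as in section 12.
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * dx)
    return dx * total

# From the decomposition: antiderivative = (1/3) * (ln|x - 1| - ln|x + 2|).
F = lambda x: (math.log(abs(x - 1)) - math.log(abs(x + 2))) / 3
exact = F(5.0) - F(2.0)                       # = (1/3) * ln(16/7)
numeric = trap(lambda x: 1 / ((x - 1) * (x + 2)), 2.0, 5.0)
print(exact, numeric)
```

The two values match: the scary rational integrand really did reduce to a difference of logarithms.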

11. Taylor series — polynomial approximation of everything

Here's a question that sounds crazy: can you approximate any smooth function near a point using just a polynomial? Turns out: yes, and the recipe uses derivatives.

The Taylor series of $f$ around a point $a$ is:

$$f(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots$$

Taylor series

$f(a)$
The constant term — the value of $f$ at the expansion point $a$. Matches $f$ at $x = a$.
$f'(a)(x - a)$
The linear term. Adding this gives you the tangent line through $(a, f(a))$ — a first-order approximation.
$f''(a)$
The second derivative at $a$ — the derivative of the derivative. Measures how the slope itself is changing; geometrically, the concavity (curving up or down).
$2!$
"2 factorial," equal to $2 \cdot 1 = 2$. In general $n! = n(n-1)\cdots 1$.
$(x - a)^n$
The $n$-th power of "distance from the expansion point." For $x$ close to $a$, higher powers shrink fast, so higher-order terms contribute less and less.
$f^{(n)}(a)$
The $n$-th derivative of $f$ evaluated at $a$. Computed by differentiating $n$ times in a row.
$\cdots$
Continues forever. The full series has a term for every $n = 0, 1, 2, 3, \dots$

What's going on. You're building a polynomial that matches $f$ at $a$: same value, same slope, same curvature, same rate-of-change-of-curvature, and so on. Each new term fixes one more derivative. For many functions, if you include enough terms, the polynomial gets arbitrarily close to $f$ on some interval around $a$. Practical consequence: you can replace a complicated function with a few terms of its Taylor series and get a cheap, accurate approximation.

Special case: when $a = 0$, the series is called a Maclaurin series:

$$f(x) = f(0) + f'(0)\, x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \cdots$$

Maclaurin series (Taylor around 0)

$f(0), f'(0), f''(0), \dots$
The function and all its derivatives evaluated at $x = 0$.
$x, x^2, x^3, \dots$
The powers of $x$ itself, since $(x - 0)^n = x^n$.

Three series you'll meet everywhere. For $e^x$: $1 + x + x^2/2 + x^3/6 + \cdots$. For $\sin x$: $x - x^3/6 + x^5/120 - \cdots$. For $\cos x$: $1 - x^2/2 + x^4/24 - \cdots$. Each converges for all real $x$. These expansions are how calculators compute transcendental functions, and how physicists expand potentials near equilibrium, and how numerical analysts design integration schemes. When in doubt — Taylor expand.
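How many terms "enough terms" means depends on how far $x$ is from the expansion point — the $(x - a)^n$ factor only shrinks quickly when $|x - a|$ is small. A sketch comparing partial sums of the $\sin x$ series at a small and a not-so-small input (the helper `sin_partial` is mine, not from the text):

```python
import math

def sin_partial(x, terms):
    # Partial sum of x - x^3/3! + x^5/5! - ...
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

for terms in (1, 2, 3, 5, 8):
    print(terms,
          sin_partial(0.5, terms) - math.sin(0.5),   # error near 0: tiny fast
          sin_partial(3.0, terms) - math.sin(3.0))   # error at 3: shrinks slower
```

At $x = 0.5$ a couple of terms already nail $\sin x$ to several digits; at $x = 3$ you need noticeably more before the error collapses. Same series, same convergence for all $x$ — just not at the same speed.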

Why this matters for the rest of the site: derivatives drive gradient descent and backpropagation, Taylor expansions justify the linear approximations those algorithms lean on, and numerical integration shows up whenever no closed form exists. The "See also" section at the end maps these connections out.

12. Calculus in code

Here are three calculus primitives in code: a finite-difference approximation of the derivative, a trapezoidal-rule numerical integral, and a Taylor expansion of $\sin x$. All three are tiny, all three are instructive.

calculus primitives
import numpy as np

# ---------- 1. Numerical derivative via central differences ----------
def deriv(f, x, h=1e-5):
    # Central difference is second-order accurate: error ~ h^2.
    # The symmetric version cancels the leading error term that
    # (f(x+h) - f(x)) / h would carry.
    return (f(x + h) - f(x - h)) / (2 * h)

# Try it on f(x) = x^3. True derivative is 3 x^2.
f  = lambda x: x ** 3
x0 = 2.0
print(f"approx f'(2) = {deriv(f, x0):.6f}")   # ~ 12.000000
print(f"exact  f'(2) = {3 * x0 ** 2}")         #   12.0

# ---------- 2. Numerical integral via the trapezoidal rule ----------
def integrate_trap(f, a, b, n=1000):
    # Split [a, b] into n panels. Each panel is a trapezoid whose
    # area is the average of its two endpoint heights times its width.
    xs = np.linspace(a, b, n + 1)
    ys = f(xs)
    dx = (b - a) / n
    return dx * (0.5 * ys[0] + ys[1:-1].sum() + 0.5 * ys[-1])

# Integrate x^2 from 0 to 3. Exact answer is 27/3 = 9.
print(f"trap    ≈ {integrate_trap(lambda x: x ** 2, 0, 3):.6f}")
print(f"exact   = 9.0")

# ---------- 3. Taylor series of sin(x) around 0 ----------
def taylor_sin(x, terms=10):
    # sin x = x - x^3/3! + x^5/5! - x^7/7! + ...
    total = 0.0
    sign  = 1.0
    fact  = 1.0
    power = x
    for k in range(terms):
        n = 2 * k + 1                      # 1, 3, 5, 7, ...
        if k > 0:
            power *= x * x
            fact  *= n * (n - 1)
        total += sign * power / fact
        sign  *= -1
    return total

print(f"taylor sin(1) = {taylor_sin(1.0):.8f}")
print(f"numpy  sin(1) = {np.sin(1.0):.8f}")
calculus primitives, stdlib only
import math

# Same primitives without NumPy, for readers without it installed.
# (The Taylor demo here expands e^x instead of sin x.)

def deriv(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

def integrate_trap(f, a, b, n=1000):
    dx = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * dx)
    return dx * total

def taylor_exp(x, terms=20):
    # e^x = sum_{k=0}^inf  x^k / k!
    total = 0.0
    term  = 1.0                              # k = 0 term is 1
    for k in range(terms):
        total += term
        term  *= x / (k + 1)                   # update to the next term
    return total

# Demo
print(f"d/dx (x^3) at x=2: {deriv(lambda x: x ** 3, 2.0):.6f}")
print(f"integral x^2 dx [0,3]: {integrate_trap(lambda x: x ** 2, 0, 3):.6f}")
print(f"taylor e^1 = {taylor_exp(1.0):.8f}  (vs math.e = {math.e:.8f})")

A caveat about these little routines: they aren't production code. Real libraries like scipy.integrate and jax.grad use cleverer algorithms (Gaussian quadrature, automatic differentiation). But every one of those clever algorithms is built on the same foundation: limits, derivatives, and the Fundamental Theorem.

13. Cheat sheet

Derivative definition

$f'(x) = \displaystyle\lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

Slope of the tangent line.

Power rule

$\displaystyle\frac{d}{dx}(x^n) = n x^{n-1}$

Exponent drops down; reduce by one.

Product rule

$(fg)' = f'g + fg'$

Derivative of first times second, plus first times derivative of second.

Quotient rule

$\left(\dfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}$

Low dee-high minus high dee-low, over low squared.

Chain rule

$\dfrac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$

Outer's derivative at inner, times inner's derivative.

Trig derivatives

$(\sin x)' = \cos x$,  $(\cos x)' = -\sin x$

Memorize these two; the rest follow.

Exponential

$(e^x)' = e^x$,  $(\ln x)' = 1/x$

$e^x$ is its own derivative. That's the defining property.

FTC Part 2

$\displaystyle\int_a^b f(x)\, dx = F(b) - F(a)$

For any antiderivative $F$ of $f$.

u-substitution

$\int f(g(x)) g'(x)\, dx = \int f(u)\, du$

Chain rule in reverse.

Integration by parts

$\int u\, dv = uv - \int v\, du$

Product rule in reverse.

Taylor series

$f(x) = \displaystyle\sum_{n=0}^{\infty} \dfrac{f^{(n)}(a)}{n!}(x-a)^n$

Polynomial approximation matching all derivatives at $a$.

Linear approximation

$f(x) \approx f(a) + f'(a)(x - a)$

The first-order Taylor truncation; it's what physicists and numerical analysts mean by "to first order."

See also

Linear algebra

Calculus handles one variable at a time. Linear algebra handles many at once. Combined, they give you multivariable calculus — the language of gradient descent and optimization in high dimensions.

Numerical analysis

What happens when you compute derivatives and integrals with floating-point arithmetic. Error analysis, convergence rates, stability — the engineering side of calculus.

Gradient descent

Calculus's killer app in machine learning. Use derivatives to find the direction of steepest decrease in a loss function, then take a step that way. Repeat a million times.

Backpropagation

The chain rule, applied layer by layer through a neural network, bookkept so you only compute each gradient once. Without it, deep learning would still be a curiosity.

Activation functions

ReLU, GELU, SiLU, softmax — each is a function whose derivative matters as much as the function itself, because the derivative is what flows backward through the network.

Optimization (CS)

The computer-science angle on finding minima — convexity, convergence, and algorithms beyond plain gradient descent. Leans heavily on calculus and linear algebra.

Further reading

  • Michael Spivak — Calculus. The classic rigorous introduction. Proof-based, uncompromising, and beautiful; more a real-analysis textbook than a calculus one, but worth the effort if you want to understand why everything is true.
  • James Stewart — Calculus: Early Transcendentals. The standard undergraduate textbook. Not glamorous, but thorough, and every concept has worked examples.
  • Grant Sanderson — Essence of Calculus (YouTube series by 3Blue1Brown, 3blue1brown.com/topics/calculus). Visually stunning reintroduction that builds intuition before formalism. If you're a visual learner, start here.
  • Khan Academy — Calculus 1 and 2 (khanacademy.org/math/calculus-1). Thousands of worked problems, free, with step-by-step video explanations. Best place to practice.
  • Wikipedia — Fundamental Theorem of Calculus. A surprisingly good article with historical context and multiple proofs of both parts.
  • Walter Rudin — Principles of Mathematical Analysis. The graduate-level rigor course. Don't read this first. Read it when you're ready to see calculus as one chapter of a larger theory called real analysis.
NEXT UP
→ Multivariable Calculus

One variable gets you tangent lines and areas. Many variables get you gradient vectors, tangent planes, and volumes — and the machinery machine learning actually runs on. Same ideas, more indices.