Linear Algebra

The math of arrows, grids, and lists of numbers — and the single most important subject for anyone who wants to understand modern machine learning. By the end of this page you should be able to read a matrix expression, know what it does geometrically, and see why every line of a neural network is quietly doing linear algebra.

Prereq: basic algebra (no calculus needed) · Read time: ~40 min · Interactive figures: 5 · Code: NumPy

1. Why linear algebra

Here is the headline: linear algebra is the language every modern ML model, every 3D graphics pipeline, and every signal processing system is written in. When someone says "GPT-4 has 1.7 trillion parameters," they mean that a collection of matrices has 1.7 trillion numbers in it. When someone says "the embedding of dog," they mean a vector. When someone says "attention," they mean dot products. When someone says "a forward pass of a neural network," they mean a sequence of matrix multiplications with some non-linear squashing in between.

If you only remember three sentences from this page, remember these:

If you already know what a vector is, skim sections 3–6 and dive into eigenvalues and SVD. If you've never seen any of this — or haven't seen it since school — read straight through. Nothing here assumes calculus; we'll point to calculus.html only where we mention derivatives, and you can ignore those footnotes on a first pass.

The five things that break people's brains the first time, so you know to slow down:

  1. Matrix multiplication is not commutative. $AB \ne BA$ in general. This is not a typo or a convention — it reflects the fact that rotating then scaling is different from scaling then rotating.
  2. Matrix multiplication looks like row · column, but the real meaning is linear combinations of columns.
  3. An "eigenvector" is just "a direction the matrix doesn't rotate, only stretches." That's it. The Greek is scarier than the concept.
  4. The determinant is the amount by which a matrix scales volume. Sign = orientation. That's it.
  5. SVD says every matrix, no matter how ugly, is secretly a rotation, then a scaling along axes, then another rotation.

2. Vocabulary cheat-sheet

The notation is ugly and inconsistent across textbooks. Here is the set of conventions this page uses, and which we recommend you steal.

$\mathbf{v}$ or $\vec v$
A vector (lowercase, bold or with an arrow). Example: $\mathbf{v} = (3, -1, 4)$.
$v_i$
The $i$-th component (scalar) of vector $\mathbf{v}$. Example: $v_2 = -1$.
$A, B, M$
A matrix (uppercase). Example: $A = \begin{pmatrix}1&2\\3&4\end{pmatrix}$.
$A_{ij}$
Entry in row $i$, column $j$ of $A$. Example: $A_{12} = 2$.
$A^T$
Transpose — rows and columns swapped. Example: $(A^T)_{ij} = A_{ji}$.
$A^{-1}$
Inverse — the matrix that undoes $A$. Example: $A A^{-1} = I$.
$I$ or $I_n$
Identity matrix (1s on the diagonal, 0 elsewhere). Example: $I_2 = \begin{pmatrix}1&0\\0&1\end{pmatrix}$.
$\|\mathbf{v}\|$
Length (norm) of $\mathbf{v}$. Example: $\|(3,4)\| = 5$.
$\mathbf{a} \cdot \mathbf{b}$
Dot product — a scalar. Example: $(1,2)\cdot(3,4) = 11$.
$\mathbb{R}^n$
The set of $n$-dimensional real vectors. Example: $(1.5, 2, -3) \in \mathbb{R}^3$.
$\det(A)$ or $|A|$
Determinant of $A$. Example: $\det(I_2) = 1$.
$\lambda$
An eigenvalue (lowercase Greek lambda). Example: $A\mathbf{v} = \lambda \mathbf{v}$.

If a symbol appears on this page and you can't find it in this table, we'll define it inline where it appears. If we don't, report the bug.

3. Vectors

A vector is two things that happen to be the same thing: a list of numbers, and an arrow from the origin.

When you do linear algebra, you want both pictures in your head at once. The list is how you compute; the arrow is how you reason.

A 2D vector $\mathbf{v} = (v_1, v_2)$ is an arrow from $(0,0)$ to the point $(v_1, v_2)$ in the plane. A 3D vector $(v_1, v_2, v_3)$ is an arrow in space. A 768-dimensional vector — the default size of a GPT embedding — is an arrow in a space you cannot visualize, but that still obeys the same rules. That's the whole trick: once you know how 2D works, the algebra carries you up into any dimension for free.

Vector addition

You add vectors component-by-component:

$$\mathbf{a} + \mathbf{b} = (a_1 + b_1,\ a_2 + b_2,\ \dots,\ a_n + b_n)$$

Vector addition

$\mathbf{a}, \mathbf{b}$
Two vectors of the same length $n$. You can't add a 3-vector to a 4-vector.
$a_i, b_i$
The $i$-th components (plain numbers) of $\mathbf{a}$ and $\mathbf{b}$.
$(\dots)$
The result is a new vector whose $i$-th component is $a_i + b_i$.

Geometry. Put the tail of $\mathbf{b}$ at the tip of $\mathbf{a}$. The arrow from the origin to the new tip is $\mathbf{a} + \mathbf{b}$. Equivalently: addition is the diagonal of the parallelogram they span. This is called the tip-to-tail rule and it's the geometric content of the component-wise formula above.

Drag the tips of a and b to see the tip-to-tail construction and the sum a+b. The dashed arrow shows b placed at the tip of a; the parallelogram diagonal (solid yellow) is the sum.

Scalar multiplication

Multiplying a vector by a plain number (a scalar) stretches or squashes it:

$$c \cdot \mathbf{v} = (c v_1,\ c v_2,\ \dots,\ c v_n)$$

Scalar multiplication

$c$
A scalar (a single real number). Could be $2$, $-3$, or $0.5$.
$\mathbf{v}$
Any vector.
$c v_i$
Ordinary number multiplication applied component-by-component.

Geometry. $c > 1$ stretches the arrow; $0 < c < 1$ shrinks it; $c < 0$ flips it through the origin. The direction is preserved (or reversed), the length is multiplied by $|c|$.


Drag the slider to scale vector v. When c > 0 the scaled vector cv points the same direction; when c < 0 it flips. The original v stays fixed for reference.

These two operations — addition and scalar multiplication — are the only operations that define what "linear" means. Everything else in this subject is built out of them.
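In NumPy both operations are component-wise by default, so the formulas above translate one-to-one. A minimal sketch:

```python
import numpy as np

# The two defining operations of "linear": component-wise addition
# and scalar multiplication. NumPy arrays do both natively.
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

s = a + b      # (1 + 3, 2 + (-1)) = (4, 1): tip-to-tail, numerically
c = 2.5 * a    # (2.5, 5.0): same direction, 2.5x the length
```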

Norms: how long is a vector?

A norm is any reasonable way to measure the length of a vector. There are several, and they show up in different ML contexts.

The default, and the one you already know from Pythagoras, is the Euclidean norm or $\ell_2$ norm:

$$\|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2} = \sqrt{\sum_{i=1}^n v_i^2}$$

The Euclidean norm

$\|\mathbf{v}\|_2$
The "length" of $\mathbf{v}$ that matches ordinary geometric distance.
$v_i$
The $i$-th component of $\mathbf{v}$.
$\sum_{i=1}^n v_i^2$
Sum of the squares of every component. The "$\sum$" is a summation symbol — it means "add up everything inside for $i$ from 1 to $n$".
$\sqrt{\cdot}$
Square root. Without it you'd have "squared length," which is sometimes more convenient to compute but is not an actual length.

Why the subscript 2? Because you sum the components raised to the 2nd power. That hints at a family of norms.

The $\ell_1$ norm adds absolute values instead:

$$\|\mathbf{v}\|_1 = |v_1| + |v_2| + \dots + |v_n| = \sum_{i=1}^n |v_i|$$

The $\ell_1$ norm (aka "taxicab" or "Manhattan" distance)

$\|\mathbf{v}\|_1$
Sum of the absolute values of the components.
$|v_i|$
Absolute value — strip the sign.

Why "Manhattan"? In a city grid you can't cut diagonally across buildings. The distance you actually walk from $(0,0)$ to $(3,4)$ is $3 + 4 = 7$ blocks, not $\sqrt{9+16}=5$. The $\ell_1$ norm shows up in L1-regularization (which encourages sparse weights) and in LASSO regression.

And the $\ell_\infty$ norm takes the max:

$$\|\mathbf{v}\|_\infty = \max_i |v_i|$$

The $\ell_\infty$ (max) norm

$\|\mathbf{v}\|_\infty$
The largest absolute value among the components. Ignores the others.
$\max_i$
"Maximum over $i$" — scan all indices and keep the biggest.

Why the $\infty$? It's the limit of $\ell_p = (\sum |v_i|^p)^{1/p}$ as $p \to \infty$ — the largest term dominates. Shows up in robustness bounds and adversarial examples ("how big is the worst pixel change").

All three answer the question "how big is this vector?" but give different answers. Pick the one that fits your problem. For most things in ML, $\ell_2$ is the default.
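All three norms are one call in NumPy — `np.linalg.norm` takes an `ord` argument selecting the family member:

```python
import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)                 # default: Euclidean, sqrt(9 + 16) = 5
l1 = np.linalg.norm(v, ord=1)          # |3| + |-4| = 7 ("Manhattan")
linf = np.linalg.norm(v, ord=np.inf)   # max(|3|, |-4|) = 4
```

Same vector, three different answers to "how big?" — exactly the point made above.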

4. Dot products

The dot product is the single most important operation in this subject. Learn it and you've learned 40% of linear algebra.

There are two definitions. They're equal. The trick is that each definition tells you something different about why it matters.

The algebraic definition

$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \dots + a_n b_n = \sum_{i=1}^n a_i b_i$$

Dot product (algebraic)

$\mathbf{a}, \mathbf{b}$
Two vectors of the same length $n$. Again, they must match in length.
$a_i b_i$
Ordinary number multiplication of the $i$-th components.
$\sum$
Add them all up. The result is a single number, not a vector.

What the result means. A scalar that measures how "aligned" the two vectors are. Big positive → same direction. Zero → perpendicular. Big negative → opposite directions.

The geometric definition

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\,\cos\theta$$

Dot product (geometric)

$\|\mathbf{a}\|, \|\mathbf{b}\|$
Euclidean lengths of the two vectors.
$\theta$
The angle between them, measured in radians.
$\cos\theta$
Cosine of the angle. Ranges from $1$ (same direction) to $0$ (perpendicular) to $-1$ (opposite).

Read it as: "how long are they" × "how parallel are they." Two long vectors that point in roughly the same direction have a huge dot product. Two long vectors pointing perpendicularly have zero, because there's no alignment.

The fact that these two formulas are equal is not obvious. It follows from the law of cosines, and it's essentially the entire reason linear algebra works: the thing you can compute (the sum) equals the thing that has geometric meaning (the angle × length).

Rearranging:

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$

Cosine similarity

LHS $\cos\theta$
The cosine of the angle between the two vectors.
$\mathbf{a}\cdot\mathbf{b}$
Dot product (algebraic formula).
$\|\mathbf{a}\|\|\mathbf{b}\|$
Product of the two lengths — this normalizes out magnitude so you only compare direction.

Where you've seen it. Embedding search — "find me the documents whose meaning is closest to this query" — is exactly this formula, over millions of vectors. Return the items with the highest cosine similarity. "Semantic search" in one line.
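The formula is a one-liner; a minimal sketch (the helper name `cosine_similarity` is ours, not a library function):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||) -- magnitude normalized out,
    # so only direction is compared.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
sim = cosine_similarity(a, b)   # 45 degrees apart -> sqrt(2)/2
```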

Projections

If $\mathbf{u}$ is a unit vector (length 1), the dot product $\mathbf{a}\cdot\mathbf{u}$ is the length of the shadow $\mathbf{a}$ casts onto the line through $\mathbf{u}$. This is the projection of $\mathbf{a}$ onto $\mathbf{u}$:

$$\text{proj}_{\mathbf{u}}(\mathbf{a}) = (\mathbf{a}\cdot\mathbf{u})\,\mathbf{u}$$

Projection onto a unit vector

$\mathbf{u}$
A unit vector — a direction with length 1.
$\mathbf{a}\cdot\mathbf{u}$
A scalar — how far along the $\mathbf{u}$ direction $\mathbf{a}$ reaches.
$(\mathbf{a}\cdot\mathbf{u})\mathbf{u}$
Scalar times direction = a new vector, pointing along $\mathbf{u}$, with the right length.

Picture. Shine a light straight down onto the $\mathbf{u}$-axis. The shadow of $\mathbf{a}$ is the projection. Everything else — Fourier analysis, least squares, attention, PCA — is projection in fancier clothes.
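The "shadow" picture in code — a sketch, with a hypothetical helper `project` that normalizes its direction argument so the unit-vector formula above applies:

```python
import numpy as np

def project(a, u):
    u = u / np.linalg.norm(u)   # make u a unit vector
    return (a @ u) * u          # scalar shadow-length times direction

a = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])        # project onto the x-axis

p = project(a, u)               # the shadow: (3, 0)
p2 = project(p, u)              # projecting again changes nothing
```

Note that `p2 == p`: once you're on the line, the shadow of the shadow is itself. That idempotence reappears as $P^2 = P$ for projection matrices in section 8.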

Drag the tips of a and b. The arc shows the angle θ between them. The dashed line is the projection of a onto b. The dot product value and cosine similarity update live.

Why does attention appear here? In a transformer, the "score" between a query and a key is literally $\mathbf{q}\cdot\mathbf{k}$. High dot product → "these two tokens are relevant to each other" → high attention weight. That's the whole idea. The softmax is window dressing.

5. Matrices

A matrix is a rectangular grid of numbers. If it has $m$ rows and $n$ columns we say it is $m \times n$ and write $A \in \mathbb{R}^{m\times n}$. The entry in row $i$, column $j$ is written $A_{ij}$.

$$A = \begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \end{pmatrix} \in \mathbb{R}^{2\times 3}$$

Matrix notation

$A$
The whole matrix. Capital letter.
$A_{ij}$
The scalar in row $i$, column $j$. Rows first, columns second — this convention is universal and you will never stop fighting it.
$\mathbb{R}^{2\times 3}$
"Real-valued matrices with 2 rows and 3 columns." A set.

Numpy hint. A[i, j] is $A_{ij}$, except NumPy counts from 0 so the first entry is $A_{11}$ in math but A[0, 0] in code. Off-by-ones are forever.

Matrix as function: a vector-transforming machine

Here is the point of view that turns all of this into something meaningful, rather than a spreadsheet exercise:

A matrix is a function. Specifically, an $m \times n$ matrix $A$ is a function that takes an $n$-vector as input and produces an $m$-vector as output. You invoke it by writing $A\mathbf{x}$.

And not just any function — a linear function. "Linear" means two properties hold:

$$A(\mathbf{x} + \mathbf{y}) = A\mathbf{x} + A\mathbf{y}, \qquad A(c\mathbf{x}) = c\,A\mathbf{x}$$

Linearity

$A$
A matrix (viewed as a function).
$\mathbf{x}, \mathbf{y}$
Any two input vectors of the right size.
$c$
Any scalar.
First equation
"$A$ of a sum is the sum of the $A$s." Additive.
Second equation
"Scaling the input scales the output by the same amount." Homogeneous.

Geometric consequence. A linear function sends straight lines to straight lines and keeps the origin fixed. Grids stay grids. No curves, no shifts. That restriction is what makes linear algebra so powerful: everything is determined by what the function does to just a few basis vectors.

How do you know a matrix when it's given to you as a function? Apply it to the standard basis vectors and read off the answers:

$$A = \begin{pmatrix} | & | & & | \\ A\mathbf{e}_1 & A\mathbf{e}_2 & \cdots & A\mathbf{e}_n \\ | & | & & | \end{pmatrix}$$

Columns of a matrix

$\mathbf{e}_i$
The $i$-th standard basis vector — all zeros except a 1 in position $i$.
$A\mathbf{e}_i$
"What $A$ does to the unit vector pointing along the $i$-th axis." A column vector.
The block notation
Stacks those output vectors side-by-side as columns.

Crucial takeaway. The columns of $A$ are the images of the standard basis vectors under the function $A$. If you want to hand-write a matrix that rotates 2D space by 90°, you just ask "where does $\hat x$ go?" (answer: $(0,1)$) and "where does $\hat y$ go?" (answer: $(-1,0)$) and stack them as columns. Done.
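The 90° rotation built exactly as described — stack the images of the basis vectors as columns, then check them:

```python
import numpy as np

# Where do the basis vectors go under a 90-degree rotation?
#   x-hat = (1,0) -> (0,1)   and   y-hat = (0,1) -> (-1,0)
# Stack those images as COLUMNS and you have the matrix.
R = np.column_stack([(0.0, 1.0), (-1.0, 0.0)])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])
```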

6. Matrix multiplication

Matrix multiplication is the operation people get most wrong most often. We'll show two pictures, both equally valid. Use whichever makes more sense at the time.

The row · column picture

If $A$ is $m \times k$ and $B$ is $k \times n$, the product $C = AB$ is $m \times n$, with entries:

$$C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$$

Matrix product, entry by entry

$C_{ij}$
Entry of $C$ in row $i$, column $j$.
$A_{ip}$
Entries of the $i$-th row of $A$.
$B_{pj}$
Entries of the $j$-th column of $B$.
$\sum_{p=1}^k$
Sum over the shared dimension. The inner dimensions must match: $A$'s column count = $B$'s row count = $k$.

Mnemonic. Entry $(i,j)$ of $AB$ is the dot product of the $i$-th row of $A$ with the $j$-th column of $B$. "Row × column." This is the recipe you computed in school. It's correct, but it doesn't tell you why.

The column picture (this is the one to remember)

Here is the much more useful way to see it. If $B$ has columns $\mathbf{b}_1, \dots, \mathbf{b}_n$, then:

$$AB = A\,[\mathbf{b}_1\ \mathbf{b}_2\ \dots\ \mathbf{b}_n] = [A\mathbf{b}_1\ A\mathbf{b}_2\ \dots\ A\mathbf{b}_n]$$

Columns of $AB$ are $A$ applied to columns of $B$

$\mathbf{b}_j$
The $j$-th column of $B$, viewed as a vector.
$A\mathbf{b}_j$
$A$ applied to that vector — another vector.
$[\cdot\ \cdot\ \cdot]$
Stacking the results side-by-side as the columns of a new matrix.

Why this is the "right" picture. Matrix multiplication is function composition. $(AB)\mathbf{x} = A(B\mathbf{x})$ — first apply $B$ to $\mathbf{x}$, then apply $A$ to the result. The $j$-th column of the product tells you what the composed transformation does to the $j$-th basis vector. This is exactly why it isn't commutative: "rotate then scale" and "scale then rotate" are different composed functions.
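Both claims — columns of $AB$ are $A$ applied to columns of $B$, and $(AB)\mathbf{x} = A(B\mathbf{x})$ — are easy to check numerically on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
x = rng.standard_normal(2)

AB = A @ B

# Column picture: column j of AB is A applied to column j of B.
col0 = A @ B[:, 0]

# Composition: multiplying first then applying equals applying B then A.
y1 = (A @ B) @ x
y2 = A @ (B @ x)
```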

A worked example

Let's compute a small one by hand. Take:

$$A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix},\qquad B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}$$

Setup

$A$
A 2×2 matrix. Rows $(1,2)$ and $(3,4)$.
$B$
A 2×2 matrix. Rows $(5,6)$ and $(7,8)$.

We'll compute $AB$ both ways and confirm the answers agree.

Row × column. Entry $(1,1)$ is row $(1,2)$ dotted with column $(5,7)$: $1\cdot 5 + 2\cdot 7 = 19$. Entry $(1,2)$: $1\cdot 6 + 2\cdot 8 = 22$. Entry $(2,1)$: $3\cdot 5 + 4\cdot 7 = 43$. Entry $(2,2)$: $3\cdot 6 + 4\cdot 8 = 50$. So:

$$AB = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$

Result of $AB$

$19, 22, 43, 50$
The four dot products computed above.

Now let's verify with the column picture. The first column of $B$ is $(5,7)$. $A\cdot(5,7) = (1\cdot 5 + 2\cdot 7,\ 3\cdot 5 + 4\cdot 7) = (19, 43)$. That's the first column of the answer. The second column of $B$ is $(6,8)$. $A\cdot(6,8) = (1\cdot 6 + 2\cdot 8,\ 3\cdot 6 + 4\cdot 8) = (22, 50)$. That's the second column. Agrees.

Now compute $BA$ and watch it come out different:

$$BA = \begin{pmatrix} 23 & 34 \\ 31 & 46 \end{pmatrix} \ne AB$$

Non-commutativity in the flesh

$BA$
Same matrices, multiplied in the other order.

Non-commutativity is a feature, not a bug. It reflects the fact that the order in which you compose transformations matters. Rotate 90°, then reflect across the x-axis, is different from reflect first then rotate. The math tracks reality.
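The worked example above, verified in one screenful:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

AB = A @ B   # [[19, 22], [43, 50]]
BA = B @ A   # [[23, 34], [31, 46]] -- same matrices, different answer
```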

What does hold (associativity and distributivity)

Matrix multiplication is associative and distributive, but not commutative:

$$(AB)C = A(BC), \quad A(B+C) = AB + AC, \quad AB \ne BA \text{ in general}$$

Algebraic properties

Associativity
You can re-group parentheses in a chain of multiplications. This is crucial in practice because it lets you choose the cheapest order to compute big products.
Distributivity
You can factor a matrix out of a sum. Normal arithmetic rules apply, except for commutativity.
No commutativity
Order matters. Write it right.

Associativity is how frameworks like PyTorch save compute: a chain like $ABC\mathbf{x}$ of sizes $(n\times n)(n\times n)(n\times n)(n \times 1)$ should be computed right-to-left as $A(B(C\mathbf{x}))$, which is three matrix-vector products ($O(n^2)$ each) instead of two matrix-matrix products ($O(n^3)$ each).
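A sketch of that cost argument: both groupings give the same vector (associativity), but the right-to-left order never forms a full matrix-matrix product.

```python
import numpy as np

n = 200
rng = np.random.default_rng(1)
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
x = rng.standard_normal(n)

# Left-to-right: two O(n^3) matrix-matrix products, then a matrix-vector.
left = (A @ B @ C) @ x

# Right-to-left: three O(n^2) matrix-vector products. Same answer, far cheaper.
right = A @ (B @ (C @ x))
```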

7. Interactive: 2×2 transformations

Time to build intuition with your hands. Below is a 2D playground. Adjust the four entries of a 2×2 matrix $M$ with the sliders. The cyan outline is the original unit square with its basis vectors drawn; the violet outline is the image of the unit square under $M$ — what it becomes after you apply the transformation. You'll see the columns of $M$ become the new positions of $\hat x$ and $\hat y$. That is not a coincidence; that's section 5.

Watch for: pure scaling (change only the diagonal entries), shear (change one off-diagonal entry), reflection (the square flips over), and collapse to a line (make the two columns parallel).


The cyan square is the original unit square. The violet shape is its image under $M$. The two violet arrows are $M$'s columns — the new positions of the $\hat x$ and $\hat y$ basis vectors.

8. Special matrices

Most matrices you meet in the wild are one of a handful of types. Naming them matters, because the algorithms that operate on them are specialized to the type.

Identity matrix $I$

The matrix that does nothing: $I\mathbf{x} = \mathbf{x}$ for any $\mathbf{x}$. It has 1s on the diagonal and 0s everywhere else. The multiplicative identity: $AI = IA = A$.

$$I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Identity matrix

$I_n$
The $n\times n$ identity — a square matrix with 1s down the diagonal, 0s off the diagonal.
$I\mathbf{x}=\mathbf{x}$
It leaves every vector alone.

Role. Plays the role of "1" for matrix multiplication. Is never not around.

Zero matrix $\mathbf{0}$

All entries 0. Sends every vector to the zero vector. The additive identity.

Diagonal matrix

Only the diagonal entries can be non-zero. A diagonal matrix $D$ scales each coordinate independently:

$$D = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix}, \quad D\mathbf{x} = (d_1 x_1,\ d_2 x_2,\ d_3 x_3)$$

Diagonal matrix

$d_i$
Scalars on the diagonal. Everything else is 0.
$D\mathbf{x}$
Component-wise scaling of $\mathbf{x}$.

Why you care. Diagonal matrices are trivial to invert ($d_i \to 1/d_i$), trivial to raise to powers, and their determinant is just the product of the diagonal. Most of linear algebra is spent trying to turn non-diagonal matrices into diagonal ones via changes of basis.

Triangular matrix

All entries above (or all below) the diagonal are 0. Upper triangular looks like:

$$U = \begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}$$

Upper triangular matrix

$u_{ij}$ with $i\le j$
Free — any value.
$u_{ij}$ with $i > j$
Forced to 0.

Why you care. Triangular systems can be solved by back-substitution in $O(n^2)$ time instead of $O(n^3)$. Gaussian elimination exists to turn an arbitrary matrix into a triangular one.
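Back-substitution is short enough to write out — a sketch, solving the last equation first and substituting upward (the helper name `back_substitute` is ours):

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U in O(n^2)."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        # Everything to the right of x[i] is already known; subtract it off.
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

U = np.array([[2.0, 1.0, 1.0],
              [0.0, 3.0, 2.0],
              [0.0, 0.0, 4.0]])
b = np.array([5.0, 8.0, 8.0])
x = back_substitute(U, b)
```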

Symmetric matrix

A matrix is symmetric if $A = A^T$, i.e. $A_{ij} = A_{ji}$.

$$A = \begin{pmatrix} 2 & -1 & 3 \\ -1 & 5 & 0 \\ 3 & 0 & 4 \end{pmatrix} = A^T$$

Symmetric matrix

$A = A^T$
The matrix equals its own transpose.

Where they come from. Covariance matrices, Gram matrices $X^T X$, Hessians of smooth functions. Symmetric matrices have a spectacular property: their eigenvectors are orthogonal, and the spectral theorem guarantees you can diagonalize them with a rotation. This is what PCA exploits.

Orthogonal matrix

A square matrix $Q$ is orthogonal if $Q^T Q = I$, i.e. $Q^T = Q^{-1}$. Equivalently, its columns are mutually perpendicular unit vectors.

$$Q^T Q = I \iff Q^{-1} = Q^T$$

Orthogonal matrix

$Q$
A square matrix.
$Q^T$
Its transpose.
$I$
The identity.
$\iff$
"If and only if" — the two statements are equivalent.

Geometry. Orthogonal matrices are exactly the rigid motions of space that fix the origin: rotations and reflections. They preserve lengths ($\|Q\mathbf{x}\| = \|\mathbf{x}\|$) and angles. If your transformation is a rotation, you want an orthogonal matrix. Their inverse costs nothing — just transpose. This is a big deal numerically.
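Both claims — $Q^T Q = I$ and length preservation — checked on a rotation matrix:

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rotation, hence orthogonal

x = np.array([3.0, 4.0])

QtQ = Q.T @ Q                 # should be the identity: the inverse is free
len_before = np.linalg.norm(x)
len_after = np.linalg.norm(Q @ x)   # rigid motion: length unchanged
```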

Rotation matrix

In 2D, the matrix that rotates by angle $\theta$ counterclockwise is:

$$R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

2D rotation matrix

$\theta$
The angle of rotation, in radians, counterclockwise.
$\cos, \sin$
Trigonometric functions.
Columns
First column is where $\hat x$ goes: $(\cos\theta,\sin\theta)$. Second column is where $\hat y$ goes: $(-\sin\theta,\cos\theta)$. This is the "columns are images of basis vectors" principle in action.

Sanity check. When $\theta = 0$, $R(0) = I$ — no rotation, matches. When $\theta = \pi/2$ (90°), $R = \begin{pmatrix}0&-1\\1&0\end{pmatrix}$, which rotates $(1,0)$ into $(0,1)$. Right.

Permutation matrix

An identity matrix with its rows rearranged. When you multiply $P\mathbf{x}$, you get $\mathbf{x}$ with its components permuted. Permutation matrices are orthogonal.

Projection matrix

A square matrix $P$ with $P^2 = P$. Applied twice, it does the same thing as once — because after the first application you're already in the target subspace. The "shine a light onto this subspace" operator.

$$P^2 = P$$

Projection (idempotent) matrix

$P$
A square matrix.
$P^2$
$P$ multiplied by itself (matrix mult).
$P^2 = P$
Applying $P$ twice is the same as once — idempotent.

Canonical example. The projection onto the $x$-axis in 2D is $\begin{pmatrix}1&0\\0&0\end{pmatrix}$. It flattens everything to the horizontal line. Once you're on the line, projecting again does nothing. Least squares is a projection onto the column space of $X$; attention, viewed from the right angle, is a weighted projection too.
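The canonical example, with idempotence checked numerically:

```python
import numpy as np

# Projection onto the x-axis: flattens everything to the horizontal line.
P = np.array([[1.0, 0.0],
              [0.0, 0.0]])

v = np.array([3.0, 4.0])

Pv = P @ v        # (3, 0): the shadow
PP = P @ P        # equals P: projecting twice is the same as once
```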

9. Linear systems $A\mathbf{x}=\mathbf{b}$

The original motivation for linear algebra, and still the most common practical problem: you're given a matrix $A$ and a vector $\mathbf{b}$, and you want to find the vector $\mathbf{x}$ such that $A\mathbf{x} = \mathbf{b}$.

$$A\mathbf{x} = \mathbf{b}$$

The fundamental equation of linear algebra

$A$
A known $m \times n$ matrix.
$\mathbf{x}$
The unknown $n$-vector you're solving for.
$\mathbf{b}$
A known $m$-vector — the "target."

What it encodes. A system of $m$ linear equations in $n$ unknowns. Row $i$ says "this linear combination of the $x$'s equals $b_i$." Linear algebra is the apparatus for solving this problem robustly and at scale.

Three outcomes

There are exactly three things that can happen when you try to solve $A\mathbf{x} = \mathbf{b}$:

  1. Unique solution. There's exactly one $\mathbf{x}$. This is the best case. It happens when $A$ is square and "full rank" (see §11).
  2. No solution. The equation is inconsistent. Geometrically: $\mathbf{b}$ is not in the span of $A$'s columns — there's no way to combine the columns to hit it. Think "three equations, two unknowns, incompatible."
  3. Infinitely many solutions. There's at least one $\mathbf{x}$, but you can slide it along some direction and stay a solution. This happens when $A$ has fewer effective constraints than unknowns.

Geometrically, each row of $A\mathbf{x}=\mathbf{b}$ is a constraint ("stay on this hyperplane"). The solution set is the intersection of all the constraints. Three planes in 3D generically meet at a point; if two are parallel and different you get no solution; if all three coincide you get a whole plane of solutions.

Gaussian elimination

The canonical algorithm for solving $A\mathbf{x}=\mathbf{b}$, going back 2000+ years in China and named after Gauss in the 1800s. The idea is boring and unstoppable: add multiples of one equation to another until the system is upper triangular, then back-substitute.

Three operations are allowed (they don't change the solution set):

  1. Swap two equations.
  2. Multiply an equation by a non-zero scalar.
  3. Add a multiple of one equation to another.

The resulting form is called row echelon form. If you also clean up the upper part to get 1s on the pivots and 0s above them, it's reduced row echelon form (RREF). The column of a pivot tells you which variables are "leading"; the rest are free parameters.

A note on what people actually use. No one solves medium-to-large linear systems with raw Gaussian elimination. They use LU decomposition (essentially the same algorithm, stored more cleverly), or QR for least squares, or iterative methods (conjugate gradient, GMRES) for truly huge sparse systems. That's the subject of numerical analysis. Gaussian elimination is still how you understand what's going on.
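In practice that means one library call. `np.linalg.solve` uses an LU factorization under the hood (via LAPACK) — Gaussian elimination, stored cleverly:

```python
import numpy as np

# Solve the system  2x + y = 3,  x + 3y = 5.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)   # (0.8, 1.4)
```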

10. Vector spaces

We've been throwing around the word "vector" informally. Here's the formal version. A vector space is a set $V$ with two operations — addition and scalar multiplication — that obey the eight rules you'd expect (associativity, commutativity of addition, distributivity, existence of a zero element, etc.). For our purposes $V = \mathbb{R}^n$ is the only example that matters.

Span

The span of a set of vectors is the set of all their linear combinations:

$$\text{span}\{\mathbf{v}_1, \dots, \mathbf{v}_k\} = \{c_1 \mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k \mathbf{v}_k \ :\ c_i \in \mathbb{R}\}$$

Span

$\mathbf{v}_1, \dots, \mathbf{v}_k$
Some fixed set of vectors — our "ingredients."
$c_i$
Arbitrary real scalars ("how much of ingredient $i$").
$c_1 \mathbf{v}_1 + \dots + c_k\mathbf{v}_k$
A linear combination — a weighted sum. This is the only thing you can do with vectors in linear algebra.
$\{\ \cdot\ :\ \cdot\ \}$
Set-builder notation. Read as "the set of things of the form ___ such that ___ ."

Picture. The span of a single non-zero vector is a line through the origin. The span of two non-parallel vectors in the plane is the entire plane. The span of two parallel vectors is a line — the second one adds nothing. Span is "the reachable set."

Two vectors v₁ and v₂ (drag their tips). Their span is shown as a shaded region. The yellow arrow is the linear combination c₁v₁ + c₂v₂ — adjust the sliders to reach any point in the span. When v₁ and v₂ are parallel the span collapses to a line.

Linear independence

A set of vectors $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ is linearly independent if the only way to get the zero vector as a combination is to use all zero coefficients:

$$c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \dots + c_k\mathbf{v}_k = \mathbf{0} \implies c_1 = c_2 = \dots = c_k = 0$$

Linear independence

$c_i$
Scalars.
$\mathbf{0}$
The zero vector.
$\implies$
"Implies." If the left side holds, then the right side must also hold.

Informal. The vectors are independent if none of them can be built from the others. Each one adds a genuinely new direction. Three vectors in 3D are independent exactly when they don't all lie in the same plane.

Basis and dimension

A basis of a vector space is a linearly independent set that spans the whole space. In $\mathbb{R}^n$, the "standard basis" is the set of unit vectors along each axis: $\mathbf{e}_1 = (1,0,\dots,0)$, $\mathbf{e}_2 = (0,1,\dots,0)$, and so on. Any vector in $\mathbb{R}^n$ is a unique linear combination of these.

The dimension of a vector space is the number of vectors in any basis. For $\mathbb{R}^n$ that's $n$. Different bases can have different vectors — they just always have the same count.

Why care? Because the whole point of choosing a basis is that expressing a vector or a matrix in the right basis can make a hard problem trivial. That's the entire content of diagonalization: find a basis in which $A$ looks like a diagonal matrix.

11. Rank and nullspace

Given a matrix $A$, there are four natural subspaces associated with it. They are called the four fundamental subspaces (Gil Strang's phrase) and they tell you everything about what $A$ does.

The rank of $A$ is the dimension of its column space — equivalently, the number of linearly independent columns. Equivalently again, the number of linearly independent rows. That these are all equal is already a small miracle.

Rank–nullity theorem

$$\text{rank}(A) + \dim(\text{Null}(A)) = n$$

Rank–nullity theorem

$A$
An $m \times n$ matrix.
$\text{rank}(A)$
Dimension of the column space of $A$.
$\dim(\text{Null}(A))$
Dimension of the null space — the "nullity."
$n$
Number of columns, i.e. input dimension.

Conservation law. The $n$ input directions are split into two disjoint kinds: "directions the matrix sees" (rank) and "directions the matrix kills" (nullity). Every direction is one or the other; there are exactly $n$ of them. This is the cleanest conservation law in linear algebra.

A square $n\times n$ matrix is full rank (rank $n$) if and only if its null space is just $\{\mathbf{0}\}$, if and only if it is invertible, if and only if its determinant is non-zero, if and only if $A\mathbf{x}=\mathbf{b}$ has a unique solution for every $\mathbf{b}$. These are all the same statement.
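You can watch that chain of equivalences fail on a rank-deficient matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # second column = 2x the first: rank 1, singular

B = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # independent columns: full rank, invertible

rank_A = np.linalg.matrix_rank(A)   # 1 -- one direction is "killed"
rank_B = np.linalg.matrix_rank(B)   # 2 -- full rank
det_A = np.linalg.det(A)            # 0 (up to floating point)
det_B = np.linalg.det(B)            # -2: non-zero, so invertible
```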

12. Determinants

The determinant $\det(A)$ of a square matrix is a single number that says "by how much does $A$ scale oriented volume?" That's it. That's the whole story.

For a 2×2 matrix:

$$\det\begin{pmatrix} a & b \\ c & d \end{pmatrix} = ad - bc$$

2×2 determinant

$a, b, c, d$
The four entries of the matrix.
$ad - bc$
A single number. Can be positive, negative, or zero.

Geometry. The unit square has area 1. After you apply the matrix to it, the image is a parallelogram (section 7 showed you this interactively). Its signed area is $ad - bc$. Sign indicates orientation: positive means "still right-handed," negative means the matrix flipped the plane. Zero means "the square collapsed to a line or a point" — the columns of the matrix are linearly dependent, so rank < 2.

For a 3×3 matrix, the determinant is the signed volume of the parallelepiped spanned by the columns. For an $n\times n$ matrix, it's the signed $n$-dimensional volume of the $n$-parallelepiped spanned by the columns. Every textbook defines it differently — by cofactor expansion, by permutations, by LU decomposition — but they all compute the same number, and the number always means this.
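The $ad - bc$ formula against the library, on one small example:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

# ad - bc = 3*2 - 1*1 = 5: the unit square maps to a parallelogram
# of area 5, orientation preserved (positive sign).
d = np.linalg.det(A)
```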

Key properties

  1. $\det(AB) = \det(A)\det(B)$: composing transformations multiplies their volume scalings.
  2. $\det(A^T) = \det(A)$, and $\det(A^{-1}) = 1/\det(A)$ when the inverse exists.
  3. $\det(A) = 0$ exactly when $A$ is singular: the columns are linearly dependent and volume collapses.
  4. For triangular and diagonal matrices, the determinant is the product of the diagonal entries.

If you see $\det(A) = 0$ anywhere in ML, something collapsed. Usually it's a signal to reach for a pseudoinverse or add regularization.

13. Eigenvalues and eigenvectors

Pick any square matrix $A$. Most vectors, when you apply $A$, come out pointing in a different direction. But for some special vectors, $A$ just stretches them — it doesn't rotate them. Those special directions are eigenvectors, and the stretch factors are eigenvalues.

$$A\mathbf{v} = \lambda \mathbf{v}$$

The eigenvalue equation

$A$
A square matrix — $n \times n$.
$\mathbf{v}$
A non-zero vector. The zero vector is excluded because it trivially satisfies $A\mathbf{0}=\mathbf{0}$ for any $\lambda$.
$\lambda$
A scalar — the eigenvalue. Can be real or complex.
$=$
Reads as: "applying $A$ to $\mathbf{v}$ gives back the same direction, just scaled by $\lambda$."

Geometric picture. Stand on an eigenvector. Apply the matrix. You stay on the same line through the origin — you just move farther (or closer, or flip) along it. Every other direction rotates. An $n\times n$ matrix has at most $n$ linearly independent eigen-directions.

How to find them by hand

Rewrite the eigenvalue equation:

$$A\mathbf{v} - \lambda\mathbf{v} = \mathbf{0} \iff (A - \lambda I)\mathbf{v} = \mathbf{0}$$

The characteristic equation setup

$(A - \lambda I)$
A new matrix — $A$ with $\lambda$ subtracted off each diagonal entry.
$\mathbf{v}$
A non-zero vector we're trying to find.
$\mathbf{0}$
Zero vector.

Key insight. For a non-zero $\mathbf{v}$ to satisfy $(A-\lambda I)\mathbf{v} = \mathbf{0}$, the matrix $(A-\lambda I)$ must be singular — otherwise the only solution would be $\mathbf{v}=\mathbf{0}$. Singular matrices have determinant zero.

So the eigenvalues are the roots of the characteristic polynomial:

$$\det(A - \lambda I) = 0$$

Characteristic polynomial

$\det$
Determinant.
$A - \lambda I$
$A$ with $\lambda$ subtracted off the diagonal.
$= 0$
The scalar equation whose solutions are the eigenvalues.

Why polynomial? The determinant, expanded out in terms of $\lambda$, is a polynomial of degree $n$ in $\lambda$. Its roots are the eigenvalues. By the fundamental theorem of algebra it has exactly $n$ roots (counting multiplicity, possibly complex). So an $n \times n$ matrix has exactly $n$ eigenvalues — some may repeat, some may be complex.

Worked example: a 2×2 case

Take $A = \begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix}$. Then:

$$\det(A - \lambda I) = \det\begin{pmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{pmatrix} = (4-\lambda)(3-\lambda) - 2 = \lambda^2 - 7\lambda + 10$$

Expanding the determinant

$(4-\lambda)(3-\lambda)$
Product of the diagonal entries of $A - \lambda I$.
$-2$
Subtraction of the off-diagonal product, which is $1\cdot 2 = 2$. (Recall $\det \begin{pmatrix}a&b\\c&d\end{pmatrix} = ad-bc$.)
$\lambda^2 - 7\lambda + 10$
The result after multiplying out — a degree-2 polynomial in $\lambda$.

Solve the quadratic: $\lambda^2 - 7\lambda + 10 = 0$ factors as $(\lambda - 5)(\lambda - 2)$. So $\lambda_1 = 5$, $\lambda_2 = 2$. To find the eigenvectors, plug each $\lambda$ back into $(A-\lambda I)\mathbf{v}=\mathbf{0}$. For $\lambda_1 = 5$: $(A-5I)\mathbf{v} = \begin{pmatrix}-1 & 1\\ 2 & -2\end{pmatrix}\mathbf{v} = \mathbf{0}$ gives $\mathbf{v}_1 = \begin{pmatrix}1\\1\end{pmatrix}$. For $\lambda_2 = 2$: $(A-2I)\mathbf{v} = \begin{pmatrix}2 & 1\\ 2 & 1\end{pmatrix}\mathbf{v} = \mathbf{0}$ gives $\mathbf{v}_2 = \begin{pmatrix}1\\-2\end{pmatrix}$.

Check: $A\mathbf{v}_1 = \begin{pmatrix}4+1\\2+3\end{pmatrix} = \begin{pmatrix}5\\5\end{pmatrix} = 5\mathbf{v}_1$. Good. $A\mathbf{v}_2 = \begin{pmatrix}4-2\\2-6\end{pmatrix} = \begin{pmatrix}2\\-4\end{pmatrix} = 2\mathbf{v}_2$. Good.

Once you've done this a few times, you'll never want to do it for anything bigger than 3×3 again. Everything larger gets computed numerically.

A 2×2 matrix transforms the unit circle into an ellipse. The yellow arrows are the eigenvectors — the special directions that only get stretched (or flipped), never rotated. Adjust the matrix entries to see how the eigenvectors and eigenvalues change.

14. Diagonalization

Suppose $A$ is a square matrix with $n$ linearly independent eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ and eigenvalues $\lambda_1, \dots, \lambda_n$. Pack the eigenvectors into the columns of a matrix $P$ and the eigenvalues onto the diagonal of a matrix $D$. Then:

$$A = P D P^{-1}$$

Diagonalization (eigendecomposition)

$A$
A square $n\times n$ matrix.
$P$
The matrix whose columns are the eigenvectors of $A$.
$D$
A diagonal matrix with $\lambda_1, \dots, \lambda_n$ on the diagonal.
$P^{-1}$
The inverse of $P$ (exists because the eigenvectors are independent).

How to read this. "$A$, viewed in the eigenvector basis, is just a diagonal matrix." Reading right to left: $P^{-1}$ changes coordinates into the eigenbasis, $D$ scales each eigendirection independently, $P$ changes back. A messy non-diagonal $A$ is secretly a diagonal operation in disguise.

Not every matrix is diagonalizable — if it has fewer than $n$ independent eigenvectors it isn't, and you need Jordan normal form. But every symmetric matrix is diagonalizable by an orthogonal matrix (the spectral theorem), and even matrices that aren't diagonalizable at all admit the SVD of section 16, which plays a similar role.

Diagonalization is why eigenvalues matter in practice. If $A = PDP^{-1}$, then:

$$A^k = P D^k P^{-1}$$

Powers of a diagonalizable matrix

$A^k$
$A$ multiplied by itself $k$ times.
$D^k$
Diagonal matrix with $\lambda_i^k$ on the diagonal — trivially cheap.
$P, P^{-1}$
Same change-of-basis matrices as before; they cancel in the middle when you expand the product.

Consequence. Raising $A$ to the 1000th power would be absurd if you did it directly. Through the eigenbasis it's just raising each eigenvalue to the 1000th power. This is how you solve linear recurrences (Fibonacci has a closed form this way), analyze Markov chains, and — in ML — understand how gradient descent with linear dynamics converges.
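The Fibonacci claim is easy to verify: diagonalize the Fibonacci step matrix and power the eigenvalues instead of the matrix. A short sketch (the check at the end compares against direct matrix powering):

```python
import numpy as np

# Fibonacci step: (F_{n+1}, F_n) = A @ (F_n, F_{n-1})
A = np.array([[1.0, 1.0],
              [1.0, 0.0]])

vals, P = np.linalg.eig(A)          # eigenvalues: phi and 1 - phi
Dk = np.diag(vals ** 10)            # raise each eigenvalue to the 10th power
A10 = P @ Dk @ np.linalg.inv(P)     # A^10 through the eigenbasis

print(np.round(A10).astype(int))    # [[89 55], [55 34]] -- Fibonacci numbers
print(np.allclose(A10, np.linalg.matrix_power(A, 10)))   # True
```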

15. Orthogonal bases and Gram–Schmidt

A basis is orthogonal if all pairs of basis vectors are perpendicular: $\mathbf{v}_i \cdot \mathbf{v}_j = 0$ for $i \ne j$. It's orthonormal if additionally each basis vector has length 1. Orthonormal bases are magical: once you have one, everything is easy.

In an orthonormal basis $\{\mathbf{q}_1, \dots, \mathbf{q}_n\}$, any vector $\mathbf{x}$ is:

$$\mathbf{x} = \sum_{i=1}^n (\mathbf{x}\cdot\mathbf{q}_i)\,\mathbf{q}_i$$

Coordinates in an orthonormal basis

$\mathbf{q}_i$
The $i$-th orthonormal basis vector.
$\mathbf{x}\cdot\mathbf{q}_i$
A scalar — the "coordinate" of $\mathbf{x}$ along $\mathbf{q}_i$, just a projection.
$(\mathbf{x}\cdot\mathbf{q}_i)\mathbf{q}_i$
Scalar times direction — a vector along $\mathbf{q}_i$ of the right length.

Why this is great. You don't have to solve a linear system to find coordinates in an orthonormal basis — you just take dot products. Fourier series work exactly this way. PCA works exactly this way. Finding your basis components becomes one matrix-vector multiply by $Q^T$.
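A quick numerical check of the coordinate formula — with an orthonormal basis, finding coordinates really is just dot products (sketch; the basis here is an arbitrary rotation):

```python
import numpy as np

# An orthonormal basis of R^2: the columns of a rotation matrix
t = np.pi / 6
Q = np.array([[np.cos(t), -np.sin(t)],
              [np.sin(t),  np.cos(t)]])   # columns q1, q2

x = np.array([2.0, -1.0])

# Coordinates along each basis vector: one dot product each (= Q^T x)
coords = Q.T @ x

# Reconstruct x as sum of (x . q_i) q_i
x_rebuilt = Q @ coords
print(np.allclose(x_rebuilt, x))           # True -- no linear system solved
```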

Gram–Schmidt is the recipe for turning an arbitrary linearly independent set $\{\mathbf{a}_1, \dots, \mathbf{a}_k\}$ into an orthonormal set $\{\mathbf{q}_1, \dots, \mathbf{q}_k\}$ spanning the same space. You process the vectors one at a time, subtracting off the part that points along the directions you've already made orthonormal:

$$\mathbf{u}_i = \mathbf{a}_i - \sum_{j=1}^{i-1}(\mathbf{a}_i\cdot\mathbf{q}_j)\mathbf{q}_j, \qquad \mathbf{q}_i = \frac{\mathbf{u}_i}{\|\mathbf{u}_i\|}$$

Gram–Schmidt orthogonalization

$\mathbf{a}_i$
The $i$-th input vector.
$\mathbf{q}_j$
The $j$-th already-orthonormalized basis vector ($j < i$).
$(\mathbf{a}_i\cdot\mathbf{q}_j)\mathbf{q}_j$
Projection of $\mathbf{a}_i$ onto $\mathbf{q}_j$.
$\mathbf{u}_i$
What's left of $\mathbf{a}_i$ after removing its components along the earlier basis vectors — guaranteed perpendicular to all of them.
$\mathbf{q}_i = \mathbf{u}_i/\|\mathbf{u}_i\|$
Normalize to length 1.

Result. An orthonormal basis $\{\mathbf{q}_1,\dots,\mathbf{q}_k\}$ whose span equals the span of the input. Gram–Schmidt is also how you prove that the QR decomposition exists: any full-column-rank matrix $A$ can be written as $A = QR$ with $Q$ orthogonal and $R$ upper triangular.
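The recipe translates almost line for line into NumPy. A sketch of classical Gram–Schmidt — fine for illustration, though in practice you'd reach for the numerically sturdier `np.linalg.qr`:

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors."""
    qs = []
    for a in vectors:
        u = a.astype(float).copy()
        for q in qs:                       # subtract components along earlier q's
            u -= (a @ q) * q               # projection (a . q_j) q_j
        qs.append(u / np.linalg.norm(u))   # normalize the residual
    return np.array(qs)                    # rows are orthonormal

a1 = np.array([3.0, 1.0, 0.0])
a2 = np.array([2.0, 2.0, 0.0])
Q = gram_schmidt([a1, a2])

print(np.round(Q @ Q.T, 10))   # 2x2 identity: the q's are orthonormal
```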

Drag a₁ and a₂ to see Gram–Schmidt live. q₁ = a₁ normalized. The dashed line shows the projection of a₂ onto q₁ that gets subtracted. q₂ is the residual after subtracting, then normalized — always perpendicular to q₁.

16. The Singular Value Decomposition

Now the big one. The SVD is the most important factorization in linear algebra and, by extension, in machine learning. It is to linear algebra what the Taylor series is to calculus: it takes any object of the subject and writes it in a standard, useful form.

SVD theorem. Any $m \times n$ matrix $A$ — tall, wide, singular, as ugly as you like — can be written as $A = U\Sigma V^T$, where $U$ is $m \times m$ orthogonal, $V$ is $n \times n$ orthogonal, and $\Sigma$ is $m \times n$ with non-negative entries only on the diagonal.
$$A = U\Sigma V^T$$

SVD (Singular Value Decomposition)

$A$
Any $m\times n$ matrix. No assumptions.
$U$
An $m\times m$ orthogonal matrix. Its columns $\mathbf{u}_1, \dots, \mathbf{u}_m$ are the left singular vectors.
$\Sigma$
An $m \times n$ matrix with singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge 0$ on the diagonal and zeros everywhere else. The $\sigma_i$ are non-negative real numbers.
$V^T$
The transpose of an $n\times n$ orthogonal matrix $V$. Columns of $V$ are right singular vectors.

The geometric content. Any linear transformation is really just three steps:

  1. Rotate (or reflect) the input space with $V^T$.
  2. Scale each axis independently by $\sigma_i$ with $\Sigma$.
  3. Rotate again with $U$ to land in the output space.
That's it. Every matrix in the universe is this. Pick any 2×2 matrix, apply it to a unit circle, and you get an ellipse — the ellipse's axes are $\sigma_1 \mathbf{u}_1$ and $\sigma_2 \mathbf{u}_2$. The $\sigma_i$ tell you how much stretching happens in each principal direction, and if some $\sigma_i$ are zero, the matrix is rank-deficient and collapses a dimension.

The SVD says every matrix = rotate (VT) → scale by singular values σ₁,σ₂ (Σ) → rotate again (U). The dashed circle is the input unit circle; the cyan ellipse is its image under the matrix. The gold axes are the principal directions. σ₁ ≥ σ₂ controls how elongated the ellipse is.

Why SVD is the most important factorization in ML

Four huge applications — low-rank compression, PCA (section 18), least-squares via the pseudoinverse, and recommender systems — all reduce to "look at the SVD". The workhorse result is the Eckart–Young truncation:

$$A_k = \sum_{i=1}^{k} \sigma_i\, \mathbf{u}_i \mathbf{v}_i^T$$

Low-rank approximation

$A_k$
The best rank-$k$ approximation to $A$ in the least-squares sense.
$\sigma_i$
The $i$-th singular value.
$\mathbf{u}_i, \mathbf{v}_i$
The $i$-th left and right singular vectors.
$\mathbf{u}_i\mathbf{v}_i^T$
An outer product — an $m\times n$ rank-1 matrix.
$\sum_{i=1}^k$
Sum the top $k$ rank-1 pieces.

Practical meaning. If the singular values decay fast, a rank-$k$ approximation with tiny $k$ captures almost all of $A$. That is why low-rank compression works: real-world data matrices (word co-occurrences, user-item ratings, image patches, neural network weights) are approximately low-rank. LoRA runs on exactly this intuition: represent a weight update as a rank-$k$ product of two thin matrices, with $k = 8$ or 16.

17. Tensors

You've seen scalars, vectors, and matrices. A tensor is just the natural generalisation: an array of numbers indexed by any number of indices (in physics, also equipped with rules for how it transforms when you change coordinates; in ML, "tensor" usually just means the multi-dimensional array). Vectors are rank-1 tensors. Matrices are rank-2 tensors. A 3-D grid of numbers is a rank-3 tensor, and so on.

The rank (or order) of a tensor is its number of indices.

  • Rank 0 — scalar. A single number. No indices needed. Example: the temperature outside right now.
  • Rank 1 — vector. $v_i$. One free index $i$. Example: a word embedding of length 512.
  • Rank 2 — matrix. $A_{ij}$. Two free indices. Example: a weight matrix $W \in \mathbb{R}^{m \times n}$.
  • Rank 3. $T_{ijk}$. Three indices. Example: a batch of images $(B, H, W)$ — batch index, height, width.
  • Rank 4. $T_{nchw}$. Example: standard CNN input with shape $(N, C, H, W)$ — batch, channels, height, width.

Einstein summation notation

Writing $\sum_j A_{ij} v_j$ for every matrix-vector product gets tedious. Einstein's convention: if an index appears twice in a term, it is summed over. You never write the $\Sigma$.

Operation | Regular notation | Einstein notation
Matrix-vector product | $\sum_j A_{ij} v_j$ | $A_{ij} v_j$
Matrix-matrix product | $\sum_k A_{ik} B_{kj}$ | $A_{ik} B_{kj}$
Dot product | $\sum_i u_i v_i$ | $u_i v_i$
Trace | $\sum_i A_{ii}$ | $A_{ii}$
Outer product | $(uv^T)_{ij} = u_i v_j$ | $u_i v_j$ (no repeated index — no sum)
Batch matmul | $C_{bij} = \sum_k A_{bik} B_{bkj}$ | $A_{bik} B_{bkj}$

NumPy and PyTorch both implement this with einsum: torch.einsum('bik,bkj->bij', A, B) is a batched matrix multiply. Reading einsum strings is a core skill in ML engineering.

Tensor operations

Outer product
$u \otimes v = u_i v_j$ — takes two rank-1 tensors and produces a rank-2 tensor (a matrix). Generalises: $u \otimes v \otimes w = u_i v_j w_k$ gives rank-3. The tensor product of spaces is built from this.
Contraction
Sum over a pair of matching indices. Matrix-vector product is a contraction. The trace is a contraction of both indices of a matrix. Each contraction reduces the rank by 2.
Transpose / permute
Reorder the indices: $A_{ij} \to A_{ji}$ is the usual transpose. For higher-rank tensors, a permutation like $(0, 2, 1, 3)$ reorders the four axes. In PyTorch: x.permute(0, 2, 1, 3).
Reshape / view
Change the shape without touching the data. A $(B, H \times W, C)$ tensor can be reshaped to $(B, H, W, C)$ — same underlying array, different index interpretation. Reshape does not change values.
Frobenius norm
$\|T\|_F = \sqrt{\sum_{i_1 \cdots i_k} T_{i_1 \cdots i_k}^2}$ — the generalisation of the matrix Frobenius norm to any rank. Compute it in NumPy with np.linalg.norm(T).

Tensors in ML practice

Modern deep learning frameworks (PyTorch, JAX, TensorFlow) are built around tensors. Here is what each axis typically means:

Context | Tensor shape | Axis meanings
Image batch | $(N, C, H, W)$ | batch · channels · height · width
Text batch | $(N, L)$ | batch · sequence length (token ids)
Embedded text | $(N, L, d)$ | batch · tokens · embedding dim
Attention QKV | $(N, H, L, d_k)$ | batch · heads · tokens · head dim
Conv2D weights | $(C_{out}, C_{in}, k_H, k_W)$ | output channels · input channels · kernel height · kernel width
Video | $(N, T, C, H, W)$ | batch · time · channels · height · width

The multi-head attention formula $\text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ is really a rank-4 einsum: 'bhid,bhjd->bhij' for the dot-product step, followed by 'bhij,bhjd->bhid' for the weighted sum — all batched over both the batch dimension $b$ and the heads dimension $h$ simultaneously.
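The two einsum strings can be run directly in NumPy. A sketch with illustrative shapes and the softmax written inline:

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, L, dk = 2, 4, 6, 8           # batch, heads, tokens, head dim
Q = rng.standard_normal((N, H, L, dk))
K = rng.standard_normal((N, H, L, dk))
V = rng.standard_normal((N, H, L, dk))

# Scaled dot products: 'bhid,bhjd->bhij'
scores = np.einsum('bhid,bhjd->bhij', Q, K) / np.sqrt(dk)

# Softmax over the last (key) axis
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)

# Weighted sum of values: 'bhij,bhjd->bhid'
out = np.einsum('bhij,bhjd->bhid', w, V)
print(out.shape)                   # (2, 4, 6, 8) -- same shape as Q
```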

The connection to matrices

Any rank-$k$ tensor with shape $(n_1, n_2, \ldots, n_k)$ can be unfolded (also called matricised) into a matrix by grouping some indices as rows and the rest as columns. This is the key insight behind tensor decompositions such as Tucker and CP: once the tensor is a matrix, all the matrix machinery — SVD included — applies.

Outer product u ⊗ v — each cell shows $u_i \cdot v_j$. Colour intensity ∝ magnitude; red = negative.

Source code — Tensors & einsum (Pseudocode · Python · JavaScript · C · C++ · Java · Go)
// ── Tensor fundamentals ──────────────────────────────────────
// Outer product (rank-1 ⊗ rank-1 → rank-2)
function outer(u[n], v[m]) -> T[n][m]:
    for i in 0..n:
        for j in 0..m:
            T[i][j] = u[i] * v[j]
    return T

// Einstein summation: matrix multiply
// 'ik,kj->ij'  means sum over k
function matmul_einsum(A[m][k], B[k][n]) -> C[m][n]:
    for i in 0..m:
        for j in 0..n:
            C[i][j] = sum(A[i][p] * B[p][j]  for p in 0..k)
    return C

// Tensor contraction: contract axis-1 of T3 with axis-0 of M
// T3[i][j][k], M[j][l]  →  R[i][l][k]  (sum over j)
function contract(T3[I][J][K], M[J][L]) -> R[I][L][K]:
    for i,l,k:
        R[i][l][k] = sum(T3[i][j][k] * M[j][l]  for j in 0..J)
    return R

// Reshape: (A*B, C) → (A, B, C)  -- no data copy
function reshape(T[N][C], A, B):
    assert A * B == N
    return view T as T2[A][B][C]

// Frobenius norm over arbitrary shape
function frobenius(T):
    return sqrt(sum(x*x for x in flatten(T)))
import numpy as np

# ── Outer product ─────────────────────────────────────────────
u = np.array([2.0, 1.0, -1.0])
v = np.array([1.0, 3.0, 2.0])

outer = np.outer(u, v)          # shape (3, 3)
# Equivalently: np.einsum('i,j->ij', u, v)
print("u ⊗ v:\n", outer)

# ── Einstein summation ────────────────────────────────────────
A = np.random.randn(4, 5)
B = np.random.randn(5, 6)

# Matrix multiply
C = np.einsum('ik,kj->ij', A, B)            # (4, 6)
assert np.allclose(C, A @ B)

# Dot product
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
dot = np.einsum('i,i->', a, b)              # scalar
print("dot:", dot)

# Trace
M = np.eye(4) * 3
tr = np.einsum('ii->', M)                   # 12.0

# Batch matmul (N, I, K) × (N, K, J) → (N, I, J)
N = 8
X = np.random.randn(N, 4, 5)
Y = np.random.randn(N, 5, 6)
Z = np.einsum('bik,bkj->bij', X, Y)        # (8, 4, 6)
assert np.allclose(Z, X @ Y)

# ── Rank-3 tensor operations ──────────────────────────────────
T = np.random.randn(2, 3, 4)   # shape (2,3,4)

# Permute axes
T_perm = np.transpose(T, (1, 0, 2))   # (3, 2, 4)

# Reshape / view
T_flat = T.reshape(2, 12)              # (2, 12)
T_back = T_flat.reshape(2, 3, 4)      # back to (2,3,4)

# Frobenius norm
fnorm = np.linalg.norm(T)

# ── CNN-style 4-D tensor (N, C, H, W) ────────────────────────
batch = np.random.randn(16, 3, 32, 32)  # 16 RGB 32×32 images
print("Batch shape:", batch.shape)

# Sum over spatial dims → (N, C)
spatial_mean = batch.mean(axis=(2, 3))
print("Spatial mean shape:", spatial_mean.shape)

# ── Tucker-like low-rank outer product sum ────────────────────
# Rank-1 approximation: T ≈ a ⊗ b ⊗ c
a_ = np.array([1.0, -1.0])
b_ = np.array([0.5, 1.0, -0.5])
c_ = np.array([2.0, 0.0, -1.0, 1.0])
rank1 = np.einsum('i,j,k->ijk', a_, b_, c_)   # (2,3,4)
// ── Outer product ─────────────────────────────────────────────
function outer(u, v) {
  return u.map(ui => v.map(vj => ui * vj));
}

// ── Einsum helpers (basic cases) ─────────────────────────────
// Matrix multiply: 'ik,kj->ij'
function matmul(A, B) {
  const [m, k] = [A.length, A[0].length];
  const n = B[0].length;
  return Array.from({length: m}, (_, i) =>
    Array.from({length: n}, (_, j) =>
      Array.from({length: k}, (_, p) => A[i][p] * B[p][j])
        .reduce((s, x) => s + x, 0)
    )
  );
}

// Dot product: 'i,i->'
function dot(a, b) {
  return a.reduce((s, ai, i) => s + ai * b[i], 0);
}

// Batch matmul: 'bik,bkj->bij'
function batchMatmul(A, B) {
  return A.map((Ai, i) => matmul(Ai, B[i]));
}

// ── Rank-3 tensor indexing ────────────────────────────────────
// Shape (D0, D1, D2) stored as nested arrays
function zeros(d0, d1, d2) {
  return Array.from({length: d0}, () =>
    Array.from({length: d1}, () => new Float32Array(d2))
  );
}

// Outer product of three vectors → rank-3 tensor
function outer3(a, b, c) {
  return a.map(ai =>
    b.map(bi =>
      c.map(ci => ai * bi * ci)
    )
  );
}

// Frobenius norm
function frobeniusNorm(T) {
  let sum = 0;
  function recurse(x) {
    if (Array.isArray(x)) x.forEach(recurse);
    else sum += x * x;
  }
  recurse(T);
  return Math.sqrt(sum);
}

// ── Demo ──────────────────────────────────────────────────────
const u = [2, 1, -1], v = [1, 3, 2];
console.log("u ⊗ v:", outer(u, v));

const A = [[1,2],[3,4]], B = [[5,6],[7,8]];
console.log("A @ B:", matmul(A, B));

const t3 = outer3([1,-1], [0.5,1,-0.5], [2,0,-1,1]);
console.log("rank-3 shape:", t3.length, t3[0].length, t3[0][0].length);
console.log("Frobenius norm:", frobeniusNorm(t3).toFixed(4));
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

/* ── Outer product ────────────────────────────────────────────
   u[m], v[n]  →  out[m*n]  (row-major) */
void outer(const double *u, int m,
           const double *v, int n,
           double *out) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            out[i*n + j] = u[i] * v[j];
}

/* ── Matrix multiply: C[m][n] = A[m][k] * B[k][n] ─────────── */
void matmul(const double *A, const double *B, double *C,
            int m, int k, int n) {
    memset(C, 0, m * n * sizeof(double));
    for (int i = 0; i < m; i++)
        for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i*n + j] += A[i*k + p] * B[p*n + j];
}

/* ── Rank-3 tensor contraction ────────────────────────────────
   T[I][J][K], M[J][L]  →  R[I][L][K]  (sum over j) */
void contract3(const double *T, int I, int J, int K,
               const double *M, int L,
               double *R) {
    memset(R, 0, I * L * K * sizeof(double));
    for (int i = 0; i < I; i++)
      for (int j = 0; j < J; j++)
        for (int k = 0; k < K; k++)
          for (int l = 0; l < L; l++)
            R[(i*L + l)*K + k] +=
              T[(i*J + j)*K + k] * M[j*L + l];
}

/* ── Frobenius norm ───────────────────────────────────────────*/
double frob(const double *T, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) s += T[i]*T[i];
    return sqrt(s);
}

int main(void) {
    double u[] = {2, 1, -1}, v[] = {1, 3, 2};
    double M[9];
    outer(u, 3, v, 3, M);
    printf("u⊗v[0][1] = %.1f  (expect 6)\n", M[0*3+1]);

    double A[] = {1,2,3,4}, B[] = {5,6,7,8}, C[4];
    matmul(A, B, C, 2, 2, 2);
    printf("(A@B)[0][0] = %.0f  (expect 19)\n", C[0]);
    printf("Frobenius |M| = %.4f\n", frob(M, 9));
    return 0;
}
#include <iostream>
#include <vector>
#include <cmath>
#include <numeric>
#include <cassert>

// ── Simple rank-N tensor (flat storage + shape) ───────────────
struct Tensor {
    std::vector<int> shape;
    std::vector<double> data;

    Tensor(std::vector<int> sh) : shape(sh) {
        int n = 1;
        for (int d : sh) n *= d;
        data.assign(n, 0.0);
    }

    // Linear index from multi-index
    int idx(std::vector<int> ix) const {
        int lin = 0, stride = 1;
        for (int d = shape.size()-1; d >= 0; --d) {
            lin += ix[d] * stride;
            stride *= shape[d];
        }
        return lin;
    }

    double& at(std::vector<int> ix) { return data[idx(ix)]; }
    double  at(std::vector<int> ix) const { return data[idx(ix)]; }

    double frobenius() const {
        double s = 0;
        for (double x : data) s += x*x;
        return std::sqrt(s);
    }
};

// ── Outer product: rank-1 ⊗ rank-1 → rank-2 ──────────────────
Tensor outer(const std::vector<double>& u,
             const std::vector<double>& v) {
    int m = u.size(), n = v.size();
    Tensor T({m, n});
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            T.at({i,j}) = u[i] * v[j];
    return T;
}

// ── Matrix multiply via einsum 'ik,kj->ij' ───────────────────
Tensor matmul(const Tensor& A, const Tensor& B) {
    int m = A.shape[0], k = A.shape[1], n = B.shape[1];
    assert(B.shape[0] == k);
    Tensor C({m, n});
    for (int i = 0; i < m; i++)
      for (int p = 0; p < k; p++)
        for (int j = 0; j < n; j++)
            C.at({i,j}) += A.at({i,p}) * B.at({p,j});
    return C;
}

int main() {
    std::vector<double> u = {2, 1, -1}, v = {1, 3, 2};
    auto M = outer(u, v);
    std::cout << "M[0][1] = " << M.at({0,1}) << "  (expect 6)\n";
    std::cout << "Frobenius = " << M.frobenius() << "\n";

    Tensor A({2,2}), B({2,2});
    A.at({0,0})=1; A.at({0,1})=2; A.at({1,0})=3; A.at({1,1})=4;
    B.at({0,0})=5; B.at({0,1})=6; B.at({1,0})=7; B.at({1,1})=8;
    auto C = matmul(A, B);
    std::cout << "(A@B)[0][0] = " << C.at({0,0}) << "  (expect 19)\n";
    return 0;
}
import java.util.Arrays;

public class Tensors {

    // ── Outer product u[m] ⊗ v[n] → M[m][n] ─────────────────
    static double[][] outer(double[] u, double[] v) {
        double[][] M = new double[u.length][v.length];
        for (int i = 0; i < u.length; i++)
            for (int j = 0; j < v.length; j++)
                M[i][j] = u[i] * v[j];
        return M;
    }

    // ── Matrix multiply A[m][k] @ B[k][n] → C[m][n] ─────────
    static double[][] matmul(double[][] A, double[][] B) {
        int m = A.length, k = A[0].length, n = B[0].length;
        double[][] C = new double[m][n];
        for (int i = 0; i < m; i++)
          for (int p = 0; p < k; p++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][p] * B[p][j];
        return C;
    }

    // ── Frobenius norm ────────────────────────────────────────
    static double frobenius(double[][] M) {
        double s = 0;
        for (double[] row : M)
            for (double x : row) s += x * x;
        return Math.sqrt(s);
    }

    // ── Rank-3 outer product u⊗v⊗w ───────────────────────────
    static double[][][] outer3(double[] u, double[] v, double[] w) {
        double[][][] T = new double[u.length][v.length][w.length];
        for (int i = 0; i < u.length; i++)
          for (int j = 0; j < v.length; j++)
            for (int k = 0; k < w.length; k++)
                T[i][j][k] = u[i] * v[j] * w[k];
        return T;
    }

    public static void main(String[] args) {
        double[] u = {2, 1, -1}, v = {1, 3, 2};
        double[][] M = outer(u, v);
        System.out.printf("M[0][1] = %.1f  (expect 6.0)%n", M[0][1]);
        System.out.printf("Frobenius = %.4f%n", frobenius(M));

        double[][] A = {{1,2},{3,4}}, B = {{5,6},{7,8}};
        double[][] C = matmul(A, B);
        System.out.printf("(A@B)[0][0] = %.0f  (expect 19)%n", C[0][0]);

        double[] w = {0.5, -1, 2};
        double[][][] T3 = outer3(u, v, w);
        System.out.printf("T3[0][1][2] = %.1f  (expect %.1f)%n",
            T3[0][1][2], u[0]*v[1]*w[2]);
    }
}
package main

import (
	"fmt"
	"math"
)

// ── Outer product ─────────────────────────────────────────────
func outer(u, v []float64) [][]float64 {
	m, n := len(u), len(v)
	M := make([][]float64, m)
	for i := range M {
		M[i] = make([]float64, n)
		for j := range M[i] {
			M[i][j] = u[i] * v[j]
		}
	}
	return M
}

// ── Matrix multiply ───────────────────────────────────────────
func matmul(A, B [][]float64) [][]float64 {
	m, k, n := len(A), len(A[0]), len(B[0])
	C := make([][]float64, m)
	for i := range C {
		C[i] = make([]float64, n)
	}
	for i := 0; i < m; i++ {
		for p := 0; p < k; p++ {
			for j := 0; j < n; j++ {
				C[i][j] += A[i][p] * B[p][j]
			}
		}
	}
	return C
}

// ── Frobenius norm ────────────────────────────────────────────
func frobenius(M [][]float64) float64 {
	var s float64
	for _, row := range M {
		for _, x := range row {
			s += x * x
		}
	}
	return math.Sqrt(s)
}

// ── Rank-3 outer product ──────────────────────────────────────
func outer3(a, b, c []float64) [][][]float64 {
	T := make([][][]float64, len(a))
	for i := range T {
		T[i] = make([][]float64, len(b))
		for j := range T[i] {
			T[i][j] = make([]float64, len(c))
			for k := range T[i][j] {
				T[i][j][k] = a[i] * b[j] * c[k]
			}
		}
	}
	return T
}

func main() {
	u := []float64{2, 1, -1}
	v := []float64{1, 3, 2}
	M := outer(u, v)
	fmt.Printf("M[0][1] = %.1f  (expect 6.0)\n", M[0][1])
	fmt.Printf("Frobenius = %.4f\n", frobenius(M))

	A := [][]float64{{1, 2}, {3, 4}}
	B := [][]float64{{5, 6}, {7, 8}}
	C := matmul(A, B)
	fmt.Printf("(A@B)[0][0] = %.0f  (expect 19)\n", C[0][0])

	w := []float64{0.5, -1, 2}
	T3 := outer3(u, v, w)
	fmt.Printf("T3[0][1][2] = %.1f  (expect %.1f)\n",
		T3[0][1][2], u[0]*v[1]*w[2])
}

18. PCA in 30 lines

Let's walk through PCA (Principal Component Analysis) as a single, concrete application of SVD — no more abstract fanfare.

You have $N$ data points, each a $d$-dimensional vector. You stack them into a data matrix $X$ of shape $N \times d$. You want to represent each point with just $k \ll d$ numbers, while losing as little information as possible.

The recipe:

  1. Center the data. Subtract the mean of each column. Call the result $\tilde X$. This ensures the data cloud is centered at the origin.
  2. Compute the SVD: $\tilde X = U\Sigma V^T$.
  3. The columns of $V$ are the principal components. $V_1$ is the direction of maximum variance in the data; $V_2$ is the direction of second-most variance, perpendicular to $V_1$; and so on.
  4. Project: the $k$-dimensional representation of each data point is $\tilde X V_{:,1:k}$, where you keep only the first $k$ columns of $V$.
  5. Reconstruct (optional): $\tilde X_{\text{approx}} = \tilde X V_{:,1:k} V_{:,1:k}^T$.

The variance captured by the $i$-th principal component is $\sigma_i^2 / (N-1)$. Sum the top $k$ squared singular values and divide by the total sum of squared singular values; that's the fraction of variance you kept. If it's 99% with $k=20$ on 1000-dimensional data, you just compressed by 50× with almost no loss.

Geometric picture. PCA fits the best $k$-dimensional subspace to the cloud of data points, in the least-squares sense. For $k=1$ it's fitting a line; for $k=2$ a plane; for $k=3$ a 3-flat. "Best" means "minimizes the total squared distance from the points to the subspace." That the answer is the top singular vectors of the data matrix is a beautiful consequence of Eckart–Young.
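The five-step recipe, sketched in NumPy on synthetic data (shapes, noise level, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 200, 10, 2

# Synthetic data living near a 2-D subspace, offset from the origin
X = rng.standard_normal((N, k)) @ rng.standard_normal((k, d))
X += 0.05 * rng.standard_normal((N, d)) + 3.0

# 1. Center
Xc = X - X.mean(axis=0)
# 2. SVD
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
# 3. Principal components = rows of Vt (columns of V)
# 4. Project to k dimensions
Z = Xc @ Vt[:k].T                     # (N, k)
# 5. Reconstruct (optional)
X_approx = Z @ Vt[:k]

var_kept = (s[:k]**2).sum() / (s**2).sum()
print(f"variance kept with k={k}: {var_kept:.4f}")   # close to 1: the cloud is ~2-D
```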

19. Applications to ML

Here's the quick survey. Every item on this list reuses the stuff above.

Embeddings are vectors

Every embedding model maps things (words, sentences, images, users, molecules) to vectors in $\mathbb{R}^d$, typically with $d$ between 128 and 4096. You then compare them with cosine similarity — which we showed is dot product normalized by norms. The entire business of semantic search is linear algebra over these embeddings.

A neural network layer is $W\mathbf{x}+\mathbf{b}$

The fully connected ("dense") layer, the workhorse of every deep network, does:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

Dense layer

$\mathbf{x}$
Input vector of size $d_{\text{in}}$.
$W$
The weight matrix — size $d_{\text{out}} \times d_{\text{in}}$. Learned.
$\mathbf{b}$
A bias vector of size $d_{\text{out}}$. Learned.
$\mathbf{y}$
Output — an activation then gets applied (ReLU, GELU) before the next layer.

In a GPT-style model, $W$ might be $4096 \times 4096$, and there are dozens of these in each of tens of layers. When someone says "run a forward pass," they mean: chain a few hundred of these linear transforms together, interleaved with non-linearities. Training is gradient descent on the entries of all the $W$'s. That's what a trillion-parameter model is — a trillion entries, across many $W$'s.
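A forward pass really is just chained dense layers with squashing in between. A toy two-layer sketch (sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, W, b):
    """One fully connected layer: y = W x + b."""
    return W @ x + b

def relu(x):
    return np.maximum(x, 0.0)

# A tiny MLP: 8 -> 16 -> 4
W1, b1 = rng.standard_normal((16, 8)), np.zeros(16)
W2, b2 = rng.standard_normal((4, 16)), np.zeros(4)

x = rng.standard_normal(8)
h = relu(dense(x, W1, b1))     # linear transform, then non-linear squash
y = dense(h, W2, b2)
print(y.shape)                 # (4,)

# "Parameters" = every entry of every W and b
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)                # 128 + 16 + 64 + 4 = 212
```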

Attention is $\text{softmax}(QK^T/\sqrt{d_k})V$

Self-attention computes, for each query vector, a weighted sum of value vectors, where the weights are softmaxed dot products between the query and keys. Three matrix multiplies plus a scaling. See self-attention for the full mechanics — the math is entirely the stuff on this page.

Softmax, layer norm, residuals

Everything between the matrix multiplies — softmax, layer normalization, residual additions — is element-wise or per-vector arithmetic. The matrix multiplies are where virtually all of the compute, and all of the linear algebra, lives.

Weight matrices, and why they're often nearly low-rank

Empirically, the weight matrices of trained transformers are approximately low-rank — their spectrum decays quickly. That's the observation LoRA exploits: instead of fine-tuning the full $W$, add a low-rank update $\Delta W = BA$ where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ with $r$ small (8 to 64). You update $r \cdot 2d$ parameters instead of $d^2$. Same SVD intuition, applied to updates instead of weights.
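A back-of-the-envelope sketch of the savings (dimensions are illustrative, smaller than real transformer layers):

```python
import numpy as np

d, r = 512, 8
rng = np.random.default_rng(0)

# Low-rank update: Delta W = B @ A, with B (d x r) and A (r x d)
B = rng.standard_normal((d, r)) * 0.01
A = rng.standard_normal((r, d)) * 0.01
delta_W = B @ A                       # full d x d matrix, but rank <= r

full_params = d * d                   # fine-tuning W directly
lora_params = 2 * d * r               # fine-tuning B and A instead
print(full_params, lora_params, full_params // lora_params)
# 262144 8192 32 -- a 32x reduction at r = 8

print(np.linalg.matrix_rank(delta_W))  # at most r = 8
```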

20. NumPy code

Three tabs: core matrix operations, eigendecomposition, and SVD + PCA. No library beyond NumPy.

linear algebra · numpy
import numpy as np

# Vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product — four equivalent forms
print(np.dot(a, b))          # 32.0
print(a @ b)                 # 32.0  (Python 3.5+ operator)
print((a * b).sum())         # 32.0
print(np.einsum("i,i->", a, b))  # 32.0

# Norms
print(np.linalg.norm(a))                  # L2 norm ≈ 3.742
print(np.linalg.norm(a, ord=1))           # L1 norm = 6.0
print(np.linalg.norm(a, ord=np.inf))      # L-inf norm = 3.0

# Matrices
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

# Matrix multiplication — use @, not *
C = A @ B                             # [[19, 22], [43, 50]]
print(C)

# A * B is ELEMENT-WISE, not matrix multiplication. Classic bug.
elementwise = A * B                   # [[ 5, 12], [21, 32]]

# Transpose, inverse, determinant
print(A.T)                          # [[1, 3], [2, 4]]
print(np.linalg.inv(A))              # [[-2, 1], [1.5, -0.5]]
print(np.linalg.det(A))              # -2.0

# Solving Ax = b — NEVER compute inv(A) @ b. Use solve().
b_vec = np.array([5.0, 11.0])
x = np.linalg.solve(A, b_vec)        # [1., 2.] (unique solution exists)
print("check residual:", A @ x - b_vec)
eigendecomposition · numpy
import numpy as np

# A general 2x2 matrix
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigenvalues and eigenvectors (columns of V are the eigenvectors)
vals, V = np.linalg.eig(A)
print("eigenvalues:", vals)            # [5., 2.]
print("eigenvectors (cols):", V)

# Sanity check: A v = lambda v for each eigenpair
for i in range(len(vals)):
    lhs = A @ V[:, i]
    rhs = vals[i] * V[:, i]
    print(np.allclose(lhs, rhs))       # True

# Reconstruct A = P D P^-1 (diagonalization)
D = np.diag(vals)
A_rebuilt = V @ D @ np.linalg.inv(V)
print(np.allclose(A_rebuilt, A))        # True

# For SYMMETRIC matrices, use eigh — it is faster and gives
# real eigenvalues and orthonormal eigenvectors.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
w, Q = np.linalg.eigh(S)
print(w)                              # real, sorted ascending
print(Q.T @ Q)                         # ≈ identity: Q is orthogonal

# Power iteration: find the dominant eigenvalue without eig().
# Used for PageRank, and as a building block for Lanczos / Arnoldi.
def power_iter(M, steps=100):
    v = np.random.randn(M.shape[0])
    for _ in range(steps):
        v = M @ v
        v /= np.linalg.norm(v)
    lam = v @ (M @ v)                 # Rayleigh quotient
    return lam, v

lam, v = power_iter(A)
print("dominant lambda ≈", lam)       # ≈ 5.0
svd + pca · numpy
import numpy as np

# =========  SVD of an arbitrary (non-square) matrix  =========
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10., 11., 12.]])   # 4x3, rank 2 (the columns span only a 2-D plane)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("singular values:", s)         # e.g. [25.5, 1.29, ~0]
print("rank ≈", np.sum(s > 1e-10))   # 2

# Reconstruct A from its SVD
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A_rebuilt, A))      # True

# Rank-1 approximation (Eckart–Young)
A1 = s[0] * np.outer(U[:, 0], Vt[0])
print("rank-1 error:", np.linalg.norm(A - A1))

# =========  PCA in a dozen lines  =========
def pca(X, k):
    # X: (N, d) data matrix
    mu = X.mean(axis=0, keepdims=True)
    Xc = X - mu                       # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]               # (k, d) — principal axes
    scores     = Xc @ components.T    # (N, k) — compressed coords
    var_explained = (s[:k] ** 2) / (s ** 2).sum()
    return scores, components, var_explained, mu

# Try it on random 100-d data with hidden 5-d structure
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 5))
W      = rng.standard_normal((5, 100))
X      = latent @ W + 0.01 * rng.standard_normal((1000, 100))

scores, comps, var_exp, mu = pca(X, k=5)
print("variance captured by top-5:", var_exp.sum())   # >0.999

# Reconstruct
X_hat = scores @ comps + mu
print("rmse:", np.sqrt(((X - X_hat) ** 2).mean()))

21. Summary

See also

Multivariable calculus

Gradients, Jacobians, Hessians. Calculus becomes linear algebra once you're in more than one dimension.

Numerical analysis

How you actually solve these problems on a computer with finite precision: LU, QR, Cholesky, iterative methods.

Self-attention

Attention is a scaled dot-product followed by a softmax. Section 4 was a spoiler.

Embeddings

Dense vector representations of words, images, and anything else. Cosine similarity all the way down.

Foundation models

Trillion-parameter matrices, trained by gradient descent on a next-token loss. Linear algebra at scale.

Gradient descent

How you fit those matrices. Gradients are just stacked partial derivatives — vectors and matrices again.

Optimization

Convex and non-convex optimization, constrained problems, Lagrange multipliers. Linear algebra is the substrate.

Further reading

  • Gilbert Strang, Introduction to Linear Algebra (6th ed., 2023). The canonical textbook. Strang is the reason most of the world thinks about the "four fundamental subspaces" the way they do. His MIT 18.06 lectures are on YouTube and are the best free linear algebra course that exists.
  • Sheldon Axler, Linear Algebra Done Right (4th ed., 2024). The opposite pedagogical choice: starts from linear maps and vector spaces, treats determinants as an afterthought. Better if you already have one linear-algebra course under your belt and want the clean axiomatic version.
  • 3Blue1Brown, Essence of Linear Algebra (YouTube, 2016). A 15-video visual series that does in two hours what a textbook takes a semester to do. Watch it first if you've never seen any of this. 3blue1brown.com/topics/linear-algebra
  • Trefethen & Bau, Numerical Linear Algebra (1997). The reference for how you actually compute SVDs, QRs, and eigendecompositions on finite-precision hardware. Bridge to numerical analysis.
  • Deisenroth, Faisal & Ong, Mathematics for Machine Learning (Cambridge, 2020). A free PDF that covers linear algebra, calculus, and probability specifically for ML. Good if you want everything in one book.
NEXT UP
→ Multivariable calculus

Once your variables are vectors instead of scalars, derivatives become gradients and Jacobians — and you're already halfway to understanding backpropagation. This is the natural follow-on to linear algebra for anyone heading toward ML.