Overview // what you still need
CS isn't only for programmers. A computational biologist hits the wall when a quadratic sequence aligner can't scale to a million reads. A physicist wastes weeks because their simulation's inner loop is O(n²) when it could be O(n log n). A product manager nods along when someone says "that feature is NP-hard, so we'll need a heuristic" without knowing whether to believe them. A student reads a paper on protein folding or climate modelling and stalls on the big-O claims. All four people need the same ideas.
This track exists to cover those blind spots. It isn't a full undergraduate CS curriculum — there's no operating systems, no networks, no databases, no compilers. What's here is the theoretical core, the stuff that applies whether you're running a chemistry simulation, optimizing a supply chain, building a trading model, or shipping a web service. You'll find one hub page (this one) for the skim, and one deep-dive page per subject for the read.
The goal of this page is modest: give you an accurate mental model of each of the four subjects, a couple of interactive demos that let you see the key ideas, and a clear jump-off to the full page when you want the proofs and code. Don't try to read the whole thing in one sitting. Skim it top-to-bottom once, bookmark the section you needed, and come back to that section with a coffee.
01 · Algorithms & Data Structures. The working toolkit — arrays, hash tables, trees, heaps, graphs, and the recipes (sorting, searching, BFS/DFS, dynamic programming) that act on them.
02 · Complexity & Big-O. A notation for talking about cost without getting lost in hardware. Read the code, read the bound, know in advance whether it'll scale.
03 · Theoretical CS. Turing machines, the halting problem, P vs NP. What machines can and can't do — and which problems are expected to be forever hard.
04 · Mathematical Optimization. Pick numbers to minimize a cost, subject to constraints. Gradient descent, convexity, KKT, duality. The math behind portfolio allocation, circuit design, logistics routing, and every machine-learning training loop.
What this track is, and isn't
- Is: a compact tour of the classical CS theory that researchers, scientists, engineers, and curious readers actually reach for in 2026, with inline demos and links to full textbook pages.
- Isn't: a replacement for CLRS, Sipser, or Boyd & Vandenberghe. When you need depth, the deep-dive pages cite them.
- Isn't: a systems curriculum. No OS, networks, databases, or compilers. Those are fine subjects; they just live elsewhere.
- Assumes: you can read pseudocode, have seen a for-loop, and remember roughly what a function is. Everything past that is fair game to re-explain.
When you're ready for the companion tracks, the Math hub covers the prerequisites (discrete math, linear algebra, calculus, probability) and the AI/ML hub is where the optimization and complexity show up in anger.
Pick your path // by goal, not by subject
Four common reasons people land on a CS fundamentals page. Pick the one that looks most like you and follow its reading order — you can always chase cross-links later.
"I want to pass a coding interview."
Start at Algorithms & Data Structures below, then Complexity & Big-O. The interactive binary-search and BFS demos on this page are the same shapes the interviewer will draw on the whiteboard. When you want depth, algorithms.html has every staple with runnable code.
"I want to understand why P vs NP matters."
Skim Complexity for the Big-O language, then read Theoretical CS for Turing machines, halting, and the P vs NP setup. The Turing-machine demo on this page walks through one step at a time; theoretical-cs.html has the full proof sketches and the NP-complete zoo.
"I want to understand ML optimization."
Prereq refresh at multivariable calculus and linear algebra, then the Optimization section below. The gradient-descent-on-a-bowl demo is the core intuition. Then cross over to Gradient Descent and Backpropagation on the ML track.
"I want to reason about big systems."
Start at Complexity so you can read code at scale, then Algorithms to recognize which recipe fits which shape of data. The amortized-analysis bit is the one people always miss and always need. Finish with complexity.html for the recurrence-solving chops.
Every section on this page ends with a Deep Dive link in its heading — that's your jump point when the skim isn't enough.
Algorithms & Data Structures Deep Dive →// the working toolkit
The containers, traversals, and recipes that turn "I could solve this in a week of loops" into "I solved it in ten lines and it runs on a billion inputs."
The single most useful thing you can learn from algorithms is not the code — it's the recognition reflex. A great engineer looks at a new problem and instantly matches it to a class they've seen before: "this is a shortest-path problem, reach for Dijkstra," "this is a dedup, reach for a set," "this is top-k, reach for a heap." The rest is bookkeeping. Everything in this section is in service of building that reflex. The full textbook treatment is in algorithms.html; here we'll do the fast tour.
Arrays & hash tables
Arrays are the floor. A contiguous block of memory you index in O(1). Dynamic arrays (Python list, Go slice, C++ vector) give you O(1) amortized append by doubling the underlying buffer when full — the doubling is the clever bit, and you'll see why in the complexity section. Hash tables layer on top: hash a key to a bucket, resolve collisions with chaining or open addressing, and you get expected O(1) lookup, insert, and delete. About 80% of "make it fast" interview answers are "put it in a hash table." The other 20% are the rest of this section.
When hash tables lie: adversarial inputs can force every key into the same bucket and your O(1) becomes O(n). Languages that care (Rust, Python 3.4+) randomize their hash seed per-process to make this hard to exploit. If you're writing something that takes user-controlled keys, this matters.
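As a minimal Python sketch of that "put it in a hash table" reflex (the function name is my own, not from the deep-dive page), here's a duplicate check that turns a quadratic nested-loop scan into one expected-O(n) pass:

```python
# Sketch: first repeated element via a hash set -- expected O(1) per element,
# versus the O(n^2) nested-loop alternative.
def first_duplicate(items):
    seen = set()               # a hash table of keys with no values
    for x in items:
        if x in seen:          # expected O(1) membership test
            return x
        seen.add(x)            # expected O(1) insert
    return None                # no duplicates

print(first_duplicate([3, 1, 4, 1, 5]))  # → 1
```

Python's `set` and `dict` are both hash tables underneath, so the same shape covers "does this exist?" and "what's the value for this key?".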
Trees & heaps
A binary search tree keeps keys in order so you can find, insert, and delete in O(log n) — provided it stays balanced. Real-world BSTs are balanced via red-black (C++ std::map, Java TreeMap) or AVL rotations; on disk, B-trees fan out to hundreds of children per node so the height stays tiny. A binary heap is a different animal: it only maintains the "parent ≤ children" invariant, which is enough to extract the min in O(log n) without paying to keep everything sorted. That's why every priority queue you've ever used is a heap.
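A hedged sketch of the priority-queue idea using the stdlib `heapq` (the `top_k` helper is mine): keep a min-heap of size k, so the smallest of the current top k sits at the root and n elements cost O(n log k) instead of a full O(n log n) sort.

```python
import heapq

# Sketch: top-k largest via a size-k min-heap.
def top_k(nums, k):
    heap = []
    for x in nums:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:               # heap[0] is the smallest of the top k so far
            heapq.heapreplace(heap, x)  # pop the min and push x in one O(log k) step
    return sorted(heap, reverse=True)   # only k elements left to sort

print(top_k([5, 1, 9, 3, 7, 8], 3))  # → [9, 8, 7]
```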
Graphs & traversal
A graph is vertices and edges, nothing more. The trick is the representation: an adjacency list is a list-of-lists (one list of neighbors per vertex) and is the right default unless your graph is dense. Given a graph, three traversals matter. BFS uses a queue and visits in layers — it finds shortest paths in unweighted graphs for free. DFS uses a stack (or recursion, which is the same thing) and goes deep before wide — it's the engine under topological sort and cycle detection. Dijkstra is BFS with a priority queue for weighted edges, giving you shortest paths in O((V + E) log V).
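The layer-by-layer claim is easy to see in code. A minimal sketch of BFS over an adjacency list stored as a dict-of-lists (helper name and toy graph are mine):

```python
from collections import deque

# Sketch: BFS hop counts from `start`. First visit = shortest path
# in an unweighted graph, because BFS expands in layers.
def bfs_distances(adj, start):
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()            # FIFO queue -- swap for a stack and you get DFS
        for v in adj[u]:
            if v not in dist:      # unvisited
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_distances(adj, "A"))  # → {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Swapping the `deque` for a plain list used as a stack (`pop()` instead of `popleft()`) turns this into iterative DFS, which is why the two traversals are usually taught as one template.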
Sorting
There are three sorts you should know by heart. Mergesort splits the array in half, recursively sorts each half, and merges them in O(n); the recurrence T(n) = 2T(n/2) + O(n) solves to O(n log n). Quicksort partitions around a pivot and recurses on each side; it's O(n log n) on average and O(n²) worst-case, but with random pivots the worst case is astronomically unlikely and the constants are so small that it's faster than mergesort in practice. Heapsort builds a heap and repeatedly extracts the min, giving deterministic O(n log n) without quicksort's variance. You can't beat n log n with comparisons — there's a formal lower bound via decision trees. You can beat it if you're willing to use radix or counting sort and your keys have bounded range.
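Mergesort's recurrence reads straight out of the code. A compact, not-in-place sketch (allocation-heavy on purpose, for clarity rather than speed):

```python
# Sketch: mergesort -- the T(n) = 2T(n/2) + O(n) recurrence in code.
def mergesort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])  # the two T(n/2) calls
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):               # the O(n) merge
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]                     # drain the leftovers

print(mergesort([5, 2, 8, 1, 9, 3]))  # → [1, 2, 3, 5, 8, 9]
```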
Searching
Binary search is the first non-trivial algorithm most people learn, and it's still the one that wins the most wall-clock time in practice. Half the array per step gives you O(log n) — that's 20 steps for a million elements, 30 for a billion. The classic mistakes are off-by-one errors in the midpoint and forgetting that the array has to be sorted first. Two cousins earn their own mention: the two-pointer pattern (one pointer from each end, walk them toward each other), and the sliding window (a variable-length window scanning the array once). Both are ways to turn a nested-loop brute force into a single pass.
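A sketch of binary search using the half-open [lo, hi) convention, one systematic way to dodge the off-by-one traps just mentioned (assumes the input list is sorted):

```python
# Sketch: binary search over a sorted list, half-open interval [lo, hi).
def binary_search(a, target):
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid
        if a[mid] < target:
            lo = mid + 1       # target is strictly right of mid
        else:
            hi = mid           # target is strictly left of mid (hi stays exclusive)
    return -1                  # not found

print(binary_search([1, 3, 5, 7, 9, 11], 7))  # → 3
```

For real code, Python's stdlib `bisect` module implements the same idea; the hand-rolled version is for the reflex, not production.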
Dynamic programming
The scariest-looking technique that is, mechanically, the simplest: if your problem has optimal substructure (the best solution uses best solutions to sub-problems) and overlapping sub-problems (the same sub-problem shows up many times), cache the sub-problem answers. That's it. The rest is craft. You can cache top-down with memoization (recursive + hash table) or bottom-up with tabulation (a for-loop filling an array). Classic problems: Fibonacci, longest common subsequence, 0/1 knapsack, edit distance, matrix chain multiplication, coin change. If you can write the recursive brute force and identify the state, you can almost always DP-ify it.
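Here are both DP styles on one classic from that list, coin change, as a sketch (helper names mine). The coin set (1, 5, 11) is chosen because greedy fails on it: greedy makes 15 as 11+1+1+1+1 (five coins), while the DP finds 5+5+5 (three).

```python
from functools import lru_cache

# Top-down: recursive brute force + a cache (memoization).
@lru_cache(maxsize=None)
def min_coins_td(n, coins=(1, 5, 11)):
    if n == 0:
        return 0
    return 1 + min(min_coins_td(n - c, coins) for c in coins if c <= n)

# Bottom-up: a for-loop filling a table (tabulation).
def min_coins_bu(n, coins=(1, 5, 11)):
    best = [0] + [float("inf")] * n
    for i in range(1, n + 1):
        best[i] = 1 + min(best[i - c] for c in coins if c <= i)
    return best[n]

print(min_coins_td(15), min_coins_bu(15))  # → 3 3
```

Same state ("fewest coins to make amount i"), same recurrence; the only difference is whether the cache fills on demand or in order.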
Pick the right container
Array / dynamic array
Default for ordered collections you'll index into. Not great if you insert/delete in the middle often.
Hash table
When you need "does this exist?" or "what's the value for this key?" as fast as possible, and order doesn't matter.
Balanced BST
When you need hash-table-ish lookup and you need to iterate in key order, or do range queries.
Binary heap
When you keep grabbing the smallest (or largest) element and everything else can wait. Dijkstra, top-k, task schedulers.
Adjacency list
The right default for graphs. Switch to an adjacency matrix only when the graph is dense (edges ~ V²).
Trie
When the keys are strings and you want prefix queries (autocomplete, routing, IP-prefix lookups).
Union-find
Tracking connected components as edges arrive. The backbone of Kruskal's MST and most incremental graph code.
Bloom filter
"Have I seen this before, probably?" Used when a real set would blow your memory budget.
[Interactive demo: binary search — pick a search target and step through the halving. Interactive demo: graph traversal — step through BFS/DFS from node A, ties broken alphabetically.]
Prereqs live in discrete math (graph terminology, induction, Big-O arithmetic). For a much deeper treatment, jump to the full algorithms page — every container on this cheat-sheet gets its own section with runnable Python.
Complexity & Big-O Deep Dive →// reading code at scale
A notation for talking about running time without getting tangled in hardware, language, or compiler details.
What Big-O actually measures
When we say an algorithm is O(n²), we mean: as n grows large, the running time grows no faster than some constant times n². That's an asymptotic upper bound. The constant is hidden on purpose, because it's the part that changes when you port your code to a new laptop. What doesn't change is the shape of the growth — and that's what determines whether your code scales to a million or a billion. A linear-time algorithm is always going to beat a quadratic one for large enough inputs; the only question is how large.
Three related symbols show up in papers. O(f) is the upper bound we just described. Ω(f) is a lower bound — the algorithm takes at least that long in the worst case. Θ(f) is both — a tight bound, upper and lower matching. Most casual talk says "this is O(n log n)" when the speaker means Θ; be generous, it usually doesn't matter.
Big-O — formally: f(n) ∈ O(g(n)) ⟺ ∃ c > 0, ∃ n₀ such that ∀ n ≥ n₀, f(n) ≤ c · g(n).
- f(n)
- The actual running time or operation count of the algorithm as a function of input size.
- g(n)
- A simpler yardstick function — typically 1, log n, n, n log n, n², or 2ⁿ.
- n
- The size of the input — list length, graph vertex count, whatever grows.
- c
- Any positive constant. Big-O hides constants on purpose so the bound survives a move to a different machine.
- n₀
- The threshold past which the bound holds. Big-O is about behavior when n is large; small-n shenanigans are ignored.
- ⟺, ∃, ∀
- "If and only if", "there exists", "for all" — logic shorthand.
Analogy. Big-O is like saying "my rent grows at most linearly with apartment size." It doesn't tell you the rent in dollars — that depends on the city. It tells you the shape of the growth, which is the part that survives when you move.
Amortized analysis & the doubling trick
Push to a Python list a million times. Each append is "O(1)" — except, occasionally, when the underlying array fills up, it has to allocate a bigger array and copy everything over. That copy is O(n). So how is append O(1)? On average, amortized over a long sequence of operations. The trick is that the capacity doubles (not increases by one) on each resize, so resizes happen at n = 1, 2, 4, 8, 16, … The total copy cost of all the resizes up through n is 1 + 2 + 4 + … + n ≈ 2n; add the cost of the n appends themselves and you get ≤ 3n total, so the average is ≤ 3 per append. Constant. This is the canonical amortized-analysis argument: one expensive operation can be hidden inside many cheap ones.
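You can check the arithmetic by simulation. A sketch that counts only the element copies caused by doubling — the accounting, not a real allocator (helper name mine):

```python
# Sketch: total element copies incurred by the doubling strategy over n appends.
def total_copy_cost(n):
    cap, size, copies = 1, 0, 0
    for _ in range(n):
        if size == cap:      # full: allocate a buffer of 2*cap, copy `size` elements
            copies += size
            cap *= 2
        size += 1
    return copies

n = 1_000_000
print(total_copy_cost(n) / n)  # → 1.048575 -- well under the ≈ 2 copies-per-append bound
```

Swap the `cap *= 2` for `cap += 1` and the same function returns roughly n²/2 copies — which is exactly why growing by a constant amount is a famous performance bug.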
The Master Theorem
Divide-and-conquer algorithms give you recurrences of the form T(n) = a · T(n/b) + O(nᵏ) — "I split the problem into a sub-problems each of size n/b, and gluing them back together costs nᵏ." The Master Theorem tells you which term dominates:
- If a > bᵏ, the sub-problems dominate and T(n) = Θ(n^(log_b a)).
- If a = bᵏ, they're balanced and T(n) = Θ(nᵏ log n).
- If a < bᵏ, the combine step dominates and T(n) = Θ(nᵏ).
For mergesort, a = 2, b = 2, k = 1, so a = bᵏ and you get Θ(n log n). For Strassen's matrix multiply, a = 7, b = 2, k = 2, so a > bᵏ and you get Θ(n^(log₂ 7)) ≈ Θ(n^2.807). That's how you beat the naive O(n³).
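The three cases are mechanical enough to put in a function. A sketch (function name and output formatting are mine) that classifies T(n) = a · T(n/b) + O(nᵏ):

```python
import math

# Sketch: Master Theorem case check for T(n) = a*T(n/b) + O(n^k).
def master(a, b, k):
    crit = math.log(a, b)                # log_b(a), the critical exponent
    if a > b ** k:
        return f"Theta(n^{crit:.3f})"    # sub-problems dominate
    if a == b ** k:
        return f"Theta(n^{k} log n)"     # balanced
    return f"Theta(n^{k})"               # combine step dominates

print(master(2, 2, 1))   # mergesort → Theta(n^1 log n)
print(master(7, 2, 2))   # Strassen  → Theta(n^2.807)
```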
The decision-tree lower bound
Here's a question that trips most people up: why can't you sort faster than n log n? The answer is a lower bound, not a missing trick. Any comparison-based sort is a decision tree: each internal node compares two elements, each leaf is a permutation of the input. To distinguish among n! possible input permutations, the tree must have at least n! leaves, so its depth is at least log₂(n!) ≈ n log n (by Stirling). Any comparison sort executes at least one root-to-leaf path, so it does at least n log n comparisons in the worst case. End of story — unless you stop comparing and start exploiting key structure (radix sort, counting sort).
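To see the Stirling step numerically, compare log₂(n!) with n · log₂(n) — a sketch using `math.lgamma` so we never have to compute n! itself (helper name mine):

```python
import math

# Sketch: log2(n!) via lgamma(n+1) = ln(n!), without materializing n!.
def log2_factorial(n):
    return math.lgamma(n + 1) / math.log(2)

# The comparison-count floor vs the n*log2(n) yardstick:
for n in (10, 1_000, 1_000_000):
    print(n, round(log2_factorial(n)), round(n * math.log2(n)))
```

For every n the floor log₂(n!) sits just below n · log₂(n) (Stirling says the gap is about n · log₂(e) ≈ 1.44n), which is why "you can't sort faster than n log n comparisons" is tight, not just an upper estimate.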
How fast is "fast"? A cheat sheet
A modern CPU does something like 10⁹ simple operations per second. Here's what that buys you, approximately, for each growth class — the "n you can handle in one second" column is the one to burn into your memory.
| Growth | Representative algorithm | n you can handle in ~1 second |
|---|---|---|
| O(log n) | Binary search | Any n you can represent (~10¹⁸) |
| O(n) | Linear scan, hash lookups, BFS | ~10⁸ |
| O(n log n) | Mergesort, heap operations, FFT | ~10⁷ |
| O(n²) | Bubble sort, nested loops, Floyd-Warshall | ~10⁴ |
| O(n³) | Naive matrix multiply, all-pairs shortest path | ~500 |
| O(2ⁿ) | Subset sum brute force, TSP DP | ~25 |
| O(n!) | Permutation enumeration, naive TSP | ~11 |
[Interactive demo: growth classes plotted against input size.]
When you're staring at code in review, ask two questions: "what's n here, and what shape is the work?" If you find a nested loop where both levels go up to n, you've got a quadratic and you should know whether that's OK. If the outer loop is n and the inner loop is log n (say, a binary search), you've got n log n and you're fine. The log n factor is the one that almost never matters.
Theoretical CS Deep Dive →// what machines can and can't do
What can a computer do at all, and what can it do efficiently? The answers turn out to be "less than you'd expect," in both senses.
The Turing machine
In 1936, Alan Turing — 23 years old, a fresh Cambridge graduate — wrote down the simplest possible mathematical model of a computer. One infinite tape of symbols. A read-write head that can move one cell at a time. A finite-state controller. A transition rule: "if I'm in state q and the head reads symbol s, write s', move the head left or right, and enter state q'." That's the whole machine. And every program you will ever run — every LLM, every game engine, every operating system — is reducible to something that could, in principle, be simulated on this one-tape model. That claim is the Church-Turing thesis, and no one has found a counter-example in nearly 90 years.
The reason the Turing machine matters is not that it's efficient — it's that it's universal. Any question about what computers can do can be rephrased as a question about what Turing machines can do. That move lets theoretical CS make statements of the form "no machine of any kind can ever solve X," which is a stronger statement than "no machine we currently know can solve X."
The halting problem
Turing's headline result: there is no program that can, in general, tell whether another program halts. The proof is a one-page diagonal argument and it's worth reading at least once in your life. Sketch: suppose a universal halting-checker H(P, x) existed, returning true iff program P halts on input x. Build a new program D that reads program P as input and does: "if H(P, P) says halts, loop forever; else halt." Now run D(D). If it halts, H said it loops — contradiction. If it loops, H said it halts — contradiction. Therefore H cannot exist. The trick is the diagonal (running D on itself), and it's the same move Cantor used to show the reals are uncountable.
# Assume such an H exists:
def H(P, x):
    return True if P halts on x else False

# Build D:
def D(P):
    if H(P, P):
        while True: pass  # loop forever
    else:
        return            # halt immediately

# Now run D(D).
# If D(D) halts, then H(D, D) was True → D(D) loops. Contradiction.
# If D(D) loops, then H(D, D) was False → D(D) halts. Contradiction.
# So H cannot exist.
A huge number of "can I check at build time whether my program does X?" questions turn out to be the halting problem in disguise — type inference for unrestricted languages, program equivalence, virus detection from source code. When a compiler says "sorry, that's undecidable in general, here's a conservative approximation," the halting problem is usually what's hiding in the background.
P vs NP
Once you've accepted that some problems are unsolvable at all, the next natural question is: among the solvable ones, which are efficient? Theoretical CS organizes problems into complexity classes, and two matter more than the rest:
- P — problems solvable in polynomial time. O(n), O(n log n), O(n²), O(n¹⁰⁰) — all in P. These are the "efficient" problems.
- NP — problems whose solutions can be verified in polynomial time. Given a candidate answer, you can check it fast, even if finding it is hard. Sudoku, SAT, TSP, graph coloring — all in NP.
Every problem in P is in NP (if you can solve it quickly, you can verify a solution quickly). The million-dollar question is the reverse: is every problem in NP also in P? P = NP? If yes, then every problem whose solutions you can verify, you can also find — which would break cryptography, revolutionize optimization, and make a lot of hard problems trivial. Almost everyone believes P ≠ NP, but no one has proved it. The Clay Millennium Prize offers $1M to whoever does.
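Subset Sum makes the verify-vs-find asymmetry concrete. In this sketch (helper names and numbers mine), checking a claimed certificate is a one-line sum, while finding one is brute force over all 2ⁿ subsets:

```python
from itertools import combinations

# Polynomial-time verifier: given a candidate subset, checking it is trivial.
def verify(nums, target, subset):
    return sum(subset) == target and all(x in nums for x in subset)

# Exponential-time finder: try every subset (O(2^n)).
def solve(nums, target):
    for r in range(len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = solve(nums, 9)
print(cert, verify(nums, 9, cert))  # → [4, 5] True
```

That gap — cheap `verify`, expensive `solve` — is the whole content of "NP but maybe not P." P = NP would mean the expensive half is always avoidable.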
Reductions & NP-completeness
A reduction from problem A to problem B is a polynomial-time transformation that turns any instance of A into an equivalent instance of B. If such a reduction exists, we say "A reduces to B," written A ≤ₚ B, and we know B is at least as hard as A — if you could solve B efficiently, you'd solve A efficiently too by just running the transformation first.
In 1971 Stephen Cook (and independently Leonid Levin) proved that the Boolean satisfiability problem SAT is NP-hard: every problem in NP reduces to SAT. That single result anchored everything. Once you have one NP-hard problem, you can prove a new problem is NP-hard just by reducing SAT to it — or any other known-hard problem to it. Karp's 1972 paper then took 21 classic combinatorial problems and reduced them to each other, establishing most of the NP-complete canon: 3-SAT, Clique, Vertex Cover, Hamiltonian Cycle, TSP, Subset Sum, Graph Coloring. If P ≠ NP, all of these are forever hard in the worst case, and we settle for approximations and heuristics.
[Interactive demo: a Turing machine stepping over its input tape, one transition at a time.]
For the full proofs, the NP-complete zoo, and a longer TM tutorial with two interactive figures, jump to theoretical-cs.html. The prereq is discrete math (sets, functions, proof by contradiction). The practical consequences of NP-hardness show up in the optimization track below — we solve NP-hard problems every day, just not optimally.
Mathematical Optimization Deep Dive →// the math under ML
Pick variables to make a number as small as possible, subject to rules. That's the entire field — and it's the engine under every neural network, every airline schedule, every routing system.
Every optimization problem has the same three pieces: an objective (the number you want to minimize), variables (the knobs you can turn), and constraints (the rules the variables must obey). Training a neural net? Objective is the loss, variables are the weights, constraints are usually "none" (or "non-negative" for some regularizers). Routing trucks? Objective is total distance, variables are the edges you use, constraints are "every city visited exactly once." Same math, different clothes.
Prerequisites: this section leans on multivariable calculus (partial derivatives, the gradient ∇f, the chain rule) and a little linear algebra (dot products, matrix-vector operations). If those feel rusty, skim them before continuing.
Convex vs non-convex
A function is convex if the line segment between any two points on its graph lies on or above the graph — think of a bowl. Convex functions are the safe zone of optimization: every local minimum is a global minimum, so any algorithm that just "walks downhill" is guaranteed to find the best answer. Linear functions, upward-opening quadratics like x², and the exponential are convex; so is any sum of convex functions, and any pointwise max of convex functions. Logistic regression is convex. Support vector machines are convex. Linear and quadratic programs are convex.
Neural networks are not convex. Their loss landscape is a high-dimensional mess with countless local minima, saddle points, and flat regions. The miracle of modern deep learning is that, empirically, the local minima SGD finds in that landscape tend to be almost as good as the global minimum would be. Nobody fully understands why, but it works — we've built a trillion-dollar industry on it.
Gradient descent, SGD, Adam
The simplest optimization algorithm that works: compute the gradient ∇f (the direction of steepest ascent), take a small step in the opposite direction, repeat. That's gradient descent. The step size is a hyperparameter called the learning rate — too small and training drags, too large and you overshoot and diverge. Stochastic gradient descent (SGD) is the variant used in ML: instead of computing the true gradient over the whole dataset (which would be prohibitively expensive), estimate it from a random mini-batch. Noisier steps, vastly cheaper per step, surprisingly helpful for escaping saddle points.
Adam is the workhorse. Published by Kingma & Ba in 2014, it keeps a per-parameter exponential moving average of both the gradient (momentum) and the squared gradient (adaptive scale), and uses the ratio to pick a step size that is small for wildly-varying parameters and large for slow-moving ones. The large majority of production training runs still use Adam or its close relative AdamW. It isn't optimal — pure SGD with a warmup schedule sometimes beats it — but it's forgiving of bad hyperparameters, and that's worth a lot.
Vanilla gradient descent: x_{t+1} = x_t − η · ∇f(x_t)
- x_t
- The current parameter vector (the weights, in ML) at iteration t.
- ∇f(x_t)
- The gradient of the objective f at x_t — a vector pointing in the direction of steepest ascent.
- η (eta)
- The learning rate. Usually a small positive number like 0.01 or 10⁻⁴.
- x_{t+1}
- The new parameter vector — one step downhill.
Analogy. You're blindfolded on a hillside and want to reach the valley. Feel the slope under your feet, take a small step in the downhill direction, repeat. If the hill is a bowl (convex), you're guaranteed to reach the bottom. If it's the Rockies (non-convex), you'll reach a bottom, not necessarily the deepest one.
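Here's the vanilla update rule run on an actual bowl — a sketch minimizing f(x, y) = x² + 10y², with hand-coded partial derivatives standing in for autodiff and the learning rate picked by hand:

```python
# Sketch: vanilla gradient descent x_{t+1} = x_t - eta * grad(f)(x_t)
# on the convex bowl f(x, y) = x^2 + 10*y^2.
def grad(x, y):
    return 2 * x, 20 * y            # partial derivatives of f

x, y, eta = 5.0, 2.0, 0.05          # start point and learning rate
for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy   # step downhill

print(round(x, 6), round(y, 6))  # → 0.0 0.0 -- the global minimum of the bowl
```

Try eta = 0.11 and the y-coordinate diverges (the multiplier 1 − 20η exceeds 1 in magnitude) — the "too large and you overshoot" failure mode in two lines.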
Linear & KKT
When your objective and constraints are all linear functions of the variables, you have a linear program (LP). LPs are the success story of applied optimization: Dantzig's 1947 simplex method solves million-variable LPs in seconds, and modern interior-point methods handle billions. Airlines schedule flights as LPs. Factories plan production as LPs. Hedge funds rebalance portfolios as LPs. Whenever you can massage a problem into "minimize cᵀx subject to Ax ≤ b," you're done — the tooling is that good.
For problems with both constraints and nonlinear objectives, the general recipe is the KKT conditions (Karush-Kuhn-Tucker). Informally: at any constrained optimum, the negative gradient of the objective is a non-negative linear combination of the gradients of the active constraints — you can't move further downhill without violating a constraint. That sounds abstract until you realize it's how SVMs get solved, how Lagrangian mechanics gets derived, and how constrained neural nets (with weight clipping, norm bounds, or output simplex) are trained.
Duality
Every minimization problem has a dual maximization problem built from the same data. Weak duality says the dual's best value is always ≤ the primal's best value. For convex problems, strong duality usually holds: they're equal. That's more useful than it looks. The dual can be easier to solve, it gives you a certificate of optimality, and in SVM-land the dual is where the "kernel trick" lives — kernels only make sense because the dual formulation of SVMs is expressed entirely in terms of dot products, which you can replace with kernel functions. Linear programming duality has its own folklore: every LP's dual is itself an LP, the dual of the dual is the original primal, and the simplex method effectively solves both at once.
[Interactive demo: gradient descent on a bowl — choose a start point and a learning rate.]
The full treatment — Newton's method, quasi-Newton (BFGS), Lagrange multipliers, integer programming, convex optimization, duality gap, interior-point methods — is in optimization.html. For the ML-facing version with worked backprop and training loops, see Gradient Descent and Backpropagation on the AI/ML track.