Convolution
The operation that taught neural networks to see. A small learnable filter slides across an image, computing dot products — and somehow, this turns out to be the right shape for understanding pictures.
1. Why CNNs exist
A grayscale image at 224×224 resolution has $224 \times 224 = 50{,}176$ pixels. A fully-connected layer mapping this image to a hidden layer of size 1000 would need $50{,}176 \cdot 1000 \approx 5 \times 10^7$ weights. That's already huge — and this is just the first layer of one small network. Worse, the network learns from scratch that the pixel at position $(10, 20)$ and the pixel at position $(11, 20)$ are neighbors, that shifting the image by one pixel shouldn't change the classification, and so on. Every translation invariance has to be learned from data.
Convolutional layers encode three assumptions about images directly into the architecture, and in doing so they slash the parameter count and dramatically improve sample efficiency:
- Locality. A pixel is most related to its neighbors. You don't need to connect every pixel to every hidden unit — you only need small local patches.
- Translation invariance. A cat is a cat regardless of where in the frame it appears. The same pattern-detector should apply everywhere.
- Hierarchy of features. Complex features (a cat's face) are composed of simpler ones (eyes, nose, ears), which are composed of simpler ones still (edges, corners, textures).
The convolution operation delivers the first two directly: a single small filter is applied (with shared weights) at every location in the image. Stacking convolutional layers delivers the third: deeper layers see wider "receptive fields" built from the outputs of earlier layers.
A 3×3 convolutional layer processing a 224×224 image with 64 output channels has $3 \times 3 \times 64 = 576$ weights. A fully-connected layer with the same number of output units would have $50{,}176 \times 64 \approx 3.2 \times 10^6$ weights. That's a 5500× parameter reduction — for free, by encoding the structure of images.
2. 1D convolution
Before tackling images, let's define convolution on a 1D sequence. Let $x$ be a signal of length $N$ and $k$ a filter ("kernel") of length $K$. The (discrete) cross-correlation — what deep learning actually computes, and confusingly calls "convolution" — is:
$y[i] = \sum_{j=0}^{K-1} k[j] \cdot x[i + j]$
Formal mathematical convolution flips the kernel: $y[i] = \sum_j k[j] \cdot x[i - j]$. Cross-correlation does not: $y[i] = \sum_j k[j] \cdot x[i + j]$. Deep learning calls both "convolution" but always computes cross-correlation. The distinction matters only if you care about sign conventions when transferring formulas from signal processing textbooks. Since the kernel is learned, flipping the indices just learns a flipped kernel — the operation is the same up to a reindexing.
$x = [1, 2, 3, 4, 5]$, $k = [1, 0, -1]$ (a crude derivative filter).
$y[0] = 1\cdot 1 + 0\cdot 2 + (-1)\cdot 3 = -2$
$y[1] = 1\cdot 2 + 0\cdot 3 + (-1)\cdot 4 = -2$
$y[2] = 1\cdot 3 + 0\cdot 4 + (-1)\cdot 5 = -2$
Output $y = [-2, -2, -2]$. Every output value is the signed difference across a width-3 window — exactly what a finite-difference derivative computes.
Notice the output has length $N - K + 1 = 5 - 3 + 1 = 3$. The filter fits into the signal in 3 distinct positions. We'll recover the general formula in the output-shape section below.
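The kernel-flip relationship from the note above is easy to verify numerically. A quick sketch using NumPy's built-ins (`np.correlate` computes cross-correlation, `np.convolve` true convolution):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

# Cross-correlation: what a conv layer actually computes.
xcorr = np.correlate(x, k, mode="valid")   # [-2. -2. -2.]

# True convolution flips the kernel before sliding it.
conv = np.convolve(x, k, mode="valid")     # [2. 2. 2.]

# Convolving with the *flipped* kernel recovers cross-correlation —
# which is why a learned kernel makes the two operations interchangeable.
assert np.array_equal(np.convolve(x, k[::-1], mode="valid"), xcorr)
```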
Source — 1D Convolution

Pseudocode
function conv1d(x, k, padding="valid"):
# x: signal of length N
# k: kernel of length K
# cross-correlation: y[n] = sum_{m=0}^{K-1} k[m] * x[n+m]
if padding == "same":
pad = (K - 1) / 2
x = pad_zeros(x, left=pad, right=pad)
N = length(x)
out_len = N - K + 1
y = zeros(out_len)
for n in 0..out_len-1:
for m in 0..K-1:
y[n] += k[m] * x[n + m]
return y

Python
import numpy as np
def conv1d(x, k, padding="valid"):
"""
Naive 1D cross-correlation (what deep learning calls 'convolution').
x: 1D array of length N
k: 1D kernel of length K
padding: 'valid' (no pad) or 'same' (output length == input length)
"""
K = len(k)
if padding == "same":
pad = (K - 1) // 2
x = np.pad(x, pad)
N = len(x)
out_len = N - K + 1
y = np.zeros(out_len)
for n in range(out_len):
y[n] = np.dot(k, x[n:n + K])
return y
# Example
x = np.array([1, 2, 3, 4, 5], dtype=float)
k = np.array([1, 0, -1], dtype=float)
print(conv1d(x, k)) # valid: [-2. -2. -2.]
print(conv1d(x, k, "same")) # same: [-2. -2. -2. -2. 4.]

JavaScript
function conv1d(x, k, padding = 'valid') {
const K = k.length;
if (padding === 'same') {
const pad = Math.floor((K - 1) / 2);
x = [...new Array(pad).fill(0), ...x, ...new Array(pad).fill(0)];
}
const outLen = x.length - K + 1;
const y = new Array(outLen).fill(0);
for (let n = 0; n < outLen; n++) {
for (let m = 0; m < K; m++) {
y[n] += k[m] * x[n + m];
}
}
return y;
}
// Example
const x = [1, 2, 3, 4, 5];
const k = [1, 0, -1];
console.log(conv1d(x, k)); // [-2, -2, -2]
console.log(conv1d(x, k, 'same')); // [-2, -2, -2, -2, 4]

C
#include <string.h>
/* valid padding only; caller provides output array of length N - K + 1 */
void conv1d(const double *x, int N, const double *k, int K, double *y) {
int out_len = N - K + 1;
memset(y, 0, out_len * sizeof(double));
for (int n = 0; n < out_len; n++)
for (int m = 0; m < K; m++)
y[n] += k[m] * x[n + m];
}
/* same padding: zero-pad input, output length == N */
void conv1d_same(const double *x, int N, const double *k, int K, double *y) {
int pad = (K - 1) / 2;
int padded_len = N + 2 * pad;
double padded[padded_len];
memset(padded, 0, sizeof(padded));
for (int i = 0; i < N; i++) padded[i + pad] = x[i];
conv1d(padded, padded_len, k, K, y);
}

C++
#include <vector>
using namespace std;
vector<double> conv1d(vector<double> x, const vector<double>& k,
const string& padding = "valid") {
int K = k.size();
if (padding == "same") {
int pad = (K - 1) / 2;
x.insert(x.begin(), pad, 0.0);
x.insert(x.end(), pad, 0.0);
}
int outLen = x.size() - K + 1;
vector<double> y(outLen, 0.0);
for (int n = 0; n < outLen; n++)
for (int m = 0; m < K; m++)
y[n] += k[m] * x[n + m];
return y;
}

Java
public class Conv1D {
public static double[] conv1d(double[] x, double[] k, String padding) {
int K = k.length;
if ("same".equals(padding)) {
int pad = (K - 1) / 2;
double[] padded = new double[x.length + 2 * pad];
System.arraycopy(x, 0, padded, pad, x.length);
x = padded;
}
int outLen = x.length - K + 1;
double[] y = new double[outLen];
for (int n = 0; n < outLen; n++)
for (int m = 0; m < K; m++)
y[n] += k[m] * x[n + m];
return y;
}
}

Go
package conv
func Conv1D(x, k []float64, padding string) []float64 {
K := len(k)
if padding == "same" {
pad := (K - 1) / 2
padded := make([]float64, len(x)+2*pad)
copy(padded[pad:], x)
x = padded
}
outLen := len(x) - K + 1
y := make([]float64, outLen)
for n := 0; n < outLen; n++ {
for m := 0; m < K; m++ {
y[n] += k[m] * x[n+m]
}
}
return y
}

3. 2D convolution on images
An image is just a 2D array, and 2D convolution is the straightforward generalization. For an $H \times W$ input and a $K_h \times K_w$ kernel:
$y[i, j] = \sum_{u=0}^{K_h - 1} \sum_{v=0}^{K_w - 1} k[u, v] \cdot x[i + u,\ j + v]$
In plain English: place the kernel on the image so its top-left corner is at $(i, j)$; multiply each kernel entry by the image pixel beneath it; sum everything. That's one output pixel. Then slide the kernel over by one position and do it again. Keep sliding until you've covered the image.
Interactive: watch a kernel slide across an image
A 3×3 kernel slides over a 5×5 input image; at each position, the nine products are computed and summed, and the result becomes one output pixel.
Source — 2D Convolution / Cross-Correlation

Pseudocode
function conv2d(image, kernel, padding="valid"):
# image: H x W
# kernel: Kh x Kw
# y[i,j] = sum_{u,v} kernel[u,v] * image[i+u, j+v]
if padding == "same":
ph = (Kh - 1) / 2
pw = (Kw - 1) / 2
image = zero_pad(image, top=ph, bottom=ph, left=pw, right=pw)
H, W = shape(image)
out_h = H - Kh + 1
out_w = W - Kw + 1
y = zeros(out_h, out_w)
for i in 0..out_h-1:
for j in 0..out_w-1:
for u in 0..Kh-1:
for v in 0..Kw-1:
y[i][j] += kernel[u][v] * image[i+u][j+v]
return y

Python
import numpy as np
def conv2d(image, kernel, padding="valid"):
"""
Naive 2D cross-correlation on a grayscale (H x W) image.
padding: 'valid' — output is (H-Kh+1) x (W-Kw+1)
'same' — output is H x W (zero-padded)
"""
Kh, Kw = kernel.shape
if padding == "same":
ph, pw = (Kh - 1) // 2, (Kw - 1) // 2
image = np.pad(image, ((ph, ph), (pw, pw)))
H, W = image.shape
out_h, out_w = H - Kh + 1, W - Kw + 1
y = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
y[i, j] = np.sum(kernel * image[i:i+Kh, j:j+Kw])
return y
# Example — edge detection on a 5x5 image
img = np.array([
[0,0,0,0,0],
[0,1,1,1,0],
[0,1,2,1,0],
[0,1,1,1,0],
[0,0,0,0,0]], dtype=float)
edge = np.array([[-1,-1,-1],[-1,8,-1],[-1,-1,-1]], dtype=float)
print(conv2d(img, edge)) # valid: 3x3
print(conv2d(img, edge, "same")) # same: 5x5

JavaScript
function conv2d(image, kernel, padding = 'valid') {
const Kh = kernel.length, Kw = kernel[0].length;
if (padding === 'same') {
const ph = Math.floor((Kh-1)/2), pw = Math.floor((Kw-1)/2);
const H = image.length, W = image[0].length;
const padded = Array.from({length: H+2*ph}, (_, i) =>
Array.from({length: W+2*pw}, (_, j) =>
(i >= ph && i < H+ph && j >= pw && j < W+pw)
? image[i-ph][j-pw] : 0));
image = padded;
}
const H = image.length, W = image[0].length;
const outH = H - Kh + 1, outW = W - Kw + 1;
return Array.from({length: outH}, (_, i) =>
Array.from({length: outW}, (_, j) => {
let s = 0;
for (let u = 0; u < Kh; u++)
for (let v = 0; v < Kw; v++)
s += kernel[u][v] * image[i+u][j+v];
return s;
}));
}

C
#include <string.h>
#include <stdlib.h>
/* valid padding only.
image[H][W], kernel[Kh][Kw], out[(H-Kh+1)][(W-Kw+1)] */
void conv2d_valid(int H, int W, int Kh, int Kw,
double image[H][W], double kernel[Kh][Kw],
double out[H-Kh+1][W-Kw+1])
{
int oh = H-Kh+1, ow = W-Kw+1;
for (int i=0; i<oh; i++)
for (int j=0; j<ow; j++) {
double s = 0;
for (int u=0; u<Kh; u++)
for (int v=0; v<Kw; v++)
s += kernel[u][v] * image[i+u][j+v];
out[i][j] = s;
}
}
/* same padding: pad H x W input to (H+Kh-1) x (W+Kw-1), then convolve */
void conv2d_same(int H, int W, int Kh, int Kw,
double image[H][W], double kernel[Kh][Kw],
double out[H][W])
{
int ph = (Kh-1)/2, pw = (Kw-1)/2;
int pH = H+2*ph, pW = W+2*pw;
double (*padded)[pW] = calloc(pH, sizeof(*padded));
for (int i=0;i<H;i++) for(int j=0;j<W;j++) padded[i+ph][j+pw]=image[i][j];
conv2d_valid(pH, pW, Kh, Kw, padded, kernel, out);
free(padded);
}

C++
#include <vector>
using namespace std;
using Mat = vector<vector<double>>;
Mat conv2d(Mat image, const Mat& kernel, const string& padding = "valid") {
int Kh = kernel.size(), Kw = kernel[0].size();
if (padding == "same") {
int ph = (Kh-1)/2, pw = (Kw-1)/2;
int H = image.size(), W = image[0].size();
Mat padded(H+2*ph, vector<double>(W+2*pw, 0));
for (int i=0;i<H;i++) for(int j=0;j<W;j++) padded[i+ph][j+pw]=image[i][j];
image = padded;
}
int H=image.size(), W=image[0].size();
int oh=H-Kh+1, ow=W-Kw+1;
Mat out(oh, vector<double>(ow, 0));
for (int i=0;i<oh;i++) for(int j=0;j<ow;j++) {
double s=0;
for(int u=0;u<Kh;u++) for(int v=0;v<Kw;v++) s+=kernel[u][v]*image[i+u][j+v];
out[i][j]=s;
}
return out;
}

Java
public class Conv2D {
public static double[][] conv2d(double[][] image, double[][] kernel, String padding) {
int Kh=kernel.length, Kw=kernel[0].length;
if ("same".equals(padding)) {
int ph=(Kh-1)/2, pw=(Kw-1)/2;
int H=image.length, W=image[0].length;
double[][] padded=new double[H+2*ph][W+2*pw];
for(int i=0;i<H;i++) for(int j=0;j<W;j++) padded[i+ph][j+pw]=image[i][j];
image=padded;
}
int H=image.length, W=image[0].length;
int oh=H-Kh+1, ow=W-Kw+1;
double[][] out=new double[oh][ow];
for(int i=0;i<oh;i++) for(int j=0;j<ow;j++) {
double s=0;
for(int u=0;u<Kh;u++) for(int v=0;v<Kw;v++) s+=kernel[u][v]*image[i+u][j+v];
out[i][j]=s;
}
return out;
}
}

Go
package conv
func Conv2D(image [][]float64, kernel [][]float64, padding string) [][]float64 {
Kh, Kw := len(kernel), len(kernel[0])
if padding == "same" {
ph, pw := (Kh-1)/2, (Kw-1)/2
H, W := len(image), len(image[0])
padded := make([][]float64, H+2*ph)
for i := range padded { padded[i] = make([]float64, W+2*pw) }
for i := 0; i < H; i++ { for j := 0; j < W; j++ { padded[i+ph][j+pw] = image[i][j] } }
image = padded
}
H, W := len(image), len(image[0])
oh, ow := H-Kh+1, W-Kw+1
out := make([][]float64, oh)
for i := range out {
out[i] = make([]float64, ow)
for j := 0; j < ow; j++ {
s := 0.0
for u := 0; u < Kh; u++ { for v := 0; v < Kw; v++ { s += kernel[u][v] * image[i+u][j+v] } }
out[i][j] = s
}
}
return out
}

4. Famous hand-crafted kernels
Before deep learning, computer vision engineers designed filters by hand. Each one responds to a specific visual pattern. Understanding them gives intuition for what learned convolutional filters do (roughly — the real story is more complicated).
Edge detector
$\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$
Output is large where the center pixel differs from its neighbors, i.e. at edges. The entries sum to 0, so flat regions output exactly zero.
Sobel X
$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$
Approximates the horizontal derivative. Large positive response at rising vertical edges, large negative at falling. The Sobel Y filter is its transpose.
Gaussian blur
$\frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$
Smooths the image by averaging each pixel with its neighbors, weighted by a Gaussian. Entries sum to 1 so average brightness is preserved.
Sharpen
$\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$
Amplifies the difference between each pixel and its neighbors. Equivalent to "identity + edge enhance." Sums to 1 so brightness is preserved.
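To see the sum-of-entries properties in action, here's a quick sketch applying the kernels above to a constant-brightness image, using a minimal valid-mode cross-correlation (a compressed version of the conv2d above):

```python
import numpy as np

def conv2d_valid(img, k):
    """Minimal valid-mode 2D cross-correlation."""
    Kh, Kw = k.shape
    H, W = img.shape
    return np.array([[np.sum(k * img[i:i+Kh, j:j+Kw])
                      for j in range(W - Kw + 1)]
                     for i in range(H - Kh + 1)])

edge    = np.array([[-1,-1,-1],[-1, 8,-1],[-1,-1,-1]], dtype=float)
blur    = np.array([[ 1, 2, 1],[ 2, 4, 2],[ 1, 2, 1]], dtype=float) / 16
sharpen = np.array([[ 0,-1, 0],[-1, 5,-1],[ 0,-1, 0]], dtype=float)

flat = np.full((5, 5), 7.0)          # constant-brightness image
print(conv2d_valid(flat, edge))      # all zeros: entries sum to 0
print(conv2d_valid(flat, blur))      # all 7s: entries sum to 1
print(conv2d_valid(flat, sharpen))   # all 7s: entries also sum to 1
```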
5. Padding, stride, dilation
Padding
Convolving an $H \times W$ image with a $K \times K$ kernel naively yields a $(H - K + 1) \times (W - K + 1)$ output. The output shrinks each layer — after several layers you run out of image. To fix this we pad the input with extra rows and columns (usually zeros) before convolving.
- Valid / no padding: no padding added. Output shrinks by $K - 1$ in each dimension.
- Same padding: pad by enough to make the output the same size as the input. For a $K \times K$ kernel with odd $K$, that means $(K - 1)/2$ on each side.
- Full padding: pad by $K - 1$ on each side. Output is larger than input — rarely used.
Stride
Normal convolution slides the kernel by 1 position at a time. A stride of $s$ moves it $s$ positions between applications, skipping $s - 1$ of them. Stride 2, for instance, halves the output resolution — a cheap way to downsample. Strided convolution is an alternative to pooling for shrinking the spatial dimensions.
Dilation (atrous convolution)
A dilated kernel samples input positions that are $d$ pixels apart instead of adjacent. A $3 \times 3$ kernel with dilation 2 covers a $5 \times 5$ receptive field but still has only 9 weights. Dilation is critical in segmentation networks where you want a large receptive field without extra parameters or downsampling.
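A minimal sketch of dilation in 1D (the 2D case dilates both axes the same way): the kernel keeps its 3 weights, but with dilation 2 each output is computed from a width-5 window.

```python
import numpy as np

def dilated_conv1d(x, k, dilation=1):
    """Valid-mode 1D cross-correlation with a dilated kernel.
    One output covers dilation * (K - 1) + 1 input positions."""
    K = len(k)
    span = dilation * (K - 1) + 1
    out_len = len(x) - span + 1
    return np.array([sum(k[m] * x[n + m * dilation] for m in range(K))
                     for n in range(out_len)])

x = np.arange(10, dtype=float)
k = np.array([1., 0., -1.])
print(dilated_conv1d(x, k, dilation=1))  # width-3 windows: x[n] - x[n+2] = -2
print(dilated_conv1d(x, k, dilation=2))  # width-5 windows, still 3 weights: -4
```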
6. Output shape formula
Putting padding, stride, and dilation together, the output size of a 1D convolution is:
$O = \left\lfloor \frac{I + 2P - D(K - 1) - 1}{S} \right\rfloor + 1$
where $I$ is input size, $P$ is padding (on each side), $K$ is kernel size, $D$ is dilation, $S$ is stride. 2D is identical, computed independently for height and width.
$I = 28$, $K = 3$, $P = 0$, $S = 1$, $D = 1$ $\rightarrow$ $O = \lfloor (28 + 0 - 2 - 1)/1 \rfloor + 1 = 26$
$I = 28$, $K = 3$, $P = 1$, $S = 1$, $D = 1$ (same padding) $\rightarrow$ $O = \lfloor (28 + 2 - 2 - 1)/1 \rfloor + 1 = 28$
$I = 28$, $K = 3$, $P = 1$, $S = 2$, $D = 1$ (strided downsample) $\rightarrow$ $O = \lfloor (28 + 2 - 2 - 1)/2 \rfloor + 1 = 14$
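A one-line helper makes the formula concrete — the assertions reproduce the three examples above:

```python
def conv_out_size(I, K, P=0, S=1, D=1):
    """Output length: floor((I + 2P - D*(K-1) - 1) / S) + 1."""
    return (I + 2 * P - D * (K - 1) - 1) // S + 1

assert conv_out_size(28, 3)           == 26  # valid
assert conv_out_size(28, 3, P=1)      == 28  # same padding
assert conv_out_size(28, 3, P=1, S=2) == 14  # strided downsample
assert conv_out_size(28, 3, D=2)      == 24  # dilation widens the effective kernel
```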
7. Multi-channel convolutions
Real images have 3 color channels (RGB). After the first layer, feature maps typically have dozens or hundreds of channels. A convolutional layer needs to consume all input channels and produce all output channels.
For an input with $C_\text{in}$ channels and an output with $C_\text{out}$ channels, the kernel is a 4D tensor of shape $[C_\text{out}, C_\text{in}, K, K]$. Each output channel is computed as the sum of separate 2D convolutions applied to each input channel, with a different kernel per input channel:
$y_{c} = b_{c} + \sum_{c'=0}^{C_\text{in}-1} k_{c,\,c'} * x_{c'} \quad \text{for } c = 0, \dots, C_\text{out}-1$
The number of weights in such a layer is $C_\text{out} \cdot C_\text{in} \cdot K \cdot K$, plus $C_\text{out}$ biases.
The first layer of a typical CNN might be: input 3 channels (RGB), output 64 channels, kernel 3×3.
Weights: $64 \times 3 \times 3 \times 3 = 1{,}728$ weights + 64 biases = 1,792 parameters.
For comparison: a fully-connected layer taking a $224 \times 224 \times 3$ image to 64 hidden units would need $\approx 9.6 \times 10^6$ parameters.
1×1 convolutions
A kernel with $K = 1$ has no spatial extent — it just mixes channels at each pixel. 1×1 convolutions are surprisingly useful: they provide a cheap way to change the number of channels, let the network learn per-pixel channel recombinations, and form the backbone of "bottleneck" blocks in ResNets and similar architectures.
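A sketch of this view in NumPy: on a $(C_\text{in}, H, W)$ feature map, a 1×1 convolution is a matrix multiply applied independently at every pixel. The shapes and variable names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 8, 4, 6, 6
x  = rng.normal(size=(C_in, H, W))   # feature map
W1 = rng.normal(size=(C_out, C_in))  # 1x1 kernel, spatial dims squeezed out

# A 1x1 convolution mixes channels at each pixel independently:
y = np.einsum('oi,ihw->ohw', W1, x)

# Equivalent: flatten pixels, multiply, reshape back.
y2 = (W1 @ x.reshape(C_in, -1)).reshape(C_out, H, W)
assert np.allclose(y, y2)
print(y.shape)  # (4, 6, 6) — channels changed, spatial dims untouched
```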
Depthwise separable convolutions
A standard 3×3 convolution with $C_\text{in} = C_\text{out} = C$ has $9 C^2$ weights. A depthwise separable convolution factors this into two cheaper operations:
- Depthwise: one 3×3 spatial filter per input channel ($9 C$ weights) — no channel mixing.
- Pointwise: a 1×1 convolution mixing channels ($C^2$ weights).
Total: $9 C + C^2$ weights instead of $9 C^2$ — roughly $9\times$ cheaper for large $C$. This factorization is the core insight of MobileNet and is widely used in on-device vision models, such as phone camera pipelines.
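A quick sketch of the parameter arithmetic, using $C = 256$ as an illustrative channel count:

```python
def depthwise_separable_params(C, K=3):
    standard  = K * K * C * C  # one KxK filter per (input, output) channel pair
    depthwise = K * K * C      # one KxK spatial filter per channel, no mixing
    pointwise = C * C          # 1x1 conv that mixes channels
    return standard, depthwise + pointwise

std_params, sep_params = depthwise_separable_params(C=256)
print(std_params, sep_params, round(std_params / sep_params, 1))
# 589824 67840 8.7 — close to the 9x bound for large C
```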
Source — CNN Convolutional Layer

Pseudocode
function conv2d_layer(input, W, b, stride=1, padding="valid"):
# input: N x C_in x H x W
# W: C_out x C_in x Kh x Kw (learned filters)
# b: C_out (bias per output channel)
# output: N x C_out x H' x W'
# H' = floor((H - Kh + 2*pad) / stride) + 1
for n in 0..N-1:
for c_out in 0..C_out-1:
for i in 0..H'-1:
for j in 0..W'-1:
s = b[c_out]
for c_in in 0..C_in-1:
for u in 0..Kh-1:
for v in 0..Kw-1:
s += W[c_out][c_in][u][v]
* input[n][c_in][i*stride+u][j*stride+v]
output[n][c_out][i][j] = s
return output

Python
import numpy as np
def conv2d_layer(x, W, b, stride=1, padding=0):
"""
Forward pass of a single Conv2D layer (no activation).
x: (N, C_in, H, W)
W: (C_out, C_in, Kh, Kw)
b: (C_out,)
stride, padding: integers
Returns: (N, C_out, H', W')
"""
N, C_in, H, Wx = x.shape
C_out, _, Kh, Kw = W.shape
if padding > 0:
x = np.pad(x, ((0,0),(0,0),(padding,padding),(padding,padding)))
H_in, W_in = x.shape[2], x.shape[3]
H_out = (H_in - Kh) // stride + 1
W_out = (W_in - Kw) // stride + 1
out = np.zeros((N, C_out, H_out, W_out))
for n in range(N):
for co in range(C_out):
for i in range(H_out):
for j in range(W_out):
patch = x[n, :, i*stride:i*stride+Kh, j*stride:j*stride+Kw]
out[n, co, i, j] = np.sum(W[co] * patch) + b[co]
return out

JavaScript
function conv2dLayer(x, W, b, stride = 1, padding = 0) {
// x: [N][C_in][H][W] W: [C_out][C_in][Kh][Kw] b: [C_out]
const N = x.length, Cin = x[0].length;
const Cout = W.length, Kh = W[0][0].length, Kw = W[0][0][0].length;
// zero-pad spatial dims
const xp = x.map(n => n.map(c => {
const H = c.length, Wp = c[0].length;
return Array.from({length: H+2*padding}, (_, i) =>
Array.from({length: Wp+2*padding}, (_, j) =>
(i>=padding && i<H+padding && j>=padding && j<Wp+padding)
? c[i-padding][j-padding] : 0));
}));
const H = xp[0][0].length, Ww = xp[0][0][0].length;
const Ho = Math.floor((H-Kh)/stride)+1, Wo = Math.floor((Ww-Kw)/stride)+1;
const out = Array.from({length:N},()=>Array.from({length:Cout},()=>
Array.from({length:Ho},()=>new Array(Wo).fill(0))));
for(let n=0;n<N;n++) for(let co=0;co<Cout;co++) for(let i=0;i<Ho;i++) for(let j=0;j<Wo;j++){
let s=b[co];
for(let ci=0;ci<Cin;ci++) for(let u=0;u<Kh;u++) for(let v=0;v<Kw;v++)
s+=W[co][ci][u][v]*xp[n][ci][i*stride+u][j*stride+v];
out[n][co][i][j]=s;
}
return out;
}

C
/* Fixed dims example: N=1, C_in=3, C_out=8, H=W=6, Kh=Kw=3, stride=1, no padding */
#include <string.h>
#define N 1
#define CIN 3
#define COUT 8
#define H 6
#define W 6
#define KH 3
#define KW 3
#define HO (H-KH+1)
#define WO (W-KW+1)
void conv2d_layer(
double x[N][CIN][H][W],
double weight[COUT][CIN][KH][KW],
double bias[COUT],
double out[N][COUT][HO][WO])
{
memset(out, 0, sizeof(double)*N*COUT*HO*WO);
for (int n=0;n<N;n++)
for (int co=0;co<COUT;co++)
for (int i=0;i<HO;i++)
for (int j=0;j<WO;j++) {
double s=bias[co];
for (int ci=0;ci<CIN;ci++)
for (int u=0;u<KH;u++)
for (int v=0;v<KW;v++)
s+=weight[co][ci][u][v]*x[n][ci][i+u][j+v];
out[n][co][i][j]=s;
}
}

C++
#include <vector>
using namespace std;
// x: [N][Cin][H][W] W4: [Cout][Cin][Kh][Kw] b: [Cout]
// returns [N][Cout][Ho][Wo]
using T4 = vector<vector<vector<vector<double>>>>;
T4 conv2dLayer(const T4& x, const T4& W4, const vector<double>& b,
int stride=1, int padding=0) {
int N=x.size(), Cin=x[0].size(), H=x[0][0].size(), Ww=x[0][0][0].size();
int Cout=W4.size(), Kh=W4[0][0].size(), Kw=W4[0][0][0].size();
// apply padding
T4 xp(N, vector<vector<vector<double>>>(Cin,
vector<vector<double>>(H+2*padding, vector<double>(Ww+2*padding,0))));
for(int n=0;n<N;n++) for(int c=0;c<Cin;c++)
for(int i=0;i<H;i++) for(int j=0;j<Ww;j++)
xp[n][c][i+padding][j+padding]=x[n][c][i][j];
int Ho=(H+2*padding-Kh)/stride+1, Wo=(Ww+2*padding-Kw)/stride+1;
T4 out(N,vector<vector<vector<double>>>(Cout,vector<vector<double>>(Ho,vector<double>(Wo,0))));
for(int n=0;n<N;n++) for(int co=0;co<Cout;co++) for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double s=b[co];
for(int ci=0;ci<Cin;ci++) for(int u=0;u<Kh;u++) for(int v=0;v<Kw;v++)
s+=W4[co][ci][u][v]*xp[n][ci][i*stride+u][j*stride+v];
out[n][co][i][j]=s;
}
return out;
}

Java
public class ConvLayer {
// x[n][cin][h][w] W[cout][cin][kh][kw] b[cout]
public static double[][][][] forward(
double[][][][] x, double[][][][] W, double[] b,
int stride, int padding) {
int N=x.length, Cin=x[0].length, H=x[0][0].length, Ww=x[0][0][0].length;
int Cout=W.length, Kh=W[0][0].length, Kw=W[0][0][0].length;
// pad
double[][][][] xp=new double[N][Cin][H+2*padding][Ww+2*padding];
for(int n=0;n<N;n++) for(int c=0;c<Cin;c++)
for(int i=0;i<H;i++) for(int j=0;j<Ww;j++)
xp[n][c][i+padding][j+padding]=x[n][c][i][j];
int Ho=(H+2*padding-Kh)/stride+1, Wo=(Ww+2*padding-Kw)/stride+1;
double[][][][] out=new double[N][Cout][Ho][Wo];
for(int n=0;n<N;n++) for(int co=0;co<Cout;co++) for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double s=b[co];
for(int ci=0;ci<Cin;ci++) for(int u=0;u<Kh;u++) for(int v=0;v<Kw;v++)
s+=W[co][ci][u][v]*xp[n][ci][i*stride+u][j*stride+v];
out[n][co][i][j]=s;
}
return out;
}
}

Go
package conv
// x [N][Cin][H][W] w [Cout][Cin][Kh][Kw] b [Cout]
// returns [N][Cout][Ho][Wo]
func Conv2DLayer(x [][][][]float64, w [][][][]float64, b []float64, stride, padding int) [][][][]float64 {
N, Cin, H, Ww := len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
Cout, Kh, Kw := len(w), len(w[0][0]), len(w[0][0][0])
// pad
Hp, Wp := H+2*padding, Ww+2*padding
xp := make([][][][]float64, N)
for n := range xp { xp[n]=make([][][]float64,Cin)
for c := range xp[n] { xp[n][c]=make([][]float64,Hp)
for i := range xp[n][c] { xp[n][c][i]=make([]float64,Wp) } }
for c:=0;c<Cin;c++ { for i:=0;i<H;i++ { for j:=0;j<Ww;j++ {
xp[n][c][i+padding][j+padding]=x[n][c][i][j] } } }
}
Ho, Wo := (H+2*padding-Kh)/stride+1, (Ww+2*padding-Kw)/stride+1
out := make([][][][]float64, N)
for n := range out { out[n]=make([][][]float64,Cout)
for co := range out[n] { out[n][co]=make([][]float64,Ho)
for i := range out[n][co] { out[n][co][i]=make([]float64,Wo) } }
for co:=0;co<Cout;co++ { for i:=0;i<Ho;i++ { for j:=0;j<Wo;j++ {
s:=b[co]
for ci:=0;ci<Cin;ci++ { for u:=0;u<Kh;u++ { for v:=0;v<Kw;v++ {
s+=w[co][ci][u][v]*xp[n][ci][i*stride+u][j*stride+v] } } }
out[n][co][i][j]=s } } }
}
return out
}

8. Pooling
Pooling downsamples a feature map, reducing spatial resolution while preserving the strongest responses. Unlike convolution, pooling has no learnable parameters. The two common variants:
Max pooling
Take a window (typically $2 \times 2$, stride 2) and output the maximum value. This preserves the strongest response in each region and adds a small amount of translation invariance. Dominant in the AlexNet/VGG era.
Average pooling
Output the mean of the window. Smoother than max pooling; sometimes used for global downsampling at the very end of a network (global average pooling = one output per channel).
Modern architectures (ResNet and after) often replace pooling with strided convolutions, on the theory that a learned downsampler is better than a fixed one. You'll still see max pooling in many production networks, though.
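Global average pooling, mentioned above, needs no windowing machinery at all — on a $(C, H, W)$ feature map it is a single reduction:

```python
import numpy as np

# Global average pooling: collapse each channel's spatial map to one number.
x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # (C, H, W) feature map
gap = x.mean(axis=(1, 2))                                # shape (C,)
print(gap)  # [ 7.5 23.5]
```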
Source — Max Pooling & Average Pooling

Pseudocode
function max_pool2d(x, pool_size, stride):
# x: H x W (single channel for clarity)
# output[i][j] = max of x[i*stride .. i*stride+pool_size-1]
# [j*stride .. j*stride+pool_size-1]
Ho = floor((H - pool_size) / stride) + 1
Wo = floor((W - pool_size) / stride) + 1
for i in 0..Ho-1:
for j in 0..Wo-1:
window = x[i*stride .. i*stride+pool_size,
j*stride .. j*stride+pool_size]
output[i][j] = max(window)
return output
function avg_pool2d(x, pool_size, stride):
# same structure, replace max with mean
output[i][j] = mean(window)

Python
import numpy as np
def max_pool2d(x, pool_size=2, stride=2):
"""
2D max pooling on a 2D feature map (H x W).
For multi-channel input (C x H x W) call per channel.
"""
H, W = x.shape
Ho = (H - pool_size) // stride + 1
Wo = (W - pool_size) // stride + 1
out = np.zeros((Ho, Wo))
for i in range(Ho):
for j in range(Wo):
window = x[i*stride:i*stride+pool_size, j*stride:j*stride+pool_size]
out[i, j] = window.max()
return out
def avg_pool2d(x, pool_size=2, stride=2):
H, W = x.shape
Ho = (H - pool_size) // stride + 1
Wo = (W - pool_size) // stride + 1
out = np.zeros((Ho, Wo))
for i in range(Ho):
for j in range(Wo):
window = x[i*stride:i*stride+pool_size, j*stride:j*stride+pool_size]
out[i, j] = window.mean()
return out
# Example
x = np.array([[1,3,2,4],[5,6,7,8],[9,2,3,1],[4,5,6,0]], dtype=float)
print(max_pool2d(x, 2, 2)) # [[6,8],[9,6]]
print(avg_pool2d(x, 2, 2)) # [[3.75,5.25],[5.0,2.5]]

JavaScript
function maxPool2d(x, poolSize = 2, stride = 2) {
const H = x.length, W = x[0].length;
const Ho = Math.floor((H - poolSize) / stride) + 1;
const Wo = Math.floor((W - poolSize) / stride) + 1;
return Array.from({length: Ho}, (_, i) =>
Array.from({length: Wo}, (_, j) => {
let max = -Infinity;
for (let u = 0; u < poolSize; u++)
for (let v = 0; v < poolSize; v++)
if (x[i*stride+u][j*stride+v] > max)
max = x[i*stride+u][j*stride+v];
return max;
}));
}
function avgPool2d(x, poolSize = 2, stride = 2) {
const H = x.length, W = x[0].length;
const Ho = Math.floor((H - poolSize) / stride) + 1;
const Wo = Math.floor((W - poolSize) / stride) + 1;
const n = poolSize * poolSize;
return Array.from({length: Ho}, (_, i) =>
Array.from({length: Wo}, (_, j) => {
let s = 0;
for (let u = 0; u < poolSize; u++)
for (let v = 0; v < poolSize; v++)
s += x[i*stride+u][j*stride+v];
return s / n;
}));
}

C
#include <float.h>
/* x[H][W] -> out[Ho][Wo] Ho=(H-P)/S+1 Wo=(W-P)/S+1 */
void max_pool2d(int H, int W, int P, int S, double x[H][W],
int Ho, int Wo, double out[Ho][Wo]) {
for (int i=0;i<Ho;i++) for(int j=0;j<Wo;j++) {
double mx=-DBL_MAX;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) {
double val=x[i*S+u][j*S+v];
if(val>mx) mx=val;
}
out[i][j]=mx;
}
}
void avg_pool2d(int H, int W, int P, int S, double x[H][W],
int Ho, int Wo, double out[Ho][Wo]) {
double n=(double)(P*P);
for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double s=0;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) s+=x[i*S+u][j*S+v];
out[i][j]=s/n;
}
}

C++
#include <vector>
#include <algorithm>
#include <numeric>
using namespace std;
using Mat = vector<vector<double>>;
Mat maxPool2d(const Mat& x, int P=2, int S=2) {
int H=x.size(), W=x[0].size();
int Ho=(H-P)/S+1, Wo=(W-P)/S+1;
Mat out(Ho, vector<double>(Wo));
for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double mx=-1e18;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) mx=max(mx,x[i*S+u][j*S+v]);
out[i][j]=mx;
}
return out;
}
Mat avgPool2d(const Mat& x, int P=2, int S=2) {
int H=x.size(), W=x[0].size();
int Ho=(H-P)/S+1, Wo=(W-P)/S+1; double n=P*P;
Mat out(Ho, vector<double>(Wo));
for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double s=0;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) s+=x[i*S+u][j*S+v];
out[i][j]=s/n;
}
return out;
}

Java
public class Pooling {
public static double[][] maxPool2d(double[][] x, int P, int S) {
int H=x.length, W=x[0].length;
int Ho=(H-P)/S+1, Wo=(W-P)/S+1;
double[][] out=new double[Ho][Wo];
for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double mx=Double.NEGATIVE_INFINITY;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) mx=Math.max(mx,x[i*S+u][j*S+v]);
out[i][j]=mx;
}
return out;
}
public static double[][] avgPool2d(double[][] x, int P, int S) {
int H=x.length, W=x[0].length;
int Ho=(H-P)/S+1, Wo=(W-P)/S+1; double n=P*P;
double[][] out=new double[Ho][Wo];
for(int i=0;i<Ho;i++) for(int j=0;j<Wo;j++){
double s=0;
for(int u=0;u<P;u++) for(int v=0;v<P;v++) s+=x[i*S+u][j*S+v];
out[i][j]=s/n;
}
return out;
}
}

Go
package conv
import "math"
func MaxPool2D(x [][]float64, P, S int) [][]float64 {
H, W := len(x), len(x[0])
Ho, Wo := (H-P)/S+1, (W-P)/S+1
out := make([][]float64, Ho)
for i := range out {
out[i] = make([]float64, Wo)
for j := 0; j < Wo; j++ {
mx := math.Inf(-1)
for u := 0; u < P; u++ { for v := 0; v < P; v++ {
if x[i*S+u][j*S+v] > mx { mx = x[i*S+u][j*S+v] } } }
out[i][j] = mx
}
}
return out
}
func AvgPool2D(x [][]float64, P, S int) [][]float64 {
H, W := len(x), len(x[0])
Ho, Wo := (H-P)/S+1, (W-P)/S+1
n := float64(P * P)
out := make([][]float64, Ho)
for i := range out {
out[i] = make([]float64, Wo)
for j := 0; j < Wo; j++ {
s := 0.0
for u := 0; u < P; u++ { for v := 0; v < P; v++ { s += x[i*S+u][j*S+v] } }
out[i][j] = s / n
}
}
return out
}

9. Why learned filters beat hand-crafted
Visualize the filters in the first layer of a CNN trained on natural images and you will see something striking: many of them look like Gabor filters (oriented edge and blob detectors) — the same filters that 1960s neuroscientists found in the primary visual cortex of cats. The network rediscovers, from data alone, the primitives that biology evolved and that computer vision engineers hand-coded for decades.
Deeper layers respond to more complex patterns: textures, parts, objects. This hierarchy is not programmed; it emerges from stacking simple convolutions and training end-to-end on a classification objective. This is the shift from 2012 onward: instead of designing features, we design architectures and let gradient descent learn the features.
A CNN's receptive field grows linearly with depth in a plain conv-stack and exponentially with dilated convolutions. By layer 10 or so, a neuron "sees" most of the image — but through a hierarchy of simple pattern detectors, not a single giant matrix multiplication.
10. A timeline of classic CNNs
- LeNet-5 (LeCun, 1998) — 7 layers, ~60K parameters. Recognized zip-code digits for the US Postal Service. Architecture: conv → pool → conv → pool → FC → FC → softmax. Everything after is a variation on this theme.
- AlexNet (Krizhevsky, 2012) — 8 layers, ~60M parameters. Won ImageNet 2012 by ten points, kicked off the deep learning revolution. Key tricks: ReLU, dropout, GPU training, data augmentation.
- VGG-16/19 (Simonyan & Zisserman, 2014) — demonstrated that depth alone buys accuracy. Simple recipe: stacks of 3×3 convs and 2×2 maxpools, nothing fancy. ~140M parameters.
- GoogLeNet / Inception (Szegedy et al., 2014) — introduced the Inception module: parallel 1×1, 3×3, 5×5 convs concatenated along the channel axis. Much more parameter-efficient than VGG.
- ResNet (He et al., 2015) — residual connections ($h_{l+1} = h_l + f(h_l)$) let the network learn identity mappings by default, enabling training of 152-layer networks that had previously been untrainable. The single most influential architecture change of the decade.
- DenseNet (Huang et al., 2017) — every layer receives input from every previous layer (concatenation instead of addition). Efficient reuse of features.
- EfficientNet (Tan & Le, 2019) — principled scaling of depth, width, and resolution together. State-of-the-art accuracy per FLOP for image classification.
- Vision Transformers (2020) — discard convolution entirely and process images as sequences of patches with self-attention. For a while everyone thought this was the end of CNNs; it's actually shaping up to be a convergence: modern hybrids use convolutions early and attention later.
11. Summary
A convolution slides a small learnable filter across an image, computing a dot product at each position. Because the filter weights are shared across all locations, convolutions encode locality and translation invariance directly into the architecture — dramatically reducing parameter count compared to fully-connected layers. Stacked convolutional layers build hierarchical features: early layers detect edges and textures, deeper layers detect parts and objects. The whole thing is trained end-to-end with backpropagation, and it is the reason computers can see.
Where to go next
- → Backpropagation — how gradients flow through a convolutional layer (spoiler: it's also a convolution).
- → Self-Attention — the architecture that started eating CNNs' lunch.
- Stanford CS231n: Convolutional Networks — the classic online course module.
- Distill: Feature Visualization — what filters in a trained CNN actually detect.
- A Guide to Convolution Arithmetic for Deep Learning — the definitive reference for output shape formulas.