The Perceptron
The first learning algorithm that actually worked on a machine. Built in 1957, misunderstood by 1969, and still the atom from which every neural network is built today.
1. A neuron, shrunk to math
A biological neuron has thousands of dendrites that collect electrical signals from neighboring cells. If the total incoming signal crosses a threshold, the neuron fires: it sends a spike down its axon to downstream neurons. If the total doesn't clear the threshold, it stays silent.
That is the entire biological story the early computer scientists needed. Everything else about neurons — the chemistry, the dendrite morphology, the spike timing — gets abstracted away. What's left is almost embarrassingly simple: multiply each input by a weight, add the results, and output 1 if the sum clears a threshold, 0 otherwise.
That abstract neuron is the perceptron. It turns out to be enough to learn to classify data — if the classes are linearly separable. It also turns out not to be enough for anything more complicated, and that fact triggered the first AI winter. Both halves of that story are worth knowing.
2. McCulloch–Pitts (1943)
Before Rosenblatt built his machine, Warren McCulloch (a neurophysiologist) and Walter Pitts (a self-taught logician) published A Logical Calculus of the Ideas Immanent in Nervous Activity in 1943. Their model had no learning at all — it was a purely descriptive device — but it established that a network of threshold units could implement any Boolean logic circuit.
A unit receives binary inputs $x_1, \dots, x_n \in \{0,1\}$. Each input has a fixed weight of $+1$ (excitatory) or $-\infty$ (inhibitory). The neuron fires (outputs 1) if and only if no inhibitory input is active AND the sum of excitatory inputs reaches a fixed threshold $\theta$.
You could wire these things up by hand to compute AND, OR, NOT, and any composition thereof. McCulloch and Pitts even showed you could build a rudimentary memory cell. What they did not show was how a network could learn its own weights. That would take fourteen more years.
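A McCulloch–Pitts unit is small enough to write out directly. The sketch below (the names `mp_neuron`, `AND`, `OR`, `NOT` are my own) hand-wires the three basic gates exactly as described above: an inhibitory input vetoes the output, and otherwise the unit fires when enough excitatory inputs are active.

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts unit over binary inputs with fixed threshold theta.
    Fires (returns 1) iff no inhibitory input is active and the number
    of active excitatory inputs is at least theta."""
    if any(inhibitory):
        return 0
    return 1 if sum(excitatory) >= theta else 0

# Hand-wired gates:
def AND(a, b): return mp_neuron([a, b], [], theta=2)
def OR(a, b):  return mp_neuron([a, b], [], theta=1)
def NOT(a):    return mp_neuron([1], [a], theta=1)  # constant excitatory input; a inhibits
```

Note that the weights here are fixed by hand, which is exactly the point: the model computes logic but does not learn.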
3. Rosenblatt's perceptron (1957)
In 1957, Frank Rosenblatt at the Cornell Aeronautical Laboratory built the Mark I Perceptron — a physical machine, not a simulation. It had a 20×20 grid of photocells as its "retina," feeding weighted sums into electromechanical relays. The weights were literally potentiometers: little knobs that motors could turn to adjust the strength of each input. When the perceptron made a classification mistake, the motors adjusted the right knobs by a small amount, nudging the machine toward the correct answer. That automatic weight-adjustment procedure — now called the perceptron learning rule — is what turned the McCulloch–Pitts neuron from a description of logic into a learner.
The New York Times ran a story on July 8, 1958 with the headline "New Navy Device Learns by Doing." Rosenblatt told reporters the perceptron would eventually be able to "walk, talk, see, write, reproduce itself and be conscious of its existence." The press ran with this. It would not age well.
The perceptron was the first piece of hardware that could adjust its own parameters in response to feedback. Every neural network, every deep learning model, every LLM — all of it descends from that one trick: "when you're wrong, nudge the weights in the direction that would have been right."
4. The model, precisely
Strip away the hardware and the perceptron is this function:

$$f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)$$

where $\mathbf{x} \in \mathbb{R}^n$ is a feature vector, $\mathbf{w} \in \mathbb{R}^n$ is a weight vector, and $b \in \mathbb{R}$ is a bias (equivalently, the negative of a threshold). The dot product $\mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i$ is the weighted sum.
The function that maps the weighted sum to $\{0, 1\}$ is called the activation function. Rosenblatt used the step function (a.k.a. the Heaviside function):

$$\text{step}(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \le 0 \end{cases}$$
Later networks replaced the step function with smoother activations (sigmoid, tanh, ReLU) precisely because the step function is not differentiable — and once you want gradient-based training (i.e. backpropagation), non-differentiable activations become a blocker.
Let $\mathbf{w} = [2,\ -1]$, $b = -1$, $\mathbf{x} = [1.5,\ 0.5]$.
$\mathbf{w} \cdot \mathbf{x} + b = (2)(1.5) + (-1)(0.5) + (-1) = 3 - 0.5 - 1 = 1.5$
$1.5 > 0$, so $f(\mathbf{x}) = 1$. The perceptron classifies $[1.5, 0.5]$ as class 1.
Source — Perceptron Forward Pass
Pseudocode

function predict(x, w, b):
    z = dot(w, x) + b
    if z > 0:
        return +1
    else:
        return -1

Python

def predict(x, w, b):
    """Perceptron forward pass.
    x, w: lists or arrays of the same length
    b: scalar bias
    Returns +1 or -1.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else -1

JavaScript

function predict(x, w, b) {
  // x, w: arrays of the same length; b: scalar bias
  const z = w.reduce((acc, wi, i) => acc + wi * x[i], 0) + b;
  return z > 0 ? 1 : -1;
}

C

/* Returns +1 or -1. n is the number of features. */
int predict(const float *x, const float *w, float b, int n) {
    float z = b;
    for (int i = 0; i < n; i++) z += w[i] * x[i];
    return z > 0.0f ? 1 : -1;
}

C++

#include <vector>

// Returns +1 or -1.
int predict(const std::vector<float>& x,
            const std::vector<float>& w,
            float b) {
    float z = b;
    for (size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
    return z > 0.0f ? 1 : -1;
}

Java

public class Perceptron {
    /** Returns +1 or -1. */
    public static int predict(double[] x, double[] w, double b) {
        double z = b;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return z > 0.0 ? 1 : -1;
    }
}

Go

// Predict returns +1 or -1 for the perceptron forward pass.
func Predict(x, w []float64, b float64) int {
    z := b
    for i, wi := range w {
        z += wi * x[i]
    }
    if z > 0 {
        return 1
    }
    return -1
}

5. Geometric picture
Here is the geometric observation that makes a perceptron both powerful and limited: the equation $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines a hyperplane in $\mathbb{R}^n$. A perceptron outputs 1 on one side of that hyperplane and 0 on the other. It is, literally, drawing a line and calling one side "yes" and the other side "no."
In $n$ dimensions, "line" becomes "hyperplane" — an $(n-1)$-dimensional flat surface. In 3D, the perceptron's decision surface is an ordinary 2D plane. In $\mathbb{R}^{784}$ (the space of 28×28 MNIST digits), it's a 783-dimensional hyperplane. The math is the same; only the dimension count changes.
A dataset $\{(\mathbf{x}^{(i)}, y^{(i)})\}$ with $y^{(i)} \in \{0, 1\}$ is linearly separable if there exist $\mathbf{w}, b$ such that $\mathbf{w} \cdot \mathbf{x}^{(i)} + b > 0$ whenever $y^{(i)} = 1$ and $\le 0$ whenever $y^{(i)} = 0$. Equivalently: there is some hyperplane that puts all the class-1 examples on one side and all the class-0 examples on the other.
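To make the definition concrete, here is a small check (the helper name `separates` is my own) that a candidate $(\mathbf{w}, b)$ separates a labeled dataset in exactly this sense:

```python
def separates(w, b, X, y):
    """True iff w.x + b > 0 for every class-1 point and <= 0 for every class-0 point."""
    for xi, yi in zip(X, y):
        z = sum(wi * xij for wi, xij in zip(w, xi)) + b
        if (yi == 1) != (z > 0):
            return False
    return True

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
# AND is linearly separable:
assert separates([1, 1], -1.5, X, [0, 0, 0, 1])
# ...but this (w, b) fails on XOR labels, as does every other choice:
assert not separates([1, 1], -0.5, X, [0, 1, 1, 0])
```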
A perceptron can only solve linearly separable problems. That is both its charm and its death sentence.
6. The learning rule
Here is the genius of Rosenblatt's contribution: a procedure that provably finds a separating hyperplane, if one exists, using only local updates to each weight.
Cycle through the training examples $(\mathbf{x}^{(i)}, y^{(i)})$. For each one, compute the prediction $\hat{y} = \text{step}(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$.
- If $\hat{y} = y^{(i)}$: do nothing. The perceptron got it right.
- If $\hat{y} \ne y^{(i)}$: update $$\mathbf{w} \leftarrow \mathbf{w} + \eta \, (y^{(i)} - \hat{y}) \, \mathbf{x}^{(i)}$$ $$b \leftarrow b + \eta \, (y^{(i)} - \hat{y})$$
Here $\eta > 0$ is the learning rate. Repeat until no more mistakes are made on a full pass through the data.
The term $(y^{(i)} - \hat{y})$ is $+1$ when the perceptron said 0 but should have said 1, and $-1$ in the opposite case. In both cases we nudge $\mathbf{w}$ in the direction of $\mathbf{x}^{(i)}$ (positively or negatively), which rotates the decision boundary so that the next time we see this example we're more likely to get it right.
Start with $\mathbf{w} = [0, 0]$, $b = 0$, $\eta = 1$.
Training example: $\mathbf{x} = [1, 2]$, $y = 1$.
Prediction: $\mathbf{w} \cdot \mathbf{x} + b = 0$, so $\hat{y} = \text{step}(0) = 0$. Wrong.
Update: $\mathbf{w} \leftarrow [0,0] + 1 \cdot (1 - 0) \cdot [1, 2] = [1, 2]$.
$b \leftarrow 0 + 1 \cdot (1 - 0) = 1$.
Re-check: $\mathbf{w} \cdot \mathbf{x} + b = 1 + 4 + 1 = 6 > 0$, so $\hat{y} = 1$. Now correct.
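The hand computation above can be replayed in a few lines. This is a sketch of the $\{0,1\}$-label form of the rule from this section (the helper names `step` and `update` are my own):

```python
def step(z):
    return 1 if z > 0 else 0

def update(w, b, x, y, lr=1):
    """One perceptron update on a single example, labels in {0, 1}."""
    yhat = step(sum(wi * xi for wi, xi in zip(w, x)) + b)
    err = y - yhat                                  # +1, 0, or -1
    w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    b = b + lr * err
    return w, b

w, b = update([0, 0], 0, [1, 2], 1)
print(w, b)   # [1, 2] 1 -- matches the worked example
```

Running `update` again on the same example leaves $(\mathbf{w}, b)$ unchanged, since the prediction is now correct.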
Interactive: watch a perceptron learn a line
Below you can step through the perceptron learning rule on a 2D toy dataset. Press Step → to process one example at a time. The decision boundary (dashed pink line) rotates after each misclassification. Edit the learning rate to see how the updates change.
Starting weights: $\mathbf{w} = [0, 0]$, $b = 0$. The horizontal dashed line is the initial decision boundary. Blue dots are class 1 and gray dots are class 0.
Source — Perceptron Learning Rule
Pseudocode

function train(X, y, lr, epochs):
    n = length(X[0])                  // number of features
    w = zeros(n)
    b = 0
    for epoch in 1..epochs:
        mistakes = 0
        for each (xi, yi) in (X, y):
            yhat = predict(xi, w, b)  // +1 or -1
            if yhat != yi:
                w = w + lr * yi * xi
                b = b + lr * yi
                mistakes += 1
        if mistakes == 0:
            break                     // converged
    return w, b

Python

def train(X, y, lr=1.0, epochs=100):
    """Perceptron learning rule.
    X: list of feature vectors (lists)
    y: list of labels in {+1, -1}
    Returns (w, b) after training.
    """
    n = len(X[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            yhat = 1 if sum(wi * xij for wi, xij in zip(w, xi)) + b > 0 else -1
            if yhat != yi:
                w = [wi + lr * yi * xij for wi, xij in zip(w, xi)]
                b += lr * yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

JavaScript

function train(X, y, lr = 1.0, epochs = 100) {
  const n = X[0].length;
  let w = new Array(n).fill(0);
  let b = 0;
  for (let e = 0; e < epochs; e++) {
    let mistakes = 0;
    for (let i = 0; i < X.length; i++) {
      const z = w.reduce((s, wi, j) => s + wi * X[i][j], 0) + b;
      const yhat = z > 0 ? 1 : -1;
      if (yhat !== y[i]) {
        w = w.map((wi, j) => wi + lr * y[i] * X[i][j]);
        b += lr * y[i];
        mistakes++;
      }
    }
    if (mistakes === 0) break;
  }
  return { w, b };
}

C

#include <string.h>

/* X: m*n row-major; y: m labels (+1/-1); w_out: n weights; b_out: bias */
void train(const float *X, const int *y, int m, int n,
           float lr, int epochs, float *w_out, float *b_out) {
    float w[n];                       /* C99 variable-length array */
    memset(w, 0, sizeof(float) * n);
    float b = 0.0f;
    for (int e = 0; e < epochs; e++) {
        int mistakes = 0;
        for (int i = 0; i < m; i++) {
            float z = b;
            for (int j = 0; j < n; j++) z += w[j] * X[i*n + j];
            int yhat = z > 0.0f ? 1 : -1;
            if (yhat != y[i]) {
                for (int j = 0; j < n; j++)
                    w[j] += lr * y[i] * X[i*n + j];
                b += lr * y[i];
                mistakes++;
            }
        }
        if (mistakes == 0) break;
    }
    for (int j = 0; j < n; j++) w_out[j] = w[j];
    *b_out = b;
}

C++

#include <utility>
#include <vector>

std::pair<std::vector<float>, float>
train(const std::vector<std::vector<float>>& X,
      const std::vector<int>& y,
      float lr = 1.0f, int epochs = 100) {
    int n = X[0].size();
    std::vector<float> w(n, 0.0f);
    float b = 0.0f;
    for (int e = 0; e < epochs; e++) {
        int mistakes = 0;
        for (size_t i = 0; i < X.size(); i++) {
            float z = b;
            for (int j = 0; j < n; j++) z += w[j] * X[i][j];
            int yhat = z > 0.0f ? 1 : -1;
            if (yhat != y[i]) {
                for (int j = 0; j < n; j++)
                    w[j] += lr * y[i] * X[i][j];
                b += lr * y[i];
                mistakes++;
            }
        }
        if (mistakes == 0) break;
    }
    return {w, b};
}

Java

public class Perceptron {
    public double[] w;
    public double b;

    public void train(double[][] X, int[] y, double lr, int epochs) {
        int m = X.length, n = X[0].length;
        w = new double[n];
        b = 0.0;
        for (int e = 0; e < epochs; e++) {
            int mistakes = 0;
            for (int i = 0; i < m; i++) {
                double z = b;
                for (int j = 0; j < n; j++) z += w[j] * X[i][j];
                int yhat = z > 0 ? 1 : -1;
                if (yhat != y[i]) {
                    for (int j = 0; j < n; j++)
                        w[j] += lr * y[i] * X[i][j];
                    b += lr * y[i];
                    mistakes++;
                }
            }
            if (mistakes == 0) break;
        }
    }
}

Go

// Train runs the perceptron learning rule.
// X is m x n, y contains +1/-1 labels.
// Returns weights w and bias b.
func Train(X [][]float64, y []int, lr float64, epochs int) ([]float64, float64) {
    n := len(X[0])
    w := make([]float64, n)
    var b float64
    for e := 0; e < epochs; e++ {
        mistakes := 0
        for i, xi := range X {
            z := b
            for j, wj := range w {
                z += wj * xi[j]
            }
            yhat := 1
            if z <= 0 {
                yhat = -1
            }
            if yhat != y[i] {
                for j := range w {
                    w[j] += lr * float64(y[i]) * xi[j]
                }
                b += lr * float64(y[i])
                mistakes++
            }
        }
        if mistakes == 0 {
            break
        }
    }
    return w, b
}

7. The convergence theorem
Rosenblatt and others proved something remarkable about this simple rule: if the data is linearly separable, the perceptron learning rule will find a separating hyperplane in a finite number of updates. You don't just hope it will work; you can bound how long it will take.
Let $\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$ be a dataset with $y^{(i)} \in \{-1, +1\}$ and $\|\mathbf{x}^{(i)}\| \le R$ for all $i$. Suppose there exists a unit vector $\mathbf{w}^*$ and a positive margin $\gamma > 0$ such that for all $i$:

$$y^{(i)} \, (\mathbf{w}^* \cdot \mathbf{x}^{(i)}) \ge \gamma$$
Then the number of updates $k$ the perceptron algorithm makes, starting from $\mathbf{w} = 0$, is at most

$$k \le \left(\frac{R}{\gamma}\right)^2$$
Let $\mathbf{w}_k$ be the weight vector after $k$ updates. We track two quantities.
Lower bound on $\mathbf{w}_k \cdot \mathbf{w}^*$: each update adds $y \mathbf{x}$ to $\mathbf{w}$, and by the margin condition $y \mathbf{x} \cdot \mathbf{w}^* \ge \gamma$. So $\mathbf{w}_k \cdot \mathbf{w}^* \ge k \gamma$.
Upper bound on $\|\mathbf{w}_k\|^2$: updates happen only when we make a mistake, i.e. when $y(\mathbf{w}_{k-1} \cdot \mathbf{x}) \le 0$. So $\|\mathbf{w}_k\|^2 = \|\mathbf{w}_{k-1}\|^2 + 2 y \mathbf{w}_{k-1} \cdot \mathbf{x} + \|\mathbf{x}\|^2 \le \|\mathbf{w}_{k-1}\|^2 + R^2$. Hence $\|\mathbf{w}_k\|^2 \le k R^2$.
Combining the two bounds with the Cauchy–Schwarz inequality (since $\mathbf{w}^*$ is a unit vector, $\mathbf{w}_k \cdot \mathbf{w}^* \le \|\mathbf{w}_k\| \, \|\mathbf{w}^*\| = \|\mathbf{w}_k\|$):

$k \gamma \le \mathbf{w}_k \cdot \mathbf{w}^* \le \|\mathbf{w}_k\| \le \sqrt{k}\, R$
Dividing both sides by $\sqrt{k}$ and squaring: $k \le (R/\gamma)^2$. $\blacksquare$
Notice what the bound depends on: the ratio of the data's "size" $R$ to the margin $\gamma$. Wide margin (easy problem) → few updates. Narrow margin (hard problem) → many updates. Non-separable data → the bound does not apply, and the perceptron cycles forever.
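The bound is easy to sanity-check empirically. The sketch below (helper names are my own) runs the bias-free, $\{+1,-1\}$-label form of the algorithm that the theorem is stated for, counts its updates on a small separable dataset, and compares against $(R/\gamma)^2$ computed from a known unit-norm separator:

```python
import math

def perceptron_updates(X, y, epochs=1000):
    """Perceptron without bias (hyperplane through the origin),
    labels in {+1, -1}. Returns the total number of updates made."""
    w = [0.0] * len(X[0])
    updates = 0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            z = sum(wi * xij for wi, xij in zip(w, xi))
            if yi * z <= 0:                         # mistake (or on the boundary)
                w = [wi + yi * xij for wi, xij in zip(w, xi)]
                mistakes += 1
                updates += 1
        if mistakes == 0:
            break
    return updates

X = [[2, 1], [1, 2], [-1, -2], [-2, -1]]
y = [1, 1, -1, -1]
k = perceptron_updates(X, y)

R = max(math.hypot(xi[0], xi[1]) for xi in X)
w_star = [1 / math.sqrt(2), 1 / math.sqrt(2)]       # a known unit-norm separator
gamma = min(yi * (w_star[0] * xi[0] + w_star[1] * xi[1]) for xi, yi in zip(X, y))
print(k, "updates; bound:", (R / gamma) ** 2)
```

On this easy, wide-margin dataset the update count sits comfortably under the bound; shrinking the margin (moving the two classes closer together) drives both numbers up.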
8. The XOR problem
Consider the four points in 2D corresponding to the XOR truth table:

| $x_1$ | $x_2$ | $y = x_1 \oplus x_2$ |
|-------|-------|----------------------|
| 0     | 0     | 0                    |
| 0     | 1     | 1                    |
| 1     | 0     | 1                    |
| 1     | 1     | 0                    |
Draw these four points. The $y = 1$ points are on one diagonal, the $y = 0$ points are on the other. There is no single straight line you can draw in the plane that puts $(0,1)$ and $(1,0)$ on one side and $(0,0)$ and $(1,1)$ on the other. XOR is not linearly separable.
But the XOR function can be computed as (A OR B) AND NOT (A AND B). Both AND and OR are linearly separable. So if you compose two layers of perceptrons — one hidden layer computing AND and OR, one output layer combining them — you can represent XOR. This is exactly what a two-layer neural network does, and it's the first hint that depth matters.
Hidden unit 1: $h_1 = \text{step}(x_1 + x_2 - 0.5)$ (this is OR)
Hidden unit 2: $h_2 = \text{step}(x_1 + x_2 - 1.5)$ (this is AND)
Output: $\hat{y} = \text{step}(h_1 - h_2 - 0.5)$ (this is "$h_1$ but not $h_2$")
Verify all four inputs: $(0,0) \to 0$, $(0,1) \to 1$, $(1,0) \to 1$, $(1,1) \to 0$. Exactly XOR.
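That construction is short enough to verify in code. The following is a direct transcription of the three units above (the function name `xor_net` is my own):

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    """Two-layer step network: OR and AND in the hidden layer,
    'OR but not AND' at the output."""
    h1 = step(x1 + x2 - 0.5)    # OR
    h2 = step(x1 + x2 - 1.5)    # AND
    return step(h1 - h2 - 0.5)  # h1 AND NOT h2

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```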
Source — MLP Forward Pass
Pseudocode

// 2-layer MLP: input -> hidden -> output
// sigmoid(z) = 1 / (1 + exp(-z))
function mlp_forward(x, W1, b1, W2, b2):
    // Hidden layer
    z1 = matmul(W1, x) + b1
    a1 = sigmoid(z1)        // element-wise sigmoid
    // Output layer
    z2 = matmul(W2, a1) + b2
    a2 = sigmoid(z2)        // final activation
    return a2

Python

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(Wij * xj for Wij, xj in zip(row, x)) for row in W]

def vec_add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def mlp_forward(x, W1, b1, W2, b2):
    """2-layer MLP with sigmoid activations.
    x: input vector (n,)
    W1: hidden weight matrix (h, n)
    b1: hidden bias (h,)
    W2: output weight matrix (o, h)
    b2: output bias (o,)
    Returns output vector (o,).
    """
    z1 = vec_add(matvec(W1, x), b1)
    a1 = [sigmoid(z) for z in z1]
    z2 = vec_add(matvec(W2, a1), b2)
    a2 = [sigmoid(z) for z in z2]
    return a2

JavaScript

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// W: 2-D array [rows][cols], x: 1-D array
function matvec(W, x) {
  return W.map(row => row.reduce((s, w, j) => s + w * x[j], 0));
}

function mlpForward(x, W1, b1, W2, b2) {
  const z1 = matvec(W1, x).map((v, i) => v + b1[i]);
  const a1 = z1.map(sigmoid);
  const z2 = matvec(W2, a1).map((v, i) => v + b2[i]);
  const a2 = z2.map(sigmoid);
  return a2;
}

C

#include <math.h>

static float sigmoid(float z) { return 1.0f / (1.0f + expf(-z)); }

/* 2-layer MLP forward pass.
   n=input dim, h=hidden dim, o=output dim.
   W1[h*n], b1[h], W2[o*h], b2[o], a2_out[o]. */
void mlp_forward(const float *x, int n,
                 const float *W1, const float *b1, int h,
                 const float *W2, const float *b2, int o,
                 float *a2_out) {
    float a1[h];                      /* C99 variable-length array */
    /* Hidden layer */
    for (int i = 0; i < h; i++) {
        float z = b1[i];
        for (int j = 0; j < n; j++) z += W1[i*n + j] * x[j];
        a1[i] = sigmoid(z);
    }
    /* Output layer */
    for (int i = 0; i < o; i++) {
        float z = b2[i];
        for (int j = 0; j < h; j++) z += W2[i*h + j] * a1[j];
        a2_out[i] = sigmoid(z);
    }
}

C++

#include <cmath>
#include <vector>

static float sigmoid(float z) { return 1.0f / (1.0f + std::exp(-z)); }

std::vector<float> mlpForward(
    const std::vector<float>& x,
    const std::vector<std::vector<float>>& W1, const std::vector<float>& b1,
    const std::vector<std::vector<float>>& W2, const std::vector<float>& b2) {
    // Hidden layer
    std::vector<float> a1(W1.size());
    for (size_t i = 0; i < W1.size(); i++) {
        float z = b1[i];
        for (size_t j = 0; j < x.size(); j++) z += W1[i][j] * x[j];
        a1[i] = sigmoid(z);
    }
    // Output layer
    std::vector<float> a2(W2.size());
    for (size_t i = 0; i < W2.size(); i++) {
        float z = b2[i];
        for (size_t j = 0; j < a1.size(); j++) z += W2[i][j] * a1[j];
        a2[i] = sigmoid(z);
    }
    return a2;
}

Java

public class MLP {
    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** 2-layer MLP forward pass.
     *  W1[h][n], b1[h], W2[o][h], b2[o]. Returns a2[o]. */
    public static double[] forward(double[] x,
                                   double[][] W1, double[] b1,
                                   double[][] W2, double[] b2) {
        int h = W1.length, o = W2.length;
        double[] a1 = new double[h];
        for (int i = 0; i < h; i++) {
            double z = b1[i];
            for (int j = 0; j < x.length; j++) z += W1[i][j] * x[j];
            a1[i] = sigmoid(z);
        }
        double[] a2 = new double[o];
        for (int i = 0; i < o; i++) {
            double z = b2[i];
            for (int j = 0; j < h; j++) z += W2[i][j] * a1[j];
            a2[i] = sigmoid(z);
        }
        return a2;
    }
}

Go

import "math"

func sigmoid(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }

// MLPForward runs a 2-layer MLP (hidden + output) with sigmoid activations.
// W1 is h x n, W2 is o x h.
func MLPForward(x []float64, W1 [][]float64, b1 []float64,
    W2 [][]float64, b2 []float64) []float64 {
    h := len(W1)
    a1 := make([]float64, h)
    for i, row := range W1 {
        z := b1[i]
        for j, wij := range row {
            z += wij * x[j]
        }
        a1[i] = sigmoid(z)
    }
    o := len(W2)
    a2 := make([]float64, o)
    for i, row := range W2 {
        z := b2[i]
        for j, wij := range row {
            z += wij * a1[j]
        }
        a2[i] = sigmoid(z)
    }
    return a2
}

9. The winter of 1969
In 1969, Marvin Minsky and Seymour Papert (MIT) published Perceptrons: An Introduction to Computational Geometry. The book proved rigorously that a single-layer perceptron cannot learn XOR (or, more generally, any function that is not linearly separable). It also pointed out that many naturally interesting predicates — like whether a set of pixels forms a connected shape — fall outside what a single layer can represent.
The book's conclusions about multi-layer perceptrons were more careful than the popular retelling suggests. Minsky and Papert explicitly noted that deeper networks should, in principle, be much more powerful — but worried that no learning procedure was known for them. (Backpropagation existed in various forms, but nobody yet appreciated that it was the solution.)
The funding agencies took the pessimistic reading. The Mansfield Amendment of 1969 required military research to have clear defense applications; the DARPA AI budget was slashed; and between roughly 1969 and 1980, neural network research all but stopped in the United States. We now call this the first AI winter. It didn't end until backpropagation was rediscovered and popularized in 1986 by Rumelhart, Hinton, and Williams — and even then, it took another two decades before deep networks routinely outperformed classical methods.
The perceptron was not wrong. The limitation was real — a single layer really cannot learn XOR — but the field over-generalized from "this architecture has limits" to "neural networks are a dead end." A more complete reading would have said: "we need more layers, and we need to learn how to train them." Both obstacles were eventually removed, but at the cost of a lost decade.
10. Why it still matters
Modern deep learning does not use step functions or the perceptron learning rule. It uses differentiable activations and gradient descent via backprop. But every single neuron in every layer of every modern network is, at heart, still doing the same thing: taking a weighted sum of its inputs and passing it through a nonlinearity. The formula $\mathbf{w} \cdot \mathbf{x} + b$ is literally the first line of code in every layer of every network, everywhere.
More importantly, the perceptron established three ideas that all of deep learning builds on:
- Loss-driven adjustment. Compare the model's output to a target, use the error to change the parameters. Every training loop does this.
- Locally simple, globally powerful. Each unit does almost nothing. Stack enough of them together and you can learn to classify handwritten digits, translate languages, generate images.
- Geometry matters. Linear separability in the raw feature space is a strong constraint. Deep networks can be seen as learning a new representation in which the data becomes linearly separable — the final layer is still essentially a perceptron, but the features it sees have been transformed by all the layers below it.
Source — Activation Functions
Pseudocode

// All functions applied element-wise to a scalar z.
sigmoid(z) = 1 / (1 + exp(-z))          // output in (0, 1)
tanh(z)    = (exp(z) - exp(-z)) /
             (exp(z) + exp(-z))         // output in (-1, 1)
relu(z)    = max(0, z)                  // output in [0, +inf)
leaky_relu(z, alpha=0.01)
           = z        if z >= 0
             alpha*z  if z < 0          // alpha << 1, small negative slope

Python

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_act(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return z if z >= 0 else alpha * z

# --- Derivatives (needed for backprop) ---

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_deriv(z):
    return 1.0 - math.tanh(z) ** 2

def relu_deriv(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_deriv(z, alpha=0.01):
    return 1.0 if z >= 0 else alpha

JavaScript

const sigmoid = z => 1 / (1 + Math.exp(-z));
const tanhAct = z => Math.tanh(z);
const relu = z => Math.max(0, z);
const leakyRelu = (z, alpha = 0.01) => z >= 0 ? z : alpha * z;

// Derivatives
const sigmoidD = z => { const s = sigmoid(z); return s * (1 - s); };
const tanhD = z => 1 - Math.tanh(z) ** 2;
const reluD = z => z > 0 ? 1 : 0;
const leakyReluD = (z, alpha = 0.01) => z >= 0 ? 1 : alpha;

C

#include <math.h>

float sigmoid(float z) { return 1.0f / (1.0f + expf(-z)); }
float tanh_act(float z) { return tanhf(z); }
float relu(float z) { return z > 0.0f ? z : 0.0f; }
float leaky_relu(float z, float alpha) { return z >= 0.0f ? z : alpha * z; }

/* Derivatives */
float sigmoid_d(float z) { float s = sigmoid(z); return s * (1.0f - s); }
float tanh_d(float z) { float t = tanhf(z); return 1.0f - t * t; }
float relu_d(float z) { return z > 0.0f ? 1.0f : 0.0f; }
float leaky_relu_d(float z, float alpha) { return z >= 0.0f ? 1.0f : alpha; }

C++

#include <algorithm>
#include <cmath>

inline float sigmoid(float z) { return 1.0f / (1.0f + std::exp(-z)); }
inline float tanhAct(float z) { return std::tanh(z); }
inline float relu(float z) { return std::max(0.0f, z); }
inline float leakyRelu(float z, float alpha = 0.01f) {
    return z >= 0.0f ? z : alpha * z;
}

// Derivatives
inline float sigmoidD(float z) { float s = sigmoid(z); return s * (1.0f - s); }
inline float tanhD(float z) { float t = std::tanh(z); return 1.0f - t * t; }
inline float reluD(float z) { return z > 0.0f ? 1.0f : 0.0f; }
inline float leakyReluD(float z, float alpha = 0.01f) {
    return z >= 0.0f ? 1.0f : alpha;
}

Java

public class Activations {
    public static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }
    public static double tanh(double z) { return Math.tanh(z); }
    public static double relu(double z) { return Math.max(0.0, z); }
    public static double leakyRelu(double z, double alpha) {
        return z >= 0 ? z : alpha * z;
    }

    // Derivatives
    public static double sigmoidD(double z) { double s = sigmoid(z); return s * (1 - s); }
    public static double tanhD(double z) { double t = Math.tanh(z); return 1 - t * t; }
    public static double reluD(double z) { return z > 0 ? 1.0 : 0.0; }
    public static double leakyReluD(double z, double alpha) {
        return z >= 0 ? 1.0 : alpha;
    }
}

Go

import "math"

func Sigmoid(z float64) float64 { return 1.0 / (1.0 + math.Exp(-z)) }
func TanhAct(z float64) float64 { return math.Tanh(z) }
func ReLU(z float64) float64    { return math.Max(0, z) }

func LeakyReLU(z, alpha float64) float64 {
    if z >= 0 {
        return z
    }
    return alpha * z
}

// Derivatives
func SigmoidD(z float64) float64 { s := Sigmoid(z); return s * (1 - s) }
func TanhD(z float64) float64    { t := math.Tanh(z); return 1 - t*t }

func ReLUD(z float64) float64 {
    if z > 0 {
        return 1
    }
    return 0
}

func LeakyReLUD(z, alpha float64) float64 {
    if z >= 0 {
        return 1
    }
    return alpha
}

11. Summary
A perceptron is a linear classifier: output 1 if $\mathbf{w} \cdot \mathbf{x} + b > 0$, else 0. Rosenblatt's learning rule — "on a mistake, add $\eta y \mathbf{x}$ to $\mathbf{w}$" — provably finds a separating hyperplane in finite time when the data is linearly separable. The perceptron cannot represent XOR, which famously froze the field for a decade. But every layer of every modern neural network still begins with the same weighted sum, and everything else is built on top of it.
Where to go next
- → Gradient Descent — how modern networks update their weights when the activation is smooth.
- → Backpropagation — how to apply gradient descent through multiple layers, lifting the XOR curse.
- → Convolution — perceptrons with shared weights and spatial structure.
- Wikipedia: Perceptron — historical photos of the Mark I and more on Rosenblatt.
- Minsky & Papert, Perceptrons — the 1969 book that caused the winter.