Self-Attention
The core operation of the transformer. For every token in a sequence, look at every other token, decide how much each of them matters right now, and mix their information accordingly.
1. The idea in one sentence
Given a sequence of token vectors $x_1, x_2, \dots, x_n$, self-attention produces a new sequence $y_1, y_2, \dots, y_n$ in which each $y_i$ is a weighted combination of all the input vectors. The weights are computed dynamically from the inputs themselves — there are no fixed connections like in a CNN, no recurrent state like in an RNN. Every token looks at every other token, every time.
Language is contextual. The meaning of "bank" depends on whether the surrounding words are "river" or "money." Self-attention gives every token direct access to every other token — the network doesn't have to pass information through a long chain of hidden states. Dependencies are a single step away, no matter how far apart in the sequence.
The operation has three inputs per token (query, key, value), three linear projections, one softmax, and one weighted sum. That's it. Strip away the notation and self-attention is remarkably short. The entire transformer architecture is built out of stacked self-attention layers plus ordinary feedforward layers, and this combination now powers every frontier LLM.
2. Why attention replaced RNNs
Before 2017, the default architecture for sequences was the recurrent neural network (RNN), usually an LSTM. An RNN processes tokens left-to-right, carrying a hidden state vector that is updated at each step. This has two big problems:
- Long-range dependencies are hard. To connect the word at position 100 with information from position 1, the signal has to pass through 99 hidden-state updates. Gradients vanish or explode along the way, and even with LSTMs the effective context length is limited.
- It's inherently sequential. You cannot compute the hidden state at time $t$ until you've computed it at time $t-1$. GPUs are parallel machines, and this blocks parallelism at the core of training.
Attention was invented earlier (Bahdanau et al., 2014) as a mechanism to augment an RNN encoder-decoder, letting the decoder peek back at the encoder's hidden states. The 2017 paper Attention Is All You Need (Vaswani et al.) took the next step: throw out the RNN entirely and use only attention. The result — the Transformer — trains in parallel across sequence length and handles long-range dependencies in a single step.
3. Queries, keys, values
The mental model comes from classical information retrieval: imagine a database of (key, value) pairs, and you want to look something up. You issue a query; the system compares your query against every key, picks the closest match, and returns the corresponding value.
Self-attention does a soft version of this. Instead of picking one best match, it assigns a weight to every (key, value) pair based on how well it matches the query, and returns the weighted average of all values. The weights come from a softmax, so they sum to 1.
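To make the soft lookup concrete, here is a tiny numerical sketch; the keys, values, and query below are made-up numbers, purely for illustration:

```python
import numpy as np

# A "database" of three (key, value) pairs and one query.
keys   = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
values = np.array([[10.0], [20.0], [30.0]])
query  = np.array([1.0, 0.2])

scores  = keys @ query                            # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
result  = weights @ values                        # weighted average of all values

print(weights)  # highest weight on the best-matching key
print(result)   # a blend of the values, not a single hard pick
```

Unlike a hard lookup, every value contributes a little; sharpening or flattening the scores shifts how concentrated the blend is.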
For each input token $x_i$, we compute three vectors by linear projection:

$$q_i = x_i W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V$$
Here $W_Q, W_K, W_V$ are learnable matrices shared across all token positions. The query $q_i$ is "what is this token looking for?"; the key $k_j$ is "what does token $j$ have to offer?"; the value $v_j$ is "what information token $j$ contributes, if selected."
Three things to notice:
- Every token produces all three vectors. That's why it's called "self" attention — the same sequence supplies the queries, the keys, and the values.
- The projection matrices are shared across positions. Position-specific behavior comes from positional encoding, not from position-specific weights.
- Each of $q, k, v$ typically has a smaller dimension $d_k$ than the input $d_\text{model}$. For a transformer with $d_\text{model} = 512$ and 8 heads, $d_k = 64$.
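A quick shape check in NumPy, a sketch using the original Transformer's sizes, with random matrices standing in for learned weights:

```python
import numpy as np

n, d_model, d_k = 10, 512, 64            # 10 tokens; per-head sizes from the original paper
rng = np.random.default_rng(0)

X   = rng.normal(size=(n, d_model))      # one row per token
W_Q = rng.normal(size=(d_model, d_k))    # shared across all positions
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # every token yields all three vectors
print(Q.shape, K.shape, V.shape)         # (10, 64) (10, 64) (10, 64)
```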
4. Scaled dot-product attention
With $q, k, v$ in hand, the attention output for token $i$ is:

$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}}, \qquad \alpha_{ij} = \frac{e^{s_{ij}}}{\sum_{j'=1}^{n} e^{s_{ij'}}}, \qquad y_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j$$
Reading that from the inside out:
- $s_{ij}$ is the raw score measuring how well token $i$'s query matches token $j$'s key: just a dot product, scaled by $\sqrt{d_k}$.
- $\alpha_{ij}$ is the attention weight — the softmax over all keys, producing a probability distribution over positions.
- $y_i$ is the output — a weighted average of the value vectors, weighted by $\alpha$.
Why the $\sqrt{d_k}$ divide?
If $q$ and $k$ have independent components with mean 0 and variance 1, then the dot product $q \cdot k = \sum_{l=1}^{d_k} q_l k_l$ has mean 0 and variance $d_k$ — the scale grows with the dimension. Large scores push the softmax into a regime where one value dominates and the gradient is nearly zero. Dividing by $\sqrt{d_k}$ rescales the variance back to 1, keeping the softmax in its useful range.
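Both claims, the variance growth and the softmax saturation, are easy to check empirically; random Gaussian vectors stand in for real queries and keys here:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_k = 256
q = rng.normal(size=(100_000, d_k))      # unit-variance components
k = rng.normal(size=(100_000, d_k))
scores = (q * k).sum(axis=1)             # 100k raw dot products

print(scores.var())                      # close to d_k = 256
print((scores / np.sqrt(d_k)).var())     # close to 1 after scaling

# A row of 8 raw scores typically saturates the softmax; the scaled row stays soft.
row = scores[:8]
print(softmax(row).max())                # typically nearly all of the mass on one entry
print(softmax(row / np.sqrt(d_k)).max()) # a much flatter distribution
```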
A tiny worked example: let $d_k = 2$, and suppose three tokens have the following queries, keys, and values:
$q_1 = [1, 0],\ q_2 = [0, 1],\ q_3 = [1, 1]$
$k_1 = [1, 0],\ k_2 = [0, 1],\ k_3 = [1, 1]$
$v_1 = [2, 0],\ v_2 = [0, 3],\ v_3 = [1, 1]$
Compute $y_3$ (the attention output for token 3):
Raw scores: $q_3 \cdot k_1 = 1$, $q_3 \cdot k_2 = 1$, $q_3 \cdot k_3 = 2$
Scaled: divide by $\sqrt{2} \approx 1.414$: $s = [0.707,\ 0.707,\ 1.414]$
Exp: $[2.028,\ 2.028,\ 4.113]$, sum $= 8.169$
Softmax: $\alpha_3 \approx [0.248,\ 0.248,\ 0.504]$
$y_3 = 0.248 \cdot [2, 0] + 0.248 \cdot [0, 3] + 0.504 \cdot [1, 1] \approx [1.000,\ 1.248]$
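The same computation in NumPy, as a sanity check on the arithmetic above:

```python
import numpy as np

q3 = np.array([1.0, 1.0])
K  = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V  = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

s      = (K @ q3) / np.sqrt(2)        # scaled scores
alpha3 = np.exp(s) / np.exp(s).sum()  # softmax weights
y3     = alpha3 @ V                   # weighted sum of values

print(alpha3)  # ≈ [0.248, 0.248, 0.504]
print(y3)      # ≈ [1.000, 1.248]
```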
5. Matrix form
Computing this for every token independently would be slow. Instead we stack all $n$ queries into a matrix $Q \in \mathbb{R}^{n \times d_k}$, all keys into $K \in \mathbb{R}^{n \times d_k}$, all values into $V \in \mathbb{R}^{n \times d_v}$, and compute the whole thing as three matrix operations:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
The attention equation, symbol by symbol
- $n$: Number of tokens in the sequence (e.g. 512 tokens of a paragraph).
- $d_\text{model}$: Width of the model's residual stream — dimension of each token's vector before projection. Typically 512–4096.
- $d_k, d_v$: Dimension of the query/key space and the value space respectively. Often $d_k = d_v = d_\text{model} / h$ where $h$ is the number of heads.
- $Q \in \mathbb{R}^{n \times d_k}$: Query matrix. Each row $q_i$ encodes "what token $i$ is looking for." Computed as $Q = X W_Q$ from the input $X$ via the learned projection $W_Q$.
- $K \in \mathbb{R}^{n \times d_k}$: Key matrix. Each row $k_j$ encodes "what token $j$ advertises about itself." Computed as $K = X W_K$. Separate from the values — a token can advertise one thing and hand out another.
- $V \in \mathbb{R}^{n \times d_v}$: Value matrix. Each row $v_j$ is the actual content token $j$ contributes to whoever attends to it. Computed as $V = X W_V$.
- $QK^\top$: Shape $n \times n$. Entry $(i,j)$ is the dot product $q_i \cdot k_j$ — how well token $i$'s query matches token $j$'s key. This is a similarity matrix, one row per query token, one column per key token.
- $\sqrt{d_k}$: Scaling factor. Dot products of $d_k$-dimensional random vectors have variance that grows with $d_k$; dividing by $\sqrt{d_k}$ keeps them on a stable scale before the softmax. Without it, large $d_k$ pushes the softmax into extreme (near one-hot) regions where gradients vanish.
- $\text{softmax}(\cdot)$: Applied row-wise. Converts each row of logits into a probability distribution that sums to 1. This is what turns "raw similarity" into "attention weights" — a soft pointer over the whole sequence.
- $\text{softmax}(QK^\top/\sqrt{d_k}) \, V$: Final output, shape $n \times d_v$. Row $i$ is a convex combination of value vectors, weighted by how much token $i$ attends to each other token. Token $i$'s new representation is "a mixture of everyone's values, with proportions decided by my query against their keys."
Analogy: think of a seminar room. Every person in the room has (a) a question they're holding in their head ($q_i$), (b) a name tag describing their expertise ($k_j$), and (c) actual knowledge they can share ($v_j$). Token $i$ looks around the room, reads every name tag, and computes a compatibility score between its own question and each tag. The softmax turns those scores into percentages — "I'm 60% interested in person A, 30% in B, 10% in C." Token $i$'s new understanding is 60% of A's knowledge + 30% of B's + 10% of C's. Every token does this for itself in parallel. That's the entire self-attention mechanism.
Step by step with shapes:
- $Q K^\top$ has shape $n \times n$. Row $i$ column $j$ is $q_i \cdot k_j$.
- Scaling and softmax (row-wise) gives the attention matrix $A \in \mathbb{R}^{n \times n}$. Row $i$ is a probability distribution over all source positions.
- $A V$ has shape $n \times d_v$. Row $i$ is the output for token $i$.
That's it. A full self-attention layer is two matrix multiplications and a row-wise softmax — which is why it maps so well onto GPU hardware.
Source — Scaled Dot-Product Attention
function scaled_dot_product_attention(Q, K, V):
# Q: [n, d_k] K: [n, d_k] V: [n, d_v]
d_k = number of columns in K
scores = Q * K^T / sqrt(d_k) # [n, n]
weights = softmax(scores, axis=-1) # row-wise softmax
output = weights * V # [n, d_v]
return output
function softmax(x, axis):
x = x - max(x, axis) # numerical stability
e = exp(x)
return e / sum(e, axis)

Python
import numpy as np
def softmax(x):
# x: [..., n] — applied along last axis
x = x - x.max(axis=-1, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=-1, keepdims=True)
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Q: (n, d_k)
K: (n, d_k)
V: (n, d_v)
mask: optional (n, n) boolean array; True = keep, False = mask out
Returns: (n, d_v)
"""
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (n, n)
if mask is not None:
scores = np.where(mask, scores, -1e9)
weights = softmax(scores) # (n, n)
return weights @ V  # (n, d_v)

JavaScript
function softmax(row) {
const max = Math.max(...row);
const e = row.map(x => Math.exp(x - max));
const s = e.reduce((a, b) => a + b, 0);
return e.map(x => x / s);
}
function matMul(A, B) {
// A: [m, k] B: [k, n] returns [m, n]
const m = A.length, k = B.length, n = B[0].length;
return Array.from({length: m}, (_, i) =>
Array.from({length: n}, (_, j) =>
A[i].reduce((s, _, l) => s + A[i][l] * B[l][j], 0)));
}
function transpose(M) {
return M[0].map((_, j) => M.map(row => row[j]));
}
function scaledDotProductAttention(Q, K, V, mask) {
const dK = K[0].length;
const KT = transpose(K);
const scores = matMul(Q, KT).map(row => row.map(s => s / Math.sqrt(dK)));
if (mask) {
for (let i = 0; i < scores.length; i++)
for (let j = 0; j < scores[i].length; j++)
if (!mask[i][j]) scores[i][j] = -1e9;
}
const weights = scores.map(softmax);
return matMul(weights, V);
}

C
#include <math.h>
#include <stdlib.h>
/* Q[n][dk], K[n][dk], V[n][dv], out[n][dv] */
void scaled_dot_product_attention(
int n, int dk, int dv,
double Q[n][dk], double K[n][dk], double V[n][dv],
double out[n][dv])
{
double scores[n][n];
double sqrt_dk = sqrt((double)dk);
/* scores = Q * K^T / sqrt(dk) */
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++) {
double s = 0;
for (int l = 0; l < dk; l++) s += Q[i][l] * K[j][l];
scores[i][j] = s / sqrt_dk;
}
/* row-wise softmax */
for (int i = 0; i < n; i++) {
double mx = scores[i][0];
for (int j = 1; j < n; j++) if (scores[i][j] > mx) mx = scores[i][j];
double sum = 0;
for (int j = 0; j < n; j++) { scores[i][j] = exp(scores[i][j] - mx); sum += scores[i][j]; }
for (int j = 0; j < n; j++) scores[i][j] /= sum;
}
/* out = scores * V */
for (int i = 0; i < n; i++)
for (int d = 0; d < dv; d++) {
double s = 0;
for (int j = 0; j < n; j++) s += scores[i][j] * V[j][d];
out[i][d] = s;
}
}

C++
#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;
using Mat = vector<vector<double>>;
Mat matMul(const Mat& A, const Mat& B) {
int m = A.size(), k = B.size(), n = B[0].size();
Mat C(m, vector<double>(n, 0));
for (int i = 0; i < m; i++)
for (int l = 0; l < k; l++)
for (int j = 0; j < n; j++)
C[i][j] += A[i][l] * B[l][j];
return C;
}
Mat transpose(const Mat& M) {
Mat T(M[0].size(), vector<double>(M.size()));
for (int i = 0; i < (int)M.size(); i++)
for (int j = 0; j < (int)M[0].size(); j++)
T[j][i] = M[i][j];
return T;
}
Mat scaledDotProductAttention(const Mat& Q, const Mat& K, const Mat& V) {
double sqrt_dk = sqrt((double)K[0].size());
Mat scores = matMul(Q, transpose(K));
for (auto& row : scores) {
double mx = *max_element(row.begin(), row.end());
double s = 0;
for (auto& x : row) { x = exp((x - mx) / sqrt_dk); s += x; }
for (auto& x : row) x /= s;
}
return matMul(scores, V);
}

Java
public class ScaledDotProductAttention {
static double[][] matMul(double[][] A, double[][] B) {
int m = A.length, k = B.length, n = B[0].length;
double[][] C = new double[m][n];
for (int i = 0; i < m; i++)
for (int l = 0; l < k; l++)
for (int j = 0; j < n; j++)
C[i][j] += A[i][l] * B[l][j];
return C;
}
static double[][] transpose(double[][] M) {
double[][] T = new double[M[0].length][M.length];
for (int i = 0; i < M.length; i++)
for (int j = 0; j < M[0].length; j++)
T[j][i] = M[i][j];
return T;
}
public static double[][] attention(double[][] Q, double[][] K, double[][] V) {
double sqrtDk = Math.sqrt(K[0].length);
double[][] KT = transpose(K);
double[][] scores = matMul(Q, KT);
for (double[] row : scores) {
double max = Double.NEGATIVE_INFINITY;
for (double v : row) if (v > max) max = v;
double sum = 0;
for (int j = 0; j < row.length; j++) { row[j] = Math.exp((row[j] - max) / sqrtDk); sum += row[j]; }
for (int j = 0; j < row.length; j++) row[j] /= sum;
}
return matMul(scores, V);
}
}

Go
package attention
import "math"
func matMul(A, B [][]float64) [][]float64 {
m, k, n := len(A), len(B), len(B[0])
C := make([][]float64, m)
for i := range C {
C[i] = make([]float64, n)
for l := 0; l < k; l++ {
for j := 0; j < n; j++ {
C[i][j] += A[i][l] * B[l][j]
}
}
}
return C
}
func transpose(M [][]float64) [][]float64 {
rows, cols := len(M), len(M[0])
T := make([][]float64, cols)
for j := range T {
T[j] = make([]float64, rows)
for i := 0; i < rows; i++ {
T[j][i] = M[i][j]
}
}
return T
}
func ScaledDotProductAttention(Q, K, V [][]float64) [][]float64 {
sqrtDk := math.Sqrt(float64(len(K[0])))
scores := matMul(Q, transpose(K))
for _, row := range scores {
mx := row[0]
for _, v := range row { if v > mx { mx = v } }
sum := 0.0
for j := range row { row[j] = math.Exp((row[j] - mx) / sqrtDk); sum += row[j] }
for j := range row { row[j] /= sum }
}
return matMul(scores, V)
}

6. Interactive: attention over 3 tokens
Below is a step-through of self-attention for a 3-token sequence, stage by stage: raw dot products, scaled scores, softmax weights, and the final weighted sum. We'll compute $y_2$, the self-attention output for the token "cat": its query vector is compared to each token's key, producing attention weights that combine the values. Editing any of the Q, K, V entries changes the attention distribution; watch what happens when one query is strongly aligned with one key.
7. Multi-head attention
A single attention operation produces a single "view" of the sequence — one set of attention weights that has to capture every relationship the network cares about. In practice, we want multiple views: one head might track subject-verb agreement, another might track long-range coreference, another might track local syntax.
Multi-head attention runs $h$ parallel attention operations on different linear projections of the same input, then concatenates the results and projects them back:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O, \qquad \text{head}_i = \text{Attention}(X W_i^Q,\, X W_i^K,\, X W_i^V)$$
Each head uses its own projection matrices $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_\text{model} \times d_k}$, with $d_k = d_\text{model} / h$ so the total parameter count stays constant. After concatenation, a final linear projection $W^O \in \mathbb{R}^{d_\text{model} \times d_\text{model}}$ mixes the heads back together.
In the original Transformer: $d_\text{model} = 512$, $h = 8$ heads, so each head has $d_k = d_v = 64$. GPT-3 uses 96 heads with $d_k = 128$ and $d_\text{model} = 12{,}288$. Modern LLMs use dozens to hundreds of heads per layer, and dozens of layers.
Empirically, different heads really do specialize. Visualizing attention patterns in a trained transformer reveals heads that attend to the previous token, heads that attend to the matching bracket, heads that attend to the subject of the sentence. Not all heads are interpretable, but enough of them are to make the story compelling.
Source — Multi-Head Attention
function multi_head_attention(X, h, W_Q[h], W_K[h], W_V[h], W_O):
# X: [n, d_model]
# h: number of heads
# W_Q[i], W_K[i], W_V[i]: [d_model, d_k] where d_k = d_model / h
# W_O: [h*d_k, d_model]
heads = []
for i in 0..h-1:
Q_i = X * W_Q[i] # [n, d_k]
K_i = X * W_K[i] # [n, d_k]
V_i = X * W_V[i] # [n, d_k]
head_i = attention(Q_i, K_i, V_i) # [n, d_k]
heads.append(head_i)
concat = horizontal_concat(heads) # [n, h*d_k]
return concat * W_O  # [n, d_model]

Python
import numpy as np
def softmax(x):
x = x - x.max(axis=-1, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=-1, keepdims=True)
def attention(Q, K, V):
d_k = Q.shape[-1]
return softmax(Q @ K.T / np.sqrt(d_k)) @ V
def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
"""
X: (n, d_model)
W_Q, W_K, W_V: each (h, d_model, d_k)
W_O: (h * d_k, d_model)
Returns (n, d_model)
"""
heads = []
for i in range(h):
Q_i = X @ W_Q[i] # (n, d_k)
K_i = X @ W_K[i]
V_i = X @ W_V[i]
heads.append(attention(Q_i, K_i, V_i))
concat = np.concatenate(heads, axis=-1) # (n, h*d_k)
return concat @ W_O  # (n, d_model)

JavaScript
function softmax(row) {
const max = Math.max(...row);
const e = row.map(x => Math.exp(x - max));
const s = e.reduce((a, b) => a + b, 0);
return e.map(x => x / s);
}
function matMul(A, B) {
const m = A.length, k = B.length, n = B[0].length;
return Array.from({length: m}, (_, i) =>
Array.from({length: n}, (_, j) =>
A[i].reduce((s, _, l) => s + A[i][l] * B[l][j], 0)));
}
function transpose(M) { return M[0].map((_, j) => M.map(r => r[j])); }
function attention(Q, K, V) {
const dK = K[0].length;
const scores = matMul(Q, transpose(K)).map(r => r.map(s => s / Math.sqrt(dK)));
return matMul(scores.map(softmax), V);
}
// W_Q, W_K, W_V: array of h matrices, each [d_model][d_k]
// W_O: [h*d_k][d_model]
function multiHeadAttention(X, W_Q, W_K, W_V, W_O, h) {
const heads = [];
for (let i = 0; i < h; i++) {
const Qi = matMul(X, W_Q[i]);
const Ki = matMul(X, W_K[i]);
const Vi = matMul(X, W_V[i]);
heads.push(attention(Qi, Ki, Vi));
}
// horizontal concatenate heads: each is [n][d_k]
const concat = heads[0].map((row, n) => heads.flatMap(h => h[n]));
return matMul(concat, W_O);
}

C
/* Simplified: fixed n, d_model, h, dk = d_model/h */
#include <math.h>
#include <string.h>
#define N 4
#define DM 8
#define H 2
#define DK (DM/H)
void attention_head(double Q[N][DK], double K[N][DK], double V[N][DK],
double out[N][DK]) {
double scores[N][N];
double sq = sqrt((double)DK);
for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) {
double s = 0; for (int l = 0; l < DK; l++) s += Q[i][l]*K[j][l];
scores[i][j] = s / sq;
}
for (int i = 0; i < N; i++) {
double mx = scores[i][0]; for (int j=1;j<N;j++) if(scores[i][j]>mx) mx=scores[i][j];
double s = 0; for (int j=0;j<N;j++){scores[i][j]=exp(scores[i][j]-mx);s+=scores[i][j];}
for (int j=0;j<N;j++) scores[i][j]/=s;
}
for (int i=0;i<N;i++) for (int d=0;d<DK;d++) {
double s=0; for(int j=0;j<N;j++) s+=scores[i][j]*V[j][d]; out[i][d]=s;
}
}
/* W_Q[h][DM][DK], W_K[h][DM][DK], W_V[h][DM][DK], W_O[H*DK][DM] */
void multi_head_attention(double X[N][DM],
double W_Q[H][DM][DK], double W_K[H][DM][DK], double W_V[H][DM][DK],
double W_O[H*DK][DM], double out[N][DM]) {
double heads[H][N][DK];
for (int h = 0; h < H; h++) {
double Q[N][DK], K[N][DK], V[N][DK];
for (int i=0;i<N;i++) for (int d=0;d<DK;d++) {
double q=0,k=0,v=0;
for (int m=0;m<DM;m++){q+=X[i][m]*W_Q[h][m][d];k+=X[i][m]*W_K[h][m][d];v+=X[i][m]*W_V[h][m][d];}
Q[i][d]=q; K[i][d]=k; V[i][d]=v;
}
attention_head(Q, K, V, heads[h]);
}
/* concat heads and project with W_O */
for (int i=0;i<N;i++) for (int d=0;d<DM;d++) {
double s=0;
for (int h=0;h<H;h++) for (int dk=0;dk<DK;dk++) s+=heads[h][i][dk]*W_O[h*DK+dk][d];
out[i][d]=s;
}
}

C++
#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;
using Mat = vector<vector<double>>;
Mat matMul(const Mat& A, const Mat& B) {
int m=A.size(),k=B.size(),n=B[0].size(); Mat C(m,vector<double>(n,0));
for(int i=0;i<m;i++) for(int l=0;l<k;l++) for(int j=0;j<n;j++) C[i][j]+=A[i][l]*B[l][j];
return C;
}
Mat transpose(const Mat& M) {
Mat T(M[0].size(),vector<double>(M.size()));
for(int i=0;i<(int)M.size();i++) for(int j=0;j<(int)M[0].size();j++) T[j][i]=M[i][j];
return T;
}
Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
double sq=sqrt((double)K[0].size()); Mat S=matMul(Q,transpose(K));
for(auto& r:S){ double mx=*max_element(r.begin(),r.end()),s=0;
for(auto& x:r){x=exp((x-mx)/sq);s+=x;} for(auto& x:r) x/=s;}
return matMul(S,V);
}
// W_Q/K/V: vector of h matrices [d_model x d_k]; W_O: [h*d_k x d_model]
Mat multiHeadAttention(const Mat& X, const vector<Mat>& WQ, const vector<Mat>& WK,
const vector<Mat>& WV, const Mat& WO) {
int h=WQ.size(); Mat concat;
for(int i=0;i<h;i++){
Mat head=attention(matMul(X,WQ[i]),matMul(X,WK[i]),matMul(X,WV[i]));
if(concat.empty()) concat=head;
else for(int r=0;r<(int)head.size();r++)
concat[r].insert(concat[r].end(),head[r].begin(),head[r].end());
}
return matMul(concat,WO);
}

Java
import java.util.Arrays;
public class MultiHeadAttention {
static double[][] matMul(double[][] A, double[][] B) {
int m=A.length,k=B.length,n=B[0].length; double[][] C=new double[m][n];
for(int i=0;i<m;i++) for(int l=0;l<k;l++) for(int j=0;j<n;j++) C[i][j]+=A[i][l]*B[l][j];
return C;
}
static double[][] transpose(double[][] M) {
double[][] T=new double[M[0].length][M.length];
for(int i=0;i<M.length;i++) for(int j=0;j<M[0].length;j++) T[j][i]=M[i][j];
return T;
}
static double[][] attention(double[][] Q, double[][] K, double[][] V) {
double sq=Math.sqrt(K[0].length); double[][] S=matMul(Q,transpose(K));
for(double[] r:S){
double mx=Arrays.stream(r).max().getAsDouble(),s=0;
for(int j=0;j<r.length;j++){r[j]=Math.exp((r[j]-mx)/sq);s+=r[j];}
for(int j=0;j<r.length;j++) r[j]/=s;
}
return matMul(S,V);
}
// WQ, WK, WV: h x d_model x d_k; WO: h*d_k x d_model
public static double[][] forward(double[][] X, double[][][] WQ, double[][][] WK,
double[][][] WV, double[][] WO) {
int h=WQ.length, n=X.length, dk=WQ[0][0].length;
double[][] concat=new double[n][h*dk];
for(int i=0;i<h;i++){
double[][] head=attention(matMul(X,WQ[i]),matMul(X,WK[i]),matMul(X,WV[i]));
for(int r=0;r<n;r++) for(int d=0;d<dk;d++) concat[r][i*dk+d]=head[r][d];
}
return matMul(concat,WO);
}
}

Go
package attention
import "math"
func matMul(A, B [][]float64) [][]float64 {
m, k, n := len(A), len(B), len(B[0])
C := make([][]float64, m)
for i := range C { C[i] = make([]float64, n)
for l := 0; l < k; l++ { for j := 0; j < n; j++ { C[i][j] += A[i][l] * B[l][j] } } }
return C
}
func transpose(M [][]float64) [][]float64 {
T := make([][]float64, len(M[0]))
for j := range T { T[j] = make([]float64, len(M))
for i := range M { T[j][i] = M[i][j] } }
return T
}
func singleAttention(Q, K, V [][]float64) [][]float64 {
sq := math.Sqrt(float64(len(K[0]))); S := matMul(Q, transpose(K))
for _, row := range S {
mx := row[0]; for _, v := range row { if v > mx { mx = v } }
s := 0.0; for j := range row { row[j] = math.Exp((row[j] - mx) / sq); s += row[j] }
for j := range row { row[j] /= s }
}
return matMul(S, V)
}
// WQ, WK, WV: slice of h matrices [d_model x d_k]; WO: [h*d_k x d_model]
func MultiHeadAttention(X [][]float64, WQ, WK, WV [][][]float64, WO [][]float64) [][]float64 {
h := len(WQ); n := len(X); dk := len(WQ[0][0])
concat := make([][]float64, n)
for i := range concat { concat[i] = make([]float64, h*dk) }
for i := 0; i < h; i++ {
head := singleAttention(matMul(X, WQ[i]), matMul(X, WK[i]), matMul(X, WV[i]))
for r := 0; r < n; r++ { for d := 0; d < dk; d++ { concat[r][i*dk+d] = head[r][d] } }
}
return matMul(concat, WO)
}

8. Positional encoding
Here is a subtle problem: self-attention is permutation-equivariant. If you shuffle the order of the input tokens, the output tokens shuffle in the same way but their values are unchanged. That means the attention operation, by itself, has no idea about token order. "The cat sat" and "Sat cat the" would be processed identically.
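The permutation-equivariance claim is easy to verify numerically; random weights and tiny dimensions here, purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(W_Q.shape[1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))

perm = rng.permutation(5)                          # shuffle the token order
Y      = self_attention(X, W_Q, W_K, W_V)
Y_shuf = self_attention(X[perm], W_Q, W_K, W_V)

print(np.allclose(Y[perm], Y_shuf))  # True: outputs shuffle, values unchanged
```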
Language cares desperately about order. So we inject position information by adding a positional encoding vector to each token embedding before the first self-attention layer.
Sinusoidal positional encoding (the original)
Each dimension of the positional encoding is a sine or cosine wave at a different frequency:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_\text{model}}}\right)$$

The frequencies span from very fast (position unique over a few tokens) to very slow (position unique over tens of thousands of tokens). The network can learn to attend to relative positions because for any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$.
Learned positional embeddings
Another option: treat position as a vocabulary and learn an embedding for each position, exactly like token embeddings. Simpler, works about as well, but has a fixed maximum length built in.
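A sketch of the learned variant; the table below is randomly initialized, whereas in a real model it would be trained by backprop like any other weight:

```python
import numpy as np

max_len, d_model = 1024, 512
rng = np.random.default_rng(0)
pos_embedding = rng.normal(0, 0.02, size=(max_len, d_model))  # one learned row per position

def add_position(X):
    """Add the position vector for each index: the same lookup as token embeddings."""
    n = X.shape[0]
    return X + pos_embedding[:n]   # fails for n > max_len: the fixed limit mentioned above

X = rng.normal(size=(10, d_model))
print(add_position(X).shape)       # (10, 512)
```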
Rotary positional embedding (RoPE)
The modern favorite, used in LLaMA, Qwen, and most recent LLMs. Instead of adding a position vector to the embedding, RoPE rotates pairs of components of the query and key by an angle proportional to their position. Dot products $q_i \cdot k_j$ then depend on the relative position $i - j$ rather than absolute positions, which turns out to generalize much better to longer sequences.
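A minimal sketch of the rotation idea. Note that real implementations (LLaMA and others) differ in exactly how they pair up dimensions; the interleaved pairing here is just for illustration:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos."""
    d = x.shape[-1]
    theta = pos * base ** (-np.arange(0, d, 2) / d)  # one angle per pair
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The query-key dot product depends only on the relative offset, not absolute positions:
d1 = rope(q, 3)  @ rope(k, 1)    # positions 3 and 1, offset 2
d2 = rope(q, 10) @ rope(k, 8)    # positions 10 and 8, same offset 2
print(np.isclose(d1, d2))        # True
```

Because each pair is rotated by the same per-pair frequency, the rotations applied to $q_i$ and $k_j$ compose into a single rotation by the offset angle, which is what makes the dot product a function of $i - j$ alone.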
Source — Sinusoidal Positional Encoding
function sinusoidal_pe(max_len, d_model):
PE = zeros(max_len, d_model)
for pos in 0..max_len-1:
for i in 0..d_model/2-1:
angle = pos / 10000^(2*i / d_model)
PE[pos][2*i] = sin(angle)
PE[pos][2*i+1] = cos(angle)
return PE # shape [max_len, d_model]
# Usage: add PE to token embeddings before the first attention layer
# X = token_embeddings + PE[:n]

Python
import numpy as np
def sinusoidal_positional_encoding(max_len, d_model):
"""
Returns PE of shape (max_len, d_model).
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
"""
PE = np.zeros((max_len, d_model))
positions = np.arange(max_len)[:, np.newaxis] # (max_len, 1)
dims = np.arange(0, d_model, 2) # [0, 2, 4, ...]
angles = positions / np.power(10000, dims / d_model) # (max_len, d_model/2)
PE[:, 0::2] = np.sin(angles)
PE[:, 1::2] = np.cos(angles)
return PE

JavaScript
function sinusoidalPE(maxLen, dModel) {
const PE = Array.from({length: maxLen}, () => new Array(dModel).fill(0));
for (let pos = 0; pos < maxLen; pos++) {
for (let i = 0; i < dModel / 2; i++) {
const angle = pos / Math.pow(10000, (2 * i) / dModel);
PE[pos][2 * i] = Math.sin(angle);
PE[pos][2 * i + 1] = Math.cos(angle);
}
}
return PE;
}

C
#include <math.h>
#include <stdlib.h>
/* Returns heap-allocated PE[max_len][d_model] (caller frees each row) */
double** sinusoidal_pe(int max_len, int d_model) {
double** PE = malloc(max_len * sizeof(double*));
for (int pos = 0; pos < max_len; pos++) {
PE[pos] = calloc(d_model, sizeof(double));
for (int i = 0; i < d_model / 2; i++) {
double angle = pos / pow(10000.0, 2.0 * i / d_model);
PE[pos][2 * i] = sin(angle);
PE[pos][2 * i + 1] = cos(angle);
}
}
return PE;
}

C++
#include <vector>
#include <cmath>
using namespace std;
vector<vector<double>> sinusoidalPE(int maxLen, int dModel) {
vector<vector<double>> PE(maxLen, vector<double>(dModel, 0.0));
for (int pos = 0; pos < maxLen; pos++) {
for (int i = 0; i < dModel / 2; i++) {
double angle = pos / pow(10000.0, 2.0 * i / dModel);
PE[pos][2 * i] = sin(angle);
PE[pos][2 * i + 1] = cos(angle);
}
}
return PE;
}

Java
public class SinusoidalPE {
public static double[][] compute(int maxLen, int dModel) {
double[][] PE = new double[maxLen][dModel];
for (int pos = 0; pos < maxLen; pos++) {
for (int i = 0; i < dModel / 2; i++) {
double angle = pos / Math.pow(10000.0, 2.0 * i / dModel);
PE[pos][2 * i] = Math.sin(angle);
PE[pos][2 * i + 1] = Math.cos(angle);
}
}
return PE;
}
}

Go
package attention
import "math"
func SinusoidalPE(maxLen, dModel int) [][]float64 {
PE := make([][]float64, maxLen)
for pos := range PE {
PE[pos] = make([]float64, dModel)
for i := 0; i < dModel/2; i++ {
angle := float64(pos) / math.Pow(10000, float64(2*i)/float64(dModel))
PE[pos][2*i] = math.Sin(angle)
PE[pos][2*i+1] = math.Cos(angle)
}
}
return PE
}

9. A full transformer block
Self-attention is powerful on its own but not quite enough. A transformer layer wraps multi-head self-attention together with a feedforward network and some crucial plumbing (residual connections and layer normalization):
# Input: x of shape [n, d_model]
a = LayerNorm(x)
a = MultiHeadAttention(a, a, a)  # Q, K, V all from a
x = x + a                        # residual
b = LayerNorm(x)
b = Linear(d_ff)(b)              # expand (typically d_ff = 4·d_model)
b = GELU(b)
b = Linear(d_model)(b)           # project back
x = x + b                        # residual
return x
- LayerNorm normalizes each token vector independently (subtract mean, divide by std), stabilizing training across depth.
- Residual connections ($x + f(x)$) let gradients flow directly from any layer to any earlier layer, avoiding the vanishing-gradient problem.
- The feedforward sublayer is applied independently to each token (it's a 2-layer MLP). Despite having no mixing across tokens, this sublayer carries most of the transformer's parameters and does most of the "thinking" — self-attention shuffles information between positions, the FFN processes it.
Stack 12 of these blocks: you have BERT-base or GPT-1. Stack 96 blocks with $d_\text{model} = 12{,}288$ and 96 heads: you have GPT-3 (175B parameters). Stack a few hundred with mixture-of-experts and you have GPT-4, Claude, Gemini, Grok. The architecture is the same; what changed is scale and training data.
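The claim that the feedforward sublayer carries most of the parameters is easy to check with a back-of-the-envelope count (biases and LayerNorm parameters ignored; original-Transformer sizes):

```python
d_model = 512
d_ff    = 4 * d_model   # standard expansion factor

# Attention: W_Q, W_K, W_V, W_O — each d_model x d_model overall (heads partition d_k).
attn_params = 4 * d_model * d_model

# FFN: expand to d_ff, then project back to d_model.
ffn_params = d_model * d_ff + d_ff * d_model

print(attn_params, ffn_params)                  # 1048576 2097152
print(ffn_params / (attn_params + ffn_params))  # 2/3 of the block's weights
```

With $d_\text{ff} = 4 d_\text{model}$, the FFN holds exactly twice the attention sublayer's weights, i.e. two thirds of each block.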
Source — Transformer Encoder Block
function transformer_encoder_block(x, params):
# Pre-norm variant (used by most modern models)
# x: [n, d_model]
# --- Self-attention sub-layer ---
a = layer_norm(x)
a = multi_head_attention(a, a, a, params.W_Q, params.W_K, params.W_V, params.W_O)
x = x + a # residual connection
# --- Feed-forward sub-layer ---
b = layer_norm(x)
b = linear(b, params.W1, params.b1) # [n, d_ff] d_ff = 4 * d_model
b = gelu(b)
b = linear(b, params.W2, params.b2) # [n, d_model]
x = x + b # residual connection
return x

Python
import numpy as np
def layer_norm(x, eps=1e-5):
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)
return (x - mean) / (std + eps)
def gelu(x):
return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))
def softmax(x):
x = x - x.max(axis=-1, keepdims=True)
e = np.exp(x); return e / e.sum(axis=-1, keepdims=True)
def mha(X, WQ, WK, WV, WO, h):
heads = []
for i in range(h):
d_k = WQ[i].shape[1]
S = softmax((X @ WQ[i]) @ (X @ WK[i]).T / np.sqrt(d_k))
heads.append(S @ (X @ WV[i]))
return np.concatenate(heads, axis=-1) @ WO
def transformer_encoder_block(x, WQ, WK, WV, WO, W1, b1, W2, b2, h):
"""
x: (n, d_model)
WQ/WK/WV: (h, d_model, d_k) WO: (h*d_k, d_model)
W1: (d_model, d_ff) b1: (d_ff,)
W2: (d_ff, d_model) b2: (d_model,)
"""
# Attention sub-layer
a = layer_norm(x)
a = mha(a, WQ, WK, WV, WO, h)
x = x + a
# Feed-forward sub-layer
b = layer_norm(x)
b = gelu(b @ W1 + b1)
b = b @ W2 + b2
x = x + b
return x

JavaScript
function layerNorm(X, eps=1e-5) {
return X.map(row => {
const mean = row.reduce((a,b)=>a+b,0)/row.length;
const std = Math.sqrt(row.reduce((a,v)=>a+(v-mean)**2,0)/row.length + eps);
return row.map(v => (v-mean)/std);
});
}
function gelu(x) {
return x.map(row => row.map(v =>
0.5*v*(1+Math.tanh(Math.sqrt(2/Math.PI)*(v+0.044715*v**3)))));
}
function softmax(row){const m=Math.max(...row),e=row.map(x=>Math.exp(x-m)),s=e.reduce((a,b)=>a+b,0);return e.map(x=>x/s);}
function matMul(A,B){const m=A.length,k=B.length,n=B[0].length;return Array.from({length:m},(_,i)=>Array.from({length:n},(_,j)=>A[i].reduce((s,_,l)=>s+A[i][l]*B[l][j],0)));}
function transpose(M){return M[0].map((_,j)=>M.map(r=>r[j]));}
function addBias(M,b){return M.map(row=>row.map((v,j)=>v+b[j]));}
function transformerEncoderBlock(x, WQ, WK, WV, WO, W1, b1, W2, b2, h) {
// Attention
let a = layerNorm(x);
const heads = WQ.map((_,i)=>{
const dk=WQ[i][0].length;
const S=matMul(matMul(a,WQ[i]),transpose(matMul(a,WK[i]))).map(r=>softmax(r.map(v=>v/Math.sqrt(dk))));
return matMul(S,matMul(a,WV[i]));
});
  const concat = heads[0].map((_, r) => heads.flatMap(hd => hd[r]));
a = matMul(concat, WO);
x = x.map((row,i)=>row.map((v,j)=>v+a[i][j]));
// FFN
let b = layerNorm(x);
b = gelu(addBias(matMul(b,W1),b1));
b = addBias(matMul(b,W2),b2);
return x.map((row,i)=>row.map((v,j)=>v+b[i][j]));
}

/* Sketch: calls previously defined helpers (layer_norm, attention, matmul, gelu).
Assumes fixed dimensions N, DM, H, DK=DM/H, DF=4*DM. */
#include <math.h>
#include <string.h>
/* gelu approximation */
static double gelu(double x) {
return 0.5*x*(1.0+tanh(0.7978845608*(x+0.044715*x*x*x)));
}
void transformer_encoder_block(
int n, int dm, int h, int dk, int df,
double x[n][dm],
/* MHA weights */
double WQ[/*h*/][dm][dk], double WK[/*h*/][dm][dk],
double WV[/*h*/][dm][dk], double WO[/*h*dk*/][dm],
/* FFN weights */
double W1[dm][df], double b1[df],
double W2[df][dm], double b2[dm],
double out[n][dm])
{
/* layer_norm + MHA + residual -- schematic */
/* (full implementation omitted for brevity; uses helpers from prior sections) */
/* b = LayerNorm(x); a = MHA(b); x += a */
/* b = LayerNorm(x); b = W2*(gelu(W1*b + b1)) + b2; x += b */
/* copy to out */
memcpy(out, x, n*dm*sizeof(double));
}

#include <vector>
#include <cmath>
#include <numeric>
using namespace std;
using Mat = vector<vector<double>>;
Mat layerNorm(const Mat& X, double eps=1e-5) {
Mat Y = X;
for (auto& row : Y) {
double mean = accumulate(row.begin(),row.end(),0.0)/row.size();
double var = 0; for (double v:row) var+=(v-mean)*(v-mean); var/=row.size();
for (auto& v:row) v=(v-mean)/sqrt(var+eps);
}
return Y;
}
Mat geluMat(const Mat& X) {
Mat Y=X; for(auto& row:Y) for(auto& v:row)
v=0.5*v*(1+tanh(0.7978845608*(v+0.044715*v*v*v)));
return Y;
}
Mat addMat(const Mat& A, const Mat& B){
Mat C=A; for(int i=0;i<(int)A.size();i++) for(int j=0;j<(int)A[0].size();j++) C[i][j]+=B[i][j];
return C;
}
// Calls multiHeadAttention from prior section
extern Mat multiHeadAttention(const Mat&,const vector<Mat>&,const vector<Mat>&,const vector<Mat>&,const Mat&);
extern Mat matMul(const Mat&,const Mat&);
Mat transformerEncoderBlock(const Mat& x, const vector<Mat>& WQ, const vector<Mat>& WK,
const vector<Mat>& WV, const Mat& WO,
const Mat& W1, const vector<double>& b1,
const Mat& W2, const vector<double>& b2) {
// Attention sub-layer
Mat a = layerNorm(x);
a = multiHeadAttention(a, WQ, WK, WV, WO);
Mat x2 = addMat(x, a);
// FFN sub-layer
Mat b = layerNorm(x2);
b = matMul(b, W1);
for(int i=0;i<(int)b.size();i++) for(int j=0;j<(int)b[0].size();j++) b[i][j]+=b1[j];
b = geluMat(b);
b = matMul(b, W2);
for(int i=0;i<(int)b.size();i++) for(int j=0;j<(int)b[0].size();j++) b[i][j]+=b2[j];
return addMat(x2, b);
}

import java.util.Arrays;
public class TransformerEncoderBlock {
static double gelu(double x) {
return 0.5*x*(1+Math.tanh(0.7978845608*(x+0.044715*x*x*x)));
}
static double[][] layerNorm(double[][] X, double eps) {
double[][] Y = new double[X.length][X[0].length];
for (int i=0;i<X.length;i++) {
double mean=Arrays.stream(X[i]).average().orElse(0);
double var=Arrays.stream(X[i]).map(v->(v-mean)*(v-mean)).average().orElse(0);
double std=Math.sqrt(var+eps);
for(int j=0;j<X[i].length;j++) Y[i][j]=(X[i][j]-mean)/std;
}
return Y;
}
static double[][] matMul(double[][] A, double[][] B) {
int m=A.length,k=B.length,n=B[0].length; double[][] C=new double[m][n];
for(int i=0;i<m;i++) for(int l=0;l<k;l++) for(int j=0;j<n;j++) C[i][j]+=A[i][l]*B[l][j];
return C;
}
// forward: schematic using helpers defined in prior classes
// (MultiHeadAttention.forward, layerNorm, matMul)
public static double[][] forward(double[][] x,
double[][][] WQ, double[][][] WK, double[][][] WV, double[][] WO,
double[][] W1, double[] b1, double[][] W2, double[] b2) {
// Attention sub-layer
double[][] a = layerNorm(x, 1e-5);
a = MultiHeadAttention.forward(a, WQ, WK, WV, WO);
for(int i=0;i<x.length;i++) for(int j=0;j<x[0].length;j++) x[i][j]+=a[i][j];
// FFN sub-layer
double[][] b = layerNorm(x, 1e-5);
b = matMul(b, W1);
for(int i=0;i<b.length;i++) for(int j=0;j<b[0].length;j++) b[i][j]=gelu(b[i][j]+b1[j]);
b = matMul(b, W2);
for(int i=0;i<b.length;i++) for(int j=0;j<b[0].length;j++) x[i][j]+=b[i][j]+b2[j];
return x;
}
}

package attention
import (
"math"
)
func layerNorm(X [][]float64, eps float64) [][]float64 {
Y := make([][]float64, len(X))
for i, row := range X {
mean := 0.0; for _, v := range row { mean += v }; mean /= float64(len(row))
vari := 0.0; for _, v := range row { vari += (v-mean)*(v-mean) }; vari /= float64(len(row))
std := math.Sqrt(vari + eps)
Y[i] = make([]float64, len(row))
for j, v := range row { Y[i][j] = (v - mean) / std }
}
return Y
}
func geluVal(x float64) float64 {
return 0.5 * x * (1 + math.Tanh(0.7978845608*(x+0.044715*x*x*x)))
}
func addMat(A, B [][]float64) [][]float64 {
C := make([][]float64, len(A))
for i := range A { C[i] = make([]float64, len(A[i]))
for j := range A[i] { C[i][j] = A[i][j] + B[i][j] } }
return C
}
// TransformerEncoderBlock: pre-norm encoder block.
// Calls MultiHeadAttention and matMul defined in prior sections.
func TransformerEncoderBlock(x [][]float64,
	WQ, WK, WV [][][]float64, WO [][]float64,
	W1 [][]float64, b1 []float64,
	W2 [][]float64, b2 []float64) [][]float64 {
// Attention sub-layer
a := MultiHeadAttention(layerNorm(x, 1e-5), WQ, WK, WV, WO)
x = addMat(x, a)
// FFN sub-layer
b := matMul(layerNorm(x, 1e-5), W1)
for i := range b { for j := range b[i] { b[i][j] = geluVal(b[i][j] + b1[j]) } }
b = matMul(b, W2)
for i := range b { for j := range b[i] { b[i][j] += b2[j] } }
return addMat(x, b)
}

10. Cost, causal masks, and variants
Quadratic cost
The core bottleneck: $Q K^\top$ has shape $n \times n$. Memory and compute scale as $O(n^2 d)$ in sequence length. For $n = 1000$ it's fine; for $n = 100{,}000$ it's painful; for $n = 1{,}000{,}000$ it's currently infeasible without tricks. Much modern research is about reducing this cost: FlashAttention (fuse the kernels to avoid materializing the attention matrix), linear attention (approximate softmax with a kernel that factors as $\phi(Q) \phi(K)^\top$), sliding-window attention, sparse attention patterns, and so on.
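To get a feel for these numbers, here is a quick back-of-envelope sketch of the memory needed just to materialize one fp32 attention matrix (the helper name and the single-head, 4-bytes-per-float assumptions are illustrative):

```python
def attn_matrix_bytes(n, n_heads=1, bytes_per_float=4):
    """Memory to materialize one n-by-n attention matrix per head (fp32)."""
    return n * n * n_heads * bytes_per_float

for n in (1_000, 100_000, 1_000_000):
    gib = attn_matrix_bytes(n) / 2**30
    print(f"n = {n:>9,}: {gib:,.3f} GiB per head")
```

At n = 1,000 the matrix is a few megabytes; at n = 100,000 it is tens of gigabytes per head per layer, which is why techniques like FlashAttention avoid materializing it at all.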
Causal masking
For language modeling, you only want a token to attend to tokens at earlier positions — the model shouldn't see the future. We enforce this by adding $-\infty$ to the upper-triangular entries of $QK^\top / \sqrt{d_k}$ before the softmax. Those entries become exactly zero in the softmax output, so information from future positions cannot flow back. Encoder-only models like BERT skip the mask; decoder-only models like GPT use it.
Source — Causal Attention Mask
function causal_mask(n):
# Returns lower-triangular boolean matrix [n, n]
# mask[i][j] = True if j <= i (token i may attend to token j)
mask = zeros(n, n, dtype=bool)
for i in 0..n-1:
for j in 0..i:
mask[i][j] = True
return mask
function apply_causal_mask(scores, mask):
# scores: [n, n] mask: [n, n] bool
# Set upper-triangular entries to -inf so they vanish in softmax
for i in 0..n-1:
for j in i+1..n-1:
scores[i][j] = -infinity
    return scores

import numpy as np
def causal_mask(n):
"""Lower-triangular boolean mask: mask[i, j] = (j <= i)."""
return np.tril(np.ones((n, n), dtype=bool))
def masked_scaled_dot_product_attention(Q, K, V):
"""Causal (decoder) attention — token i cannot see token j > i."""
n, d_k = Q.shape[0], Q.shape[1]
scores = Q @ K.T / np.sqrt(d_k) # (n, n)
mask = causal_mask(n)
scores = np.where(mask, scores, -1e9) # mask future positions
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # (n, d_v)

function causalMask(n) {
// mask[i][j] = true iff j <= i
return Array.from({length: n}, (_, i) =>
Array.from({length: n}, (_, j) => j <= i));
}
function maskedAttention(Q, K, V) {
  const n = Q.length, dK = K[0].length;
  const mask = causalMask(n);
  const scores = Q.map((qi, i) =>
    K.map((kj, j) => {
      const raw = qi.reduce((s, v, l) => s + v * kj[l], 0) / Math.sqrt(dK);
      return mask[i][j] ? raw : -1e9;
    })
  );
const weights = scores.map(row => {
const max = Math.max(...row);
const e = row.map(v => Math.exp(v - max));
const s = e.reduce((a, b) => a + b, 0);
return e.map(v => v / s);
});
return weights.map(wi =>
V[0].map((_, d) => wi.reduce((s, w, j) => s + w * V[j][d], 0)));
}

#include <math.h>
#include <stdbool.h>
/* Build causal mask: mask[i*n + j] = (j <= i) */
void causal_mask(int n, bool mask[n][n]) {
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
mask[i][j] = (j <= i);
}
/* Apply mask to scores in-place: set upper triangle to -1e9 */
void apply_causal_mask(int n, double scores[n][n]) {
for (int i = 0; i < n; i++)
for (int j = i + 1; j < n; j++)
scores[i][j] = -1e9;
}
/* Masked attention: Q,K[n][dk], V[n][dv], out[n][dv] */
void causal_attention(int n, int dk, int dv,
double Q[n][dk], double K[n][dk], double V[n][dv], double out[n][dv])
{
double scores[n][n];
double sq = sqrt((double)dk);
for (int i=0;i<n;i++) for(int j=0;j<n;j++){
double s=0; for(int l=0;l<dk;l++) s+=Q[i][l]*K[j][l];
scores[i][j] = (j<=i) ? s/sq : -1e9;
}
for(int i=0;i<n;i++){
double mx=scores[i][0]; for(int j=1;j<n;j++) if(scores[i][j]>mx) mx=scores[i][j];
double sm=0; for(int j=0;j<n;j++){scores[i][j]=exp(scores[i][j]-mx);sm+=scores[i][j];}
for(int j=0;j<n;j++) scores[i][j]/=sm;
}
for(int i=0;i<n;i++) for(int d=0;d<dv;d++){
double s=0; for(int j=0;j<n;j++) s+=scores[i][j]*V[j][d]; out[i][d]=s;
}
}

#include <vector>
#include <cmath>
#include <algorithm>
using namespace std;
using Mat = vector<vector<double>>;
Mat causalAttention(const Mat& Q, const Mat& K, const Mat& V) {
int n = Q.size(), dk = K[0].size(), dv = V[0].size();
double sq = sqrt((double)dk);
Mat scores(n, vector<double>(n));
for (int i=0;i<n;i++) for(int j=0;j<n;j++){
double s=0; for(int l=0;l<dk;l++) s+=Q[i][l]*K[j][l];
scores[i][j] = (j<=i) ? s/sq : -1e9;
}
for (auto& row : scores) {
double mx = *max_element(row.begin(), row.end());
double sm=0; for(auto& x:row){x=exp(x-mx);sm+=x;}
for(auto& x:row) x/=sm;
}
Mat out(n, vector<double>(dv, 0));
for(int i=0;i<n;i++) for(int d=0;d<dv;d++)
for(int j=0;j<n;j++) out[i][d]+=scores[i][j]*V[j][d];
return out;
}

public class CausalAttention {
public static boolean[][] causalMask(int n) {
boolean[][] m = new boolean[n][n];
for (int i=0;i<n;i++) for(int j=0;j<=i;j++) m[i][j]=true;
return m;
}
public static double[][] maskedAttention(double[][] Q, double[][] K, double[][] V) {
int n=Q.length, dk=K[0].length, dv=V[0].length;
double sq=Math.sqrt(dk);
double[][] scores=new double[n][n];
boolean[][] mask=causalMask(n);
for(int i=0;i<n;i++) for(int j=0;j<n;j++){
if(!mask[i][j]){scores[i][j]=-1e9;continue;}
double s=0; for(int l=0;l<dk;l++) s+=Q[i][l]*K[j][l];
scores[i][j]=s/sq;
}
for(double[] r:scores){
double mx=Double.NEGATIVE_INFINITY; for(double v:r) if(v>mx) mx=v;
double sm=0; for(int j=0;j<r.length;j++){r[j]=Math.exp(r[j]-mx);sm+=r[j];}
for(int j=0;j<r.length;j++) r[j]/=sm;
}
double[][] out=new double[n][dv];
for(int i=0;i<n;i++) for(int d=0;d<dv;d++)
for(int j=0;j<n;j++) out[i][d]+=scores[i][j]*V[j][d];
return out;
}
}

package attention
import "math"
func CausalAttention(Q, K, V [][]float64) [][]float64 {
n, dk, dv := len(Q), len(K[0]), len(V[0])
sq := math.Sqrt(float64(dk))
scores := make([][]float64, n)
for i := range scores {
scores[i] = make([]float64, n)
for j := 0; j < n; j++ {
if j > i { scores[i][j] = -1e9; continue }
s := 0.0
for l := 0; l < dk; l++ { s += Q[i][l] * K[j][l] }
scores[i][j] = s / sq
}
}
for _, row := range scores {
mx := row[0]; for _, v := range row { if v > mx { mx = v } }
sm := 0.0; for j := range row { row[j] = math.Exp(row[j] - mx); sm += row[j] }
for j := range row { row[j] /= sm }
}
out := make([][]float64, n)
for i := range out {
out[i] = make([]float64, dv)
for d := 0; d < dv; d++ { for j := 0; j < n; j++ { out[i][d] += scores[i][j] * V[j][d] } }
}
return out
}

Cross-attention
In encoder-decoder models (original transformer, T5, machine translation) the decoder has an extra attention layer where the queries come from the decoder and the keys/values come from the encoder. This lets the decoder attend to the input sequence while generating the output. Decoder-only LLMs don't use cross-attention — they concatenate prompt and completion into one sequence instead.
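To make the wiring concrete, here is a minimal NumPy sketch of a single cross-attention head (the function and weight names are illustrative, not from the original paper): queries are projected from the decoder states, while keys and values are projected from the encoder states.

```python
import numpy as np

def cross_attention(dec, enc, Wq, Wk, Wv):
    """dec: (m, d) decoder states; enc: (n, d) encoder states."""
    Q = dec @ Wq                              # (m, d_k) queries from the decoder
    K = enc @ Wk                              # (n, d_k) keys from the encoder
    V = enc @ Wv                              # (n, d_v) values from the encoder
    scores = Q @ K.T / np.sqrt(K.shape[1])    # (m, n): each decoder token scores every encoder token
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V                              # (m, d_v)
```

The only difference from self-attention is the source of K and V; no causal mask is needed here, because the full input sequence is legitimately visible while generating.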
Why attention wins
Three reasons attention replaced RNNs so completely: it parallelizes across the sequence (a natural fit for GPUs), it handles long-range dependencies in a single hop, and it scales cleanly to enormous sizes. The constant factor on the quadratic cost turned out to be small enough that scaling the model was a better investment than optimizing the architecture. "Attention is all you need" sounded hyperbolic in 2017; six years later it was effectively true.
11. Summary
Self-attention takes a sequence of tokens and produces a new sequence in which each output is a weighted average of all input values, with weights determined by the softmax of scaled dot products between learned queries and keys. Multi-head attention runs several such operations in parallel on different projections, giving the network multiple "views" of the sequence. Positional encoding injects order information that attention would otherwise ignore. Stacked with feedforward layers and residual connections, this is the transformer block — the building block of every modern large language model.
Where to go next
- → Backpropagation — how gradients flow through softmax and matrix multiplications.
- → Gradient Descent — the optimization side of training a transformer.
- Attention Is All You Need (Vaswani et al., 2017) — the original transformer paper.
- Jay Alammar: The Illustrated Transformer — the visual companion every ML engineer has read.
- The Annotated Transformer — the paper implemented line-by-line in PyTorch with commentary.
- Karpathy: Let's build GPT from scratch — two-hour live-coded walkthrough of a decoder-only transformer.