Word Embeddings
How a bag of strings became a cloud of vectors — and how, in that cloud, king − man + woman suddenly landed on queen.
1. From symbols to vectors
A neural network can't ingest the string "cat" directly. It needs a vector of numbers. The question is: which numbers? The answer you choose determines what the network can learn, how efficiently, and whether the resulting model generalizes. For the first few decades of NLP, the answer was "an index into a vocabulary" or equivalently "a one-hot vector." The jump to dense learned vectors — embeddings — is one of those small conceptual moves that ends up being load-bearing for an entire field.
2. The one-hot baseline
Suppose your vocabulary has $V = 10{,}000$ words. The one-hot encoding of the $i$-th word is a vector in $\mathbb{R}^V$ with a 1 in position $i$ and zeros everywhere else:

$$\mathbf{x}_i = (0, \dots, 0, \underbrace{1}_{\text{position } i}, 0, \dots, 0) \in \mathbb{R}^V$$
This is technically a valid feature vector. You can feed it into a linear classifier. But notice two catastrophes:
- No similarity structure. The dot product of "cat" with "dog" is 0. The dot product of "cat" with "quasar" is also 0. Every word is equidistant from every other word. Nothing about the encoding says "cat and dog are similar animals."
- Huge and sparse. A 10,000-dim vector per word, of which 9,999 entries are zero. If you want to feed a 30-word sentence into a model, you're at 300,000 coordinates, almost all of them empty.
A dense embedding replaces the 10,000-dim sparse one-hot with, say, a 300-dim real vector where every coordinate is a learned float. Related words end up close together in this space. That's the whole idea. The question is: how do you learn the vectors?
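To make the two catastrophes concrete, here's a quick sketch (the vocabulary size and dimension follow the running example; the dense matrix is random noise standing in for learned weights):

```python
import numpy as np

V, d = 10_000, 300

def one_hot(i, V):
    x = np.zeros(V)
    x[i] = 1.0
    return x

# One-hot: every pair of distinct words is orthogonal, hence equidistant.
cat, dog, quasar = one_hot(0, V), one_hot(1, V), one_hot(2, V)
print(cat @ dog, cat @ quasar)  # 0.0 0.0 — no similarity signal at all

# Dense embedding: a (V, d) matrix of floats, one row per word.
# (Random stand-ins here; training is what pulls "cat" toward "dog".)
E = np.random.randn(V, d) * 0.01
v_cat, v_dog = E[0], E[1]
print(float(v_cat @ v_dog))     # a graded, trainable similarity
```

The dot product between dense rows is a continuous quantity the training procedure can shape, which is exactly what the one-hot encoding forbids.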
3. The distributional hypothesis
"You shall know a word by the company it keeps." — J. R. Firth, 1957
This is the core insight. Words that appear in similar contexts tend to have similar meanings. "cat" and "dog" both show up near "pet," "food," "tail," "bark/meow," "vet." "cat" and "quasar" do not. If you can build a representation of a word from the contexts it shows up in, you're implicitly encoding its meaning.
Classical NLP implemented this with co-occurrence matrices: count how often each word appears near each other word, then decompose the resulting matrix with SVD to get dense vectors. That works, but it's expensive and doesn't scale to billion-word corpora. The modern alternative is to frame it as a small prediction task that a neural network can train on.
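The classical pipeline fits in a few lines. A toy sketch (the corpus, the window of 1, and all names are mine, chosen only to illustrate the count-then-decompose idea):

```python
import numpy as np

corpus = ["the cat sat", "the dog sat", "the cat meowed", "the dog barked"]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts with a window of 1 word on each side.
C = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                C[idx[w], idx[words[j]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
U, S, _ = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * S[:k]   # one k-dim vector per word
```

Even on this toy corpus, "cat" and "dog" end up with similar rows of $C$ because they share contexts ("the", "sat") — which is the distributional hypothesis made mechanical. The cost is that $C$ is $V \times V$, which is what stops this from scaling.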
4. Word2Vec
In 2013, Tomáš Mikolov and colleagues at Google published Word2Vec — a pair of simple shallow networks (CBOW and Skip-gram) that could learn high-quality word embeddings from billions of tokens in hours. The results were surprising enough to reshape the field: the resulting vectors weren't just "similar words are close," they had linear algebraic structure. You could do arithmetic on them. king − man + woman ≈ queen. Paris − France + Italy ≈ Rome. Nobody had asked for this to happen; it fell out of the training objective.
Skip-gram with negative sampling
The Skip-gram variant is the one most people run. Here's the setup:
- For each word $w_t$ in your corpus, look at its context window — say, the 5 words before and 5 after.
- For each (center word, context word) pair, train a model to predict that this pairing is a real one from the corpus, not a random accident.
- To give the model negative examples, sample some random "context" words that definitely weren't near $w_t$, and train it to predict those as fake.
That's it. You don't need a language model, you don't need parse trees, you don't need supervised labels. All you need is a big text corpus.
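The data pipeline those three bullets describe can be sketched directly (the window logic and the unigram-to-the-3/4 negative-sampling distribution follow the original paper; function and variable names are mine):

```python
import numpy as np

def skipgram_pairs(tokens, window=5):
    """Yield (center, context) token-id pairs from one sentence."""
    for t, center in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                yield center, tokens[c]

def negative_sampler(counts, k=5, seed=0):
    """Sample k negatives from the unigram distribution raised to 3/4."""
    p = np.asarray(counts, dtype=float) ** 0.75
    p /= p.sum()
    rng = np.random.default_rng(seed)
    return lambda: rng.choice(len(p), size=k, p=p)

tokens = [0, 1, 2, 3, 4, 5]   # token ids for one short sentence
pairs = list(skipgram_pairs(tokens, window=2))
sample_negs = negative_sampler(counts=[100, 50, 10, 10, 5, 5], k=5)
```

The 3/4 exponent flattens the unigram distribution slightly, so frequent words are sampled as negatives often (they need to be pushed away a lot) but not overwhelmingly.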
Math, unpacked
Each word $w$ gets two learned vectors: a target vector $\mathbf{v}_w \in \mathbb{R}^d$ (used when $w$ is the center word) and a context vector $\mathbf{u}_w \in \mathbb{R}^d$ (used when $w$ is in the surrounding window). Typical $d$ is 100–300.
Given a center word $w_t$ and a context word $w_c$, the probability that they were a real co-occurrence is modelled as:

$$P(\text{real} \mid w_t, w_c) = \sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c})$$
The full negative-sampling loss for a single positive pair $(w_t, w_c)$ with $k$ negatives $w_n^{(1)}, \dots, w_n^{(k)}$ is:

$$\mathcal{L} = -\log \sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c}) - \sum_{i=1}^{k} \log \sigma\!\left(-\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_n^{(i)}}\right)$$
Skip-gram loss, piece by piece
- $w_t$
- Center (target) word — the one we're currently looking at.
- $w_c$
- A positive context word — observed within a window of $w_t$ in the corpus.
- $w_n^{(i)}$
- A negative context word — drawn at random from the vocabulary (biased slightly toward frequent words). These are assumed not to be near $w_t$; they serve as "wrong answers" the model must learn to reject.
- $\mathbf{v}_{w_t}$
- Target embedding of $w_t$ — a row of the "input" embedding matrix. This is the vector you'll use at inference time: it's the representation of the word when it's the thing you're asking about.
- $\mathbf{u}_{w_c}$
- Context embedding of $w_c$ — a row of the "output" embedding matrix. A separate set of vectors, used only during training. Most people throw them away after training and keep only $\mathbf{v}$.
- $\sigma$
- Logistic sigmoid: $\sigma(z) = 1/(1+e^{-z})$. Squashes the dot product into a probability in $(0, 1)$.
- $\sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c})$
- Model's estimated probability that $w_c$ really co-occurred with $w_t$. We want this near 1 for positive pairs.
- $\sigma(-\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_n^{(i)}})$
- Same thing but with a sign flip — this equals $1 - \sigma(\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_n^{(i)}})$, the model's estimated probability that the negative pair is fake. We want this near 1 too.
- $k$
- Number of negative samples per positive pair. Typically 5–20. More negatives → slower training, better embeddings.
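Putting the pieces together numerically — a sketch with made-up vectors (dimensions follow the text; the values are random, not trained) that computes the loss for one positive pair with $k$ negatives and checks the sign-flip identity $\sigma(-z) = 1 - \sigma(z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, k = 100, 5
v_t = rng.standard_normal(d) * 0.1       # target vector of the center word
u_c = rng.standard_normal(d) * 0.1       # context vector of the positive word
U_n = rng.standard_normal((k, d)) * 0.1  # context vectors of the k negatives

# L = -log σ(v·u_c) - Σ_i log σ(-v·u_n_i)
loss = -np.log(sigmoid(v_t @ u_c)) - np.log(sigmoid(-(U_n @ v_t))).sum()
print(float(loss))  # strictly positive; driven toward 0 by training

# Sign-flip identity used in the glossary above: σ(-z) = 1 - σ(z)
z = U_n @ v_t
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))  # True
```

Training lowers this loss by raising $\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_c}$ and lowering each $\mathbf{v}_{w_t} \cdot \mathbf{u}_{w_n^{(i)}}$.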
Analogy: Imagine a dating app where you're shown pairs of words and asked "did these two hang out at a party in your corpus?" For positive pairs ("cat", "purr"), the model should say yes. For negative pairs ("cat", "platypus") — assuming the random draw really missed — it should say no. After billions of rounds, the word that consistently gets "yes" when paired with "purr" is "cat," and the vector for "cat" has drifted into a neighborhood where the dot products with "purr," "meow," "fur" are all high. Words that describe similar things end up in similar neighborhoods, because they pass the same tests.
Training is straightforward stochastic gradient descent on this loss. Because there are only dot products and sigmoids, it's very fast — you can train on a 100-million-word corpus in minutes on a single CPU.
5. The geometry of meaning
Here's the surprising thing. After you've trained Word2Vec on enough text, the vector space has structure you didn't explicitly put there. Measure any two embeddings by cosine similarity and nearest neighbors are semantically related: nearest(cat) ≈ [dog, kitten, feline, pet, puppy].
But more than that: certain directions in the space correspond to meaningful attributes. The vector $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}}$ and the vector $\mathbf{v}_{\text{queen}} - \mathbf{v}_{\text{woman}}$ are approximately parallel. That means there's a "royalty" direction in the space, learned without any explicit supervision, such that subtracting "man" and adding "woman" roughly moves you along the gender direction while preserving royalty.
This result was the headline of the Word2Vec paper. It worked for many analogies: capital-country, verb-tense, plurals, gendered nouns, even some factual relations. It didn't work for everything — analogies involving very rare words broke down, and "bias" directions (gender stereotypes, racial bias) also emerged from the same mechanism and had to be studied and mitigated.
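The analogy test itself is just vector arithmetic plus a cosine nearest-neighbor lookup. A sketch with hand-placed 2-D vectors (the positions are mine, chosen so the parallelogram closes; real embeddings only do this approximately):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Roughly: x-axis ≈ "maleness", y-axis ≈ "royalty".
vecs = {
    "king":   np.array([1.0, 1.0]),
    "queen":  np.array([0.1, 1.0]),
    "man":    np.array([1.0, 0.1]),
    "woman":  np.array([0.1, 0.1]),
    "apple":  np.array([0.9, 0.2]),
    "banana": np.array([0.8, 0.3]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # A − B + C
# Standard trick: exclude the three query words from the candidates,
# since the nearest vector to the target is often a query word itself.
candidates = {w: v for w, v in vecs.items() if w not in ("king", "man", "woman")}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

Real evaluations (e.g. the Google analogy test set) run exactly this loop over trained 300-dim vectors instead of hand-placed 2-D ones.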
Interactive: explore a 2-D embedding space
Below is a hand-crafted 2-D embedding of 16 words, designed to illustrate the geometry. Click two words to see their cosine similarity, or click the analogy button to compute A − B + C and find the nearest word.
Try an analogy
Click "king" then "man" then "woman" then press A − B + C. The yellow arrow will point at the result and you'll see how close it lands to "queen."
You can also just click any two words to see their cosine similarity.
6. Source code
A minimal Skip-gram with negative sampling, in three versions.
import numpy as np

V, d, k, lr = 10000, 100, 5, 0.025
V_in  = np.random.randn(V, d) * 0.01   # target ("input") matrix
V_out = np.random.randn(V, d) * 0.01   # context ("output") matrix

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgns_update(center, context, neg_samples):
    v     = V_in[center]        # (d,)
    u_pos = V_out[context]      # (d,)
    u_neg = V_out[neg_samples]  # (k, d)
    # positive term
    p_pos = sigmoid(v @ u_pos)
    grad_pos = p_pos - 1        # dL/d(v · u_pos)
    # negative terms
    scores_neg = u_neg @ v      # (k,)
    p_neg = sigmoid(scores_neg)
    grad_neg = p_neg            # dL/d(v · u_neg_i)
    # update center vector (sum of all gradients flowing in)
    dv = grad_pos * u_pos + grad_neg @ u_neg
    V_in[center] -= lr * dv
    # update context vectors
    V_out[context] -= lr * grad_pos * v
    V_out[neg_samples] -= lr * grad_neg[:, None] * v
import torch, torch.nn as nn, torch.nn.functional as F

class SkipGramNS(nn.Module):
    def __init__(self, V, d):
        super().__init__()
        self.v_in  = nn.Embedding(V, d)
        self.v_out = nn.Embedding(V, d)
        nn.init.uniform_(self.v_in.weight, -0.5 / d, 0.5 / d)
        nn.init.zeros_(self.v_out.weight)

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, k)
        v   = self.v_in(center)      # (B, d)
        u_p = self.v_out(context)    # (B, d)
        u_n = self.v_out(negatives)  # (B, k, d)
        pos_score = (v * u_p).sum(dim=1)                       # (B,)
        neg_score = torch.bmm(u_n, v.unsqueeze(2)).squeeze(2)  # (B, k)
        loss = -F.logsigmoid(pos_score).mean() \
               - F.logsigmoid(-neg_score).mean()
        return loss
# The one everyone actually uses in practice
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"], ...]
model = Word2Vec(
    sentences,
    vector_size=300,
    window=5,
    min_count=5,
    sg=1,          # skip-gram (not CBOW)
    negative=10,   # negative samples per positive
    workers=8,
    epochs=5,
)
print(model.wv.most_similar("cat"))
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"]))
7. Beyond Word2Vec
Word2Vec was 2013. The embedding story has moved on several times since.
- GloVe (2014) — Stanford's alternative, explicitly fitting vectors to match log co-occurrence ratios. Similar quality, different math.
- FastText (2016) — Represents each word as a sum of character n-gram embeddings. Handles out-of-vocabulary words (subword units) and does better on morphologically rich languages.
- ELMo (2018) — Instead of one vector per word, a bi-LSTM language model produces a contextual embedding: the vector for "bank" is different in "river bank" vs. "central bank." This was the beginning of the end for static embeddings.
- BERT (2018) and every LLM since — every token at every position produces its own contextual embedding as a hidden state of a Transformer. "Word embedding" in the modern sense usually means "the contextualized hidden state of a particular token."
- Sentence / document embeddings — Sentence-BERT, OpenAI text-embedding-3, Cohere embeddings: produce a single vector per sentence or document, trained so similar meanings map to close vectors. These power semantic search, RAG, clustering, and nearest-neighbor recommendation.
But the core idea — words are points in a learned real-valued space, where geometry encodes meaning — is unchanged from 2013. Every modern LLM still starts by converting tokens into dense vectors via an embedding matrix. That matrix is the first thing the Transformer sees, and in models that tie input and output weights it is also the matrix used to score the next token. Embeddings are the bridge between discrete symbols and continuous geometry, and every neural model of language crosses that bridge first.
8. Summary
- Embeddings replace one-hot symbol encodings with dense, learned vectors that capture similarity.
- They're justified by the distributional hypothesis: words with similar contexts have similar meanings.
- Word2Vec (Mikolov et al. 2013) learns embeddings by training a shallow network to distinguish real (word, context) pairs from random negatives — the skip-gram with negative sampling loss.
- The resulting vector space has surprising linear structure: analogies like $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ work as vector arithmetic.
- Static embeddings were superseded by contextual embeddings (ELMo, BERT), where each token's vector depends on its context. But every LLM still has an embedding matrix at the bottom of the stack.
Further reading
- Mikolov et al. (2013) — Efficient Estimation of Word Representations in Vector Space.
- Mikolov et al. (2013) — Distributed Representations of Words and Phrases and their Compositionality (the negative-sampling paper).
- Pennington, Socher & Manning (2014) — GloVe: Global Vectors for Word Representation.
- Bojanowski et al. (2016) — Enriching Word Vectors with Subword Information (FastText).
- Peters et al. (2018) — Deep Contextualized Word Representations (ELMo).