Retrieval-Augmented Generation

The cheapest and most widely deployed technique on the 2024–25 frontier. Instead of teaching a model every fact in your private corpus, you let it look things up at inference time. This page covers dense retrieval, hybrid search, rerankers, and the pipeline that powers every "chat with your docs" product.

Prereq: embeddings, LLMs · Time to read: ~18 min · Interactive figures: 1 · Code: NumPy, Python, SQL

1. The core idea

Foundation models know a lot, but they don't know your data. Your product docs, your codebase, yesterday's Slack messages, last month's customer tickets — none of that is in the pretraining corpus. The straightforward fix — fine-tune the model on your data — is expensive, slow to update, and leaks memorization. RAG chooses a different path:

  1. Keep the base model frozen.
  2. Put your data in an external store.
  3. At query time, search the store for passages relevant to the query.
  4. Paste those passages into the prompt.
  5. Let the model generate its answer conditioned on the retrieved context.

It's almost embarrassingly simple. It's also wildly effective, because modern LLMs are extremely good at the last step — given the right chunk of text in front of them, they extract the answer with high accuracy. The whole engineering game is "find the right chunk."

RAG was formalized by Lewis et al. (Facebook AI, 2020), but the technique builds on a much older line of open-domain QA research. Its explosion in 2023 was driven by two things: embeddings got good enough to find semantically relevant text without keyword overlap, and vector databases (and Postgres extensions like pgvector) made similarity search a one-command operation.

2. The pipeline

A standard 2024-era RAG pipeline has two phases. Indexing runs once per document update:

$$\text{docs} \xrightarrow{\text{chunk}} \text{passages} \xrightarrow{\text{embed}} \text{vectors} \xrightarrow{\text{insert}} \text{vector DB}$$

And querying runs per user request:

$$\text{query} \xrightarrow{\text{embed}} \mathbf{q} \xrightarrow{\text{search}} \text{top-}k \text{ passages} \xrightarrow{\text{prompt}} \text{LLM} \xrightarrow{} \text{answer}$$

Each arrow hides half a dozen tunable choices. Chunk size (256? 1024? paragraphs? sentences?). Embedding model (OpenAI, BGE, E5, Nomic?). Vector DB (pgvector, Weaviate, Pinecone, Qdrant?). Top-$k$ (3? 10? 50?). Reranker (yes/no, what kind?). Prompt template (pasted verbatim, summarized, cited?). The whole field of "RAG engineering" is about finding good settings for all of these and then monitoring them as your data changes.

3. Dense embeddings

The foundation of RAG is the embedding model. It maps a piece of text to a vector such that semantically similar texts map to nearby vectors. Training is typically contrastive: given a query $q$ and a relevant passage $p^+$, and some random irrelevant passages $p^-_1, \dots, p^-_n$, minimize:

$$\mathcal{L} = -\log \frac{\exp(\mathbf{q} \cdot \mathbf{p}^+ / \tau)}{\exp(\mathbf{q} \cdot \mathbf{p}^+ / \tau) + \sum_{i=1}^{n} \exp(\mathbf{q} \cdot \mathbf{p}^-_i / \tau)}$$

Contrastive embedding loss

$\mathbf{q}$
The query embedding — a $d$-dimensional vector (typically 384, 768, or 1536) produced by running the query through the encoder.
$\mathbf{p}^+$
A positive passage embedding — a passage that actually answers the query. During training, these are mined from labeled QA pairs or from (question, linked passage) data.
$\mathbf{p}^-_i$
Negative passages — random other passages in the batch, or "hard negatives" mined to look similar to the positive but be wrong.
$\tau$
Temperature. Small $\tau$ sharpens the softmax; large $\tau$ smooths it. Usually around 0.01–0.1.
$\mathbf{q} \cdot \mathbf{p}$
Inner product between the query and passage embeddings. If embeddings are L2-normalized this equals cosine similarity.

What it teaches the model: "Pull the positive pair together, push the negatives away." After a few hundred million such updates, the encoder learns to place semantically similar sentences close in vector space — regardless of the exact words. A query for "how to reset my password" will be close to a passage that starts with "If you've forgotten your credentials, click Forgot Password" even though no words overlap.
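As a concrete sketch, here is the loss above for a single (query, positive, negatives) triple in NumPy, assuming precomputed embeddings. Real training batches this across thousands of triples and backpropagates through the encoder; this only shows the arithmetic of the formula.

```python
import numpy as np

def contrastive_loss(q, p_pos, p_negs, tau=0.05):
    """InfoNCE loss for one (query, positive, negatives) triple.

    q:      (d,)   query embedding
    p_pos:  (d,)   positive passage embedding
    p_negs: (n, d) negative passage embeddings
    """
    # Similarity logits: the positive first, then the n negatives.
    logits = np.concatenate(([q @ p_pos], p_negs @ q)) / tau
    # Numerically stable log-softmax; loss = -log P(positive).
    logits -= logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

With a small $\tau$ the loss is near zero once the positive is clearly the closest passage, and large whenever any negative outscores it — exactly the "pull together, push away" behavior described above.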

Best-in-class 2024 open models (BGE-M3, E5-Mistral, Nomic Embed) achieve average MTEB scores in the high 60s across the benchmark's 56 tasks. For reference, 2020's BERT-based DPR scored in the low 40s. This is where most of the RAG-quality gains of the last two years have come from.

4. Cosine similarity and top-k

At query time you have a query vector $\mathbf{q}$ and a whole database of passage vectors $\mathbf{p}_1, \dots, \mathbf{p}_M$. You want the $k$ passages most similar to $\mathbf{q}$. The similarity metric is almost always cosine:

$$\text{sim}(\mathbf{q}, \mathbf{p}) = \frac{\mathbf{q} \cdot \mathbf{p}}{\|\mathbf{q}\| \, \|\mathbf{p}\|}$$

If all your vectors are pre-normalized to unit length, this simplifies to a dot product. For $M$ passages and a single query, a brute-force scan is $O(M d)$ — usually fast enough up to ~1M vectors on modern hardware. Beyond that, you use an approximate nearest neighbor (ANN) index: HNSW, IVF-PQ, ScaNN. These trade a small amount of recall (~95–99%) for orders of magnitude speedup, making billion-vector search tractable.
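The brute-force scan is a few lines of NumPy. One detail worth knowing: `np.argpartition` finds the $k$ best candidates in $O(M)$ rather than fully sorting all $M$ scores, so only the final $k$ get sorted. A minimal sketch, assuming pre-normalized vectors:

```python
import numpy as np

def topk_bruteforce(P, q, k=5):
    """Exact top-k by a full O(M d) scan.

    P: (M, d) unit-normalized passage vectors; q: (d,) unit query.
    Requires k < M.
    """
    scores = P @ q                          # cosine sim (unit vectors)
    top = np.argpartition(-scores, k)[:k]   # k best, unordered, O(M)
    return top[np.argsort(-scores[top])]    # sort only those k
```

An ANN index like HNSW replaces exactly this function, returning approximately the same indices much faster at large $M$.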

5. Interactive RAG demo

Below is a toy RAG index with 10 passages. Click a query to embed it (in this demo, into a hand-picked 2-D space) and see the top-3 retrieval. Notice how "password reset" matches "credentials" without any word overlap — that's dense retrieval doing its job.

[Interactive figure: top-3 retrieved passages for the selected query, shown in a 2-D embedding space.]

6. Sparse + dense hybrid

Dense retrieval is great at semantic similarity but bad at exact matches — proper nouns, rare tokens, product SKUs, error codes. Classical sparse retrieval (BM25) is the opposite: excellent on exact matches, blind to synonyms. The obvious move is to do both and combine:

$$\text{score}(\mathbf{q}, p) = \alpha \cdot \text{sim}_\text{dense}(\mathbf{q}, p) + (1 - \alpha) \cdot \text{BM25}(\mathbf{q}, p)$$

This is called hybrid search, and it almost always beats either component alone on real-world corpora. Modern vector DBs (Weaviate, Qdrant, pgvector + full-text) support hybrid search natively.
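A sketch of the weighted combination, assuming you already have the dense similarities and raw BM25 scores as aligned arrays. Because the two live on different numerical scales, this version min-max normalizes BM25 into $[0, 1]$ before mixing — one common convention, not the only one:

```python
import numpy as np

def hybrid_scores(dense_sims, bm25_scores, alpha=0.5):
    """alpha-weighted mix of dense and sparse scores.

    dense_sims:  (M,) cosine similarities, already bounded in [-1, 1]
    bm25_scores: (M,) raw BM25 scores, unbounded
    """
    b = bm25_scores - bm25_scores.min()   # shift to start at 0
    rng = b.max()
    if rng > 0:
        b = b / rng                       # min-max normalize to [0, 1]
    return alpha * dense_sims + (1 - alpha) * b
```

`alpha` is typically tuned on a small labeled query set; 0.5 is a reasonable starting point.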

Another approach: Reciprocal Rank Fusion (RRF). Rather than weighted-summing scores, you sum the reciprocals of each system's rank:

$$\text{RRF}(p) = \sum_i \frac{1}{k + \text{rank}_i(p)}$$

RRF is scale-invariant (BM25 and cosine scores live in different numerical ranges) and needs essentially no tuning — the smoothing constant $k$ is conventionally fixed at 60.
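RRF itself is a few lines. A sketch over any number of ranked lists of document ids, with ranks starting at 1 as in the formula above:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    rankings: list of lists, each ordered best-first.
    k=60 is the conventional constant from the original RRF paper.
    Returns doc ids ordered by fused score, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked near the top by both systems beats one ranked first by only one of them — the fusion rewards agreement, which is exactly why it works without score normalization.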

7. Rerankers

Top-$k$ retrieval, even with hybrid search, still returns a lot of noise. The next stage is a reranker: a small cross-encoder that takes (query, passage) pairs and produces a real relevance score. Unlike the retriever, which embeds query and passages separately, a reranker jointly encodes both and can model fine-grained interactions:

$$s_{\text{rerank}} = f_\theta\big(\text{[CLS]} \, q \, \text{[SEP]} \, p\big)$$

Rerankers are too expensive to apply to your whole corpus (each (query, passage) needs its own forward pass), so the pipeline is always "retrieve top 50 with a cheap encoder, rerank those 50 with a cross-encoder, keep top 5 for the LLM." The rerank step is slow but only runs on a handful of candidates. Cohere's Rerank, BGE-Reranker, and Voyage Rerank are the 2024 reference implementations.

8. Source code

A minimum-viable RAG in NumPy. No vector DB, no fancy chunking — just enough to show the pieces.

RAG · index + query
import numpy as np

def normalize(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-9)

def build_index(passages, embed_fn):
    # 1. Chunk (here we assume passages are already chunks).
    # 2. Embed each chunk.
    vecs = np.stack([embed_fn(p) for p in passages])
    vecs = normalize(vecs)
    return {"texts": passages, "vecs": vecs}

def retrieve(index, query, embed_fn, k=3):
    q = normalize(embed_fn(query))
    scores = index["vecs"] @ q                     # cosine sim
    top = np.argsort(-scores)[:k]
    return [(index["texts"][i], float(scores[i])) for i in top]

def rag_answer(llm, index, query, embed_fn, k=3):
    ctx = retrieve(index, query, embed_fn, k)
    prompt = "Answer using only the context below.\n\n"
    for i, (t, s) in enumerate(ctx, 1):
        prompt += f"[{i}] {t}\n\n"
    prompt += f"Question: {query}\nAnswer:"
    return llm(prompt)

RAG · retrieve + rerank
import torch
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import AutoModelForCausalLM, AutoTokenizer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")
reranker = CrossEncoder("BAAI/bge-reranker-base")

# Index (`my_corpus` is assumed: your iterable of pre-chunked passage strings)
passages = list(my_corpus)
vecs = embedder.encode(passages, normalize_embeddings=True,
                       convert_to_tensor=True)      # (M, d)

def query(q, k=5, pool=50):
    # 1. Retrieve the top `pool` candidates by cosine
    qv = embedder.encode(q, normalize_embeddings=True, convert_to_tensor=True)
    sims = (vecs @ qv).cpu().numpy()
    cand_idx = sims.argsort()[::-1][:pool]

    # 2. Rerank those with the cross-encoder
    pairs = [(q, passages[i]) for i in cand_idx]
    scores = reranker.predict(pairs)
    order  = scores.argsort()[::-1][:k]
    return [passages[cand_idx[i]] for i in order]

RAG · pgvector
-- pgvector: Postgres extension for dense retrieval

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE docs (
  id      bigserial PRIMARY KEY,
  text    text,
  embed   vector(768)
);

-- Index for fast top-k lookup
CREATE INDEX ON docs USING hnsw (embed vector_cosine_ops);

-- Insert after embedding in Python
INSERT INTO docs (text, embed) VALUES
  ('The user can reset their password from...', '[0.12, -0.03, ...]');

-- Query: pass in the query embedding as a parameter
SELECT text, 1 - (embed <=> :qvec) AS sim
FROM docs
ORDER BY embed <=> :qvec
LIMIT 5;

9. Summary

RAG keeps the base model frozen and moves your knowledge into a searchable index: chunk and embed the documents once, then per query embed the query, retrieve the top-$k$ passages (hybrid dense + sparse where exact matches matter), rerank a small candidate pool with a cross-encoder, and let the LLM answer from the retrieved context. Nearly all of the engineering leverage sits in retrieval quality — given the right chunk, the generation step mostly takes care of itself.

Further reading

  • Lewis et al. (2020) — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
  • Karpukhin et al. (2020) — Dense Passage Retrieval for Open-Domain Question Answering.
  • Gao et al. (2022) — Precise Zero-Shot Dense Retrieval without Relevance Labels (HyDE).
  • Edge et al. (2024) — From Local to Global: A Graph RAG Approach to Query-Focused Summarization.
  • BAAI (2023–24) — BGE & BGE-M3 technical reports on multi-functional embeddings.
NEXT UP
→ Neuro-Symbolic AI

RAG pulls text into an LLM at query time. Neuro-symbolic AI pulls structured reasoning into an LLM. Next deep dive.