RAG
- The problem RAG solves
- Architecture
- Indexing pipeline
- Retrieval strategies
- When RAG fails
- Key formula
[!note] Test note. RAG = solving the “LLM doesn’t know your private data” problem without fine-tuning.
The problem RAG solves
LLMs have a knowledge cutoff and no access to private data. Options:
- Fine-tuning — expensive, doesn’t update easily, doesn’t scale to dynamic data
- Long context — works but slow + expensive at query time
- RAG — retrieve relevant chunks at query time, stuff into context
RAG wins for dynamic / large knowledge bases.
Architecture
graph LR
Q[Query] --> E1[Embed query]
E1 --> VS[(Vector store)]
VS --> R[Top-k chunks]
R --> P[Prompt assembly]
Q --> P
P --> LLM[LLM]
LLM --> A[Answer]
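A minimal sketch of the query-time flow in the diagram, assuming the embedding model, vector store, and LLM are passed in as plain callables. The names `embed`, `search`, `llm`, `assemble_prompt`, and `answer` are hypothetical placeholders, not any particular library's API, and the prompt wording is only one possible choice.

```python
from typing import Callable, List, Sequence


def assemble_prompt(query: str, chunks: Sequence[str]) -> str:
    """Place the retrieved chunks ahead of the user question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


def answer(
    query: str,
    embed: Callable[[str], List[float]],              # hypothetical embedding call
    search: Callable[[List[float], int], List[str]],  # hypothetical vector-store lookup
    llm: Callable[[str], str],                        # hypothetical LLM call
    k: int = 4,
) -> str:
    query_vector = embed(query)              # Query -> Embed query
    chunks = search(query_vector, k)         # Vector store -> Top-k chunks
    prompt = assemble_prompt(query, chunks)  # Prompt assembly (query + chunks)
    return llm(prompt)                       # LLM -> Answer
```

Note that the query feeds both the embedding step and the prompt, matching the two arrows leaving `Q` in the diagram.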
Indexing pipeline
graph LR
D[Documents] --> C[Chunker]
C --> E2[Embed chunks]
E2 --> VS2[(Vector store)]
Key decisions at index time:
- Chunk size — 256–512 tokens typical. Too small = missing context. Too large = noise drowns signal.
- Overlap — 10–20% overlap between chunks means a sentence split at a chunk boundary still appears whole in at least one chunk.
- Embedding model — `text-embedding-3-small` is a good default. Multilingual corpora need a multilingual model.
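The chunk size and overlap decisions can be sketched as a sliding window. This is a rough illustration that assumes the text is already tokenized into a list; the function name `chunk_tokens` and the defaults (512 tokens, 64 overlap, about 12.5%) are illustrative choices within the ranges above, and a real pipeline would use the embedding model's own tokenizer.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 512, overlap: int = 64) -> List[Sequence[T]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here (an assumption).
words = "retrieve relevant chunks at query time and stuff them into context".split()
for chunk in chunk_tokens(words, chunk_size=5, overlap=1):
    print(" ".join(chunk))
```

Each window repeats the last `overlap` tokens of the previous one, so content near a boundary shows up in both neighbors.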
Retrieval strategies
| Strategy | When to use |
|---|---|
| Dense (vector similarity) | Semantic queries, natural language |
| Sparse (BM25/keyword) | Exact matches, IDs, codes |
| Hybrid (dense + sparse) | Best of both — use RRF to merge |
| Re-ranking | Run a cross-encoder after top-k to reorder |
[!tip] Hybrid retrieval almost always beats pure dense for real-world queries. BM25 catches exact keywords that embeddings miss.
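A small sketch of reciprocal rank fusion (RRF) for merging a dense ranking with a sparse one. Each document scores the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is a commonly used default, not something this note prescribes, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # hypothetical vector-search results
sparse_hits = ["c", "a", "d"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF uses only ranks, so the dense and sparse scores never need to be put on a common scale.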
When RAG fails
- Chunking cuts context — answer spans two chunks, neither chunk alone is enough
- Query-document mismatch — user asks in one style, docs written in another → use HyDE (generate hypothetical answer, embed that)
- Irrelevant chunks retrieved — retrieval precision low → add metadata filters, rerank
- LLM ignores retrieved context — prompt design issue (the instructions do not tell the model to rely on the context), or context window too noisy
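The HyDE fix for query-document mismatch in the list above can be sketched as follows, assuming an LLM and an embedding model are available as callables. `generate`, `embed`, and `hyde_query_vector` are hypothetical names, and the prompt text is just one plausible wording.

```python
from typing import Callable, List


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],       # hypothetical LLM call
    embed: Callable[[str], List[float]],  # hypothetical embedding call
) -> List[float]:
    """Embed a hypothetical answer instead of the raw query."""
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\n"
        "Passage:"
    )
    hypothetical_answer = generate(prompt)
    # The generated passage may be factually wrong; it is used only because
    # its style resembles the indexed documents more than the question does.
    return embed(hypothetical_answer)
```

The returned vector is then used for the vector-store lookup in place of the plain query embedding.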
Key formula
Similarity search uses cosine similarity:
\[\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \|d\|}\]
where $q$ and $d$ are the query and document embedding vectors.
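The formula translates directly into code. The sketch below uses plain Python lists, returns 0.0 for a zero vector (a convention chosen here, since the formula is undefined in that case), and adds a brute-force `top_k` helper. Both function names are illustrative; a real vector store would use an index rather than scanning every document.

```python
import math
from typing import List, Sequence


def cosine_similarity(q: Sequence[float], d: Sequence[float]) -> float:
    """sim(q, d) = (q . d) / (||q|| * ||d||)"""
    if len(q) != len(d):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_q * norm_d)


def top_k(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]], k: int = 3) -> List[int]:
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

Cosine similarity depends only on direction: scaling either vector leaves the score unchanged.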