RAG

[!note] Test note. RAG = solving the “LLM doesn’t know your private data” problem without fine-tuning.

The problem RAG solves

LLMs have a knowledge cutoff and no access to private data. Options:

  1. Fine-tuning — expensive, doesn’t update easily, doesn’t scale to dynamic data
  2. Long context — works but slow + expensive at query time
  3. RAG — retrieve relevant chunks at query time, stuff into context

RAG wins for dynamic / large knowledge bases.

Architecture

graph LR
    Q[Query] --> E1[Embed query]
    E1 --> VS[(Vector store)]
    VS --> R[Top-k chunks]
    R --> P[Prompt assembly]
    Q --> P
    P --> LLM[LLM]
    LLM --> A[Answer]
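
The query path in the diagram can be sketched in a few lines of Python. This is a toy illustration, not a production setup: `embed` is a hypothetical stand-in (normalized word counts) for a real embedding model, the "vector store" is a plain list, and `similarity`, `top_k` and `build_prompt` are illustrative names rather than any library's API.

```python
import math
import re
from collections import Counter

def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalized word counts."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors (equals cosine here: both are normalized)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def top_k(query_vec, store, k=2):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Prompt assembly: retrieved chunks first, then the user's question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Made-up example chunks; a real store would be filled by the indexing pipeline.
chunks = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
]
store = [(c, embed(c)) for c in chunks]  # in-memory "vector store"

query = "How long is the refund window?"
prompt = build_prompt(query, top_k(embed(query), store, k=1))
print(prompt)  # this string is what would be sent to the LLM
```

The final LLM call is left out on purpose: everything RAG-specific happens before it, and the model only ever sees the assembled prompt.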

Indexing pipeline

graph LR
    D[Documents] --> C[Chunker]
    C --> E2[Embed chunks]
    E2 --> VS2[(Vector store)]

Key decisions at index time:

  • Chunk size: 256–512 tokens is typical. Too small and chunks lose context; too large and noise drowns the signal.
  • Overlap: 10–20% overlap between adjacent chunks, so a sentence cut at one chunk boundary still appears whole in the neighbouring chunk.
  • Embedding model: text-embedding-3-small is a good default. Multilingual content needs a multilingual model.
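
Chunk size and overlap can be illustrated with a minimal fixed-window chunker. This sketch assumes whitespace-separated words as a rough proxy for tokens (a real pipeline would count tokens with the embedding model's tokenizer), and `chunk_text` is a hypothetical helper name. The defaults give 32/256 = 12.5% overlap.

```python
def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demo: 10 "tokens", windows of 4 with 1 word shared between neighbours.
demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(demo, chunk_size=4, overlap=1):
    print(chunk)
```

In the demo, the last word of each chunk reappears as the first word of the next one, which is the overlap at work.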

Retrieval strategies

| Strategy | When to use |
|---|---|
| Dense (vector similarity) | Semantic queries, natural language |
| Sparse (BM25/keyword) | Exact matches, IDs, codes |
| Hybrid (dense + sparse) | Best of both; use RRF to merge |
| Re-ranking | Run a cross-encoder after top-k to reorder |

[!tip] Hybrid retrieval almost always beats pure dense for real-world queries. BM25 catches exact keywords that embeddings miss.
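
Reciprocal rank fusion (RRF) merges the dense and sparse result lists using only their ranks: each document scores the sum of 1 / (k + rank) over every list it appears in. A minimal sketch follows. The constant k = 60 is the commonly used default, the document IDs are made up, and `rrf_merge` is an illustrative name.

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: score(d) = sum of 1 / (k + rank of d), ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["doc_a", "doc_b", "doc_c"]   # hypothetical vector-search results
sparse = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
print(rrf_merge([dense, sparse]))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale, which is the main practical appeal of RRF.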

When RAG fails

  • Chunking cuts context: the answer spans two chunks and neither chunk alone is enough
  • Query-document mismatch: user asks in one style, docs are written in another → use HyDE (generate a hypothetical answer, embed that instead of the query)
  • Irrelevant chunks retrieved: retrieval precision is low → add metadata filters, rerank
  • LLM ignores retrieved context: a prompt assembly issue (the instructions don't make the model rely on the context), or the context window is too noisy
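
The HyDE step can be sketched as a small wrapper around the usual query embedding. This is only an outline under assumptions: `generate` and `embed` are placeholders for whichever LLM and embedding client is in use, the prompt wording is made up, and the two `fake_*` functions exist only so the example runs without any external service.

```python
from typing import Callable, Sequence

def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Sequence[float]:
    """Embed a hypothetical answer instead of the raw query (HyDE)."""
    hypothetical = generate(f"Write a short passage that answers the question:\n{query}")
    return embed(hypothetical)

# Stand-ins so the sketch is runnable; a real setup would call actual models here.
def fake_generate(prompt: str) -> str:
    return "Refunds are accepted within 30 days of delivery."

def fake_embed(text: str) -> list[float]:
    return [float(len(text)), float(text.count(" "))]

vec = hyde_query_vector("can i send it back?", fake_generate, fake_embed)
print(vec)  # this vector, not the query's own embedding, goes to the vector store
```

The idea is that the hypothetical answer is phrased like the documents, so its embedding tends to land closer to the relevant chunks than the embedding of a terse or colloquial question.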

Key formula

Similarity search uses cosine similarity:

\[\text{sim}(q, d) = \frac{q \cdot d}{\|q\| \|d\|}\]

where $q$ and $d$ are the query and document embedding vectors.
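
As a quick check of the formula, here is a plain-Python version with three hand-computable cases. It is a minimal sketch: real systems compute this inside the vector store, usually over pre-normalized vectors, and `cosine_similarity` is just an illustrative helper.

```python
import math

def cosine_similarity(q: list[float], d: list[float]) -> float:
    """sim(q, d) = (q . d) / (||q|| * ||d||)"""
    if len(q) != len(d):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_d)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # dot = 24, norms = 5 and 5
```

The first case shows that vector length does not matter, only direction: scaling $d$ by 2 leaves the similarity unchanged.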

