RAG Architecture

Retrieval-Augmented Generation (RAG) grounds LLM responses in external knowledge by retrieving relevant documents at query time and including them in the prompt context. It combines the fluency of generative models with the accuracy and recency of a searchable knowledge base, without the cost and latency of fine-tuning.

Overview

RAG was introduced in a 2020 paper by Facebook AI Research (Lewis et al.) as a way to give language models access to external knowledge without baking it into the model's weights. The core idea is simple: before the LLM generates a response, retrieve relevant documents from a knowledge base and include them in the prompt. This gives the model access to information that may be more recent, more accurate, or more domain-specific than what it learned during pretraining.

The RAG pipeline has three stages. First, offline indexing: documents are split into chunks (typically 200-1000 tokens), each chunk is embedded into a dense vector using an embedding model (OpenAI's text-embedding-3 family, Cohere embed-v3, or open-source models like BGE/E5), and the vectors are stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector, or Elasticsearch with dense vector support). Second, online retrieval: the user's query is embedded using the same embedding model, so that query and chunk vectors share one vector space, and the vector database returns the top-K most similar chunks via approximate nearest neighbor (ANN) search. Third, generation: the retrieved chunks are formatted into the LLM's prompt (typically as a 'context' section before the user's question), and the LLM generates a response grounded in this context.
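The first two stages can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain Python list, and the search is exact rather than ANN. All function names are hypothetical, and the generation stage is left out because it needs a model call.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse bag-of-words vector.

    A real system would call a dense embedding model here instead.
    """
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Stage 1 (offline): embed every chunk once and store (vector, chunk) pairs."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index: list[tuple[Counter, str]], query: str, k: int = 3) -> list[str]:
    """Stage 2 (online): embed the query with the same function, return the top-k chunks.

    This is an exact scan; a vector database would use ANN search instead.
    """
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Calling `retrieve(build_index(chunks), question, k=3)` returns the chunks that stage three would place in the prompt.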

Chunking strategy is the most underestimated design decision in RAG. Naive fixed-size chunking (split every 500 tokens) breaks documents at arbitrary points, potentially splitting a critical paragraph across two chunks. Semantic chunking (split at paragraph or section boundaries) preserves coherence but produces variable-size chunks. Hierarchical chunking (store both parent sections and child paragraphs, retrieve at the child level, expand to the parent for context) balances precision and coherence. Overlapping chunks (50-100 token overlap) ensure information at chunk boundaries is not lost. The optimal strategy depends on the document type: code needs function-level chunks, legal contracts need clause-level, and research papers need section-level.
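Two of the strategies above can be compared with a short sketch. It assumes tokens can be approximated by whitespace-separated words and that paragraphs are separated by blank lines; a real pipeline would count tokens with the embedding model's own tokenizer. The function names and defaults are illustrative.

```python
def fixed_size_chunks(tokens: list, size: int = 500, overlap: int = 50) -> list[list]:
    """Naive chunking: windows of `size` tokens that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def paragraph_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Semantic chunking: pack whole paragraphs into chunks and never split a paragraph.

    A single paragraph longer than max_tokens becomes its own oversized chunk.
    """
    chunks, current, current_len = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())  # whitespace word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The first function always cuts at a fixed offset, so a paragraph can straddle two chunks. The second produces variable-size chunks but keeps every paragraph intact.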

Retrieval quality is the bottleneck. Pure semantic search (vector similarity) excels at finding conceptually related content but can miss exact keyword matches ('error code ERR_42'). Pure keyword search (BM25) finds exact matches but misses semantic relationships ('car' vs. 'automobile'). Hybrid search combines both: run BM25 and vector search in parallel, merge the two result lists using reciprocal rank fusion (RRF), and optionally re-score the merged candidates with a cross-encoder re-ranker. In production RAG systems, hybrid search with re-ranking typically improves answer quality by 15-30% over vector-only search, though the gain depends on the corpus and on how quality is measured. Additional retrieval enhancements include metadata filtering (restrict search to documents from a specific source or date range), query expansion (rewrite the user query into multiple search queries), and multi-hop retrieval (use initial results to formulate follow-up queries for complex questions).
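Reciprocal rank fusion itself is small enough to show in full. The sketch below assumes each retriever returns an ordered list of document IDs; the constant `k=60` is the value commonly used with RRF, not something the text above prescribes, and the document IDs in the usage note are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score means better.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, fusing a BM25 list `["err42", "intro", "faq"]` with a vector list `["faq", "err42", "setup"]` ranks `err42` first, because it sits near the top of both lists, followed by `faq`. RRF uses only ranks, so the BM25 and cosine scores never need to be put on a common scale.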

Key Points
  1. RAG decouples knowledge from the model: updating the knowledge base does not require retraining or fine-tuning the LLM. This makes it the most cost-effective way to give an LLM access to proprietary, recent, or frequently changing information.
  2. Chunking strategy dramatically affects retrieval quality. Fixed-size chunks are simple but break semantic boundaries. Semantic chunking (paragraph/section boundaries), hierarchical chunking (parent-child), and overlapping chunks each suit different document types. Chunk size of 256-512 tokens with 50-token overlap is a reasonable default.
  3. Hybrid search (vector similarity + BM25 keyword search) with cross-encoder re-ranking typically improves answer quality by 15-30% over vector-only retrieval. Vector search finds semantically similar content; BM25 finds exact keyword matches. Reciprocal rank fusion merges the result lists.
  4. The context window is a finite resource. Stuffing too many retrieved chunks wastes tokens and can confuse the LLM. Retrieve 10-20 candidates, re-rank them, and include only the top 3-5 most relevant chunks in the prompt. A well-chosen 2,000-token context usually outperforms a noisy 10,000-token context.
  5. Hallucination despite retrieval occurs when the LLM ignores or misinterprets the retrieved context. Mitigations include citing sources (ask the LLM to quote the passage it based its answer on), chain-of-thought reasoning (force the LLM to reason over the retrieved text step by step), and answer validation (a second LLM call checks whether the answer is supported by the retrieved context).
  6. Embedding model choice matters as much as LLM choice. Domain-specific embedding models (fine-tuned on your document corpus) can improve retrieval accuracy by 10-20% over general-purpose embeddings. The embedding model and chunking strategy should be evaluated independently of the LLM using retrieval metrics (recall@K, MRR).
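The two retrieval metrics named in the last key point are straightforward to compute once an evaluation set exists. The sketch below assumes each evaluation item is a pair of (retrieved document IDs in rank order, set of relevant document IDs); the function names are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all (retrieved, relevant) pairs in the evaluation set."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)
```

Because these functions take only document IDs, they can be run against the retriever alone, without an LLM call.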
Simple Example

Customer Support RAG System

A SaaS company's customer support bot uses RAG to answer questions from a 5,000-page documentation corpus. When a user asks 'How do I reset my API key?', the query is embedded and the vector database retrieves the 3 most relevant documentation chunks (from the 'API Key Management' page). These chunks are injected into the LLM prompt: 'Based on the following documentation, answer the user question: [chunks] Question: How do I reset my API key?' The LLM generates a step-by-step answer grounded in the actual docs, with a link to the source page. Without RAG, the LLM might hallucinate a procedure that does not match the product's actual UI.
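The prompt-assembly step in this example might look like the sketch below. The chunk schema (a dict with `text` and `url` keys) and the extra instructions about citing sources and admitting missing answers are assumptions added for illustration; only the opening sentence of the template comes from the example above.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks and the user question into a single prompt string.

    Each chunk is a dict with 'text' and 'url' keys (hypothetical schema).
    """
    context = "\n\n".join(
        f"[{i}] {chunk['text']}\n(Source: {chunk['url']})"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Based on the following documentation, answer the user question. "
        "Cite sources by their bracketed number. "
        "If the documentation does not contain the answer, say so.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
```

Numbering the chunks gives the model a compact way to cite, and keeping the source URL next to each chunk lets the application turn a citation back into a link.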

Real-World Examples

OpenAI (ChatGPT with Browse & File Search)

ChatGPT's file search and retrieval capabilities use RAG under the hood. When a user uploads documents or enables browsing, the system chunks the content, generates embeddings, and retrieves relevant passages to ground the LLM's responses. OpenAI's Assistants API exposes RAG as a first-class feature: developers upload files, and the API automatically handles chunking, embedding, vector storage, and retrieval with hybrid search.

Notion AI

Notion AI uses RAG to answer questions about a user's workspace. When a user asks 'What was decided in last week's product meeting?', the system retrieves relevant pages and blocks from the user's Notion workspace using a combination of semantic search and metadata filtering (page type, date, workspace). Retrieval respects Notion's permission model -- the LLM only sees content the user has access to. The system uses a multi-stage retrieval pipeline: coarse vector search, then fine-grained re-ranking with a cross-encoder.

Perplexity AI

Perplexity is an AI search engine built entirely on RAG. For each user query, it performs web search (retrieving and scraping relevant pages), chunks the retrieved content, re-ranks the chunks for relevance, and generates a cited answer. Perplexity's key innovation is its citation model: each sentence in the response is linked to its source chunk, enabling users to verify claims. The system uses query expansion (generating multiple search queries from the user's question) and multi-hop retrieval (using initial results to refine subsequent queries).

Trade-Offs
Aspect | Description
RAG vs. Fine-Tuning | RAG keeps knowledge external and updatable without retraining (add a document, it is immediately searchable). Fine-tuning bakes knowledge into model weights, requiring retraining for updates but achieving better performance on domain-specific language and reasoning patterns. For factual knowledge that changes frequently, use RAG. For domain-specific style, terminology, or reasoning, fine-tune. Many production systems use both.
Retrieval Precision vs. Recall | Retrieving more chunks (high recall) increases the chance of including the relevant passage but adds noise and costs more tokens. Retrieving fewer chunks (high precision) reduces noise but risks missing the answer. Re-ranking (retrieve many, then score with a cross-encoder to select the best) balances both, at the cost of an additional model inference step (~50-100 ms).
Chunk Size: Small vs. Large | Smaller chunks (128-256 tokens) improve retrieval precision (each chunk is about one idea) but may lack surrounding context needed for the LLM to generate a complete answer. Larger chunks (512-1024 tokens) provide more context but reduce retrieval precision and consume more of the context window. Hierarchical chunking (retrieve small, expand to parent) is the best compromise.
Latency vs. Answer Quality | Each RAG enhancement (hybrid search, re-ranking, query expansion, multi-hop retrieval) improves answer quality but adds latency. A simple vector search + LLM call takes ~500 ms. Adding re-ranking adds ~100 ms, query expansion adds ~300 ms (another LLM call), and multi-hop doubles total latency. Production systems must balance quality against user-facing latency targets.
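Taking the rough figures from the Latency vs. Answer Quality row at face value, a latency budget can be estimated with simple arithmetic. The sketch assumes the additive costs stack linearly and that multi-hop doubles whatever the single-hop total is; real numbers vary with the model, hardware, and network.

```python
BASE_MS = 500             # simple vector search + LLM call
RERANK_MS = 100           # cross-encoder re-ranking step
QUERY_EXPANSION_MS = 300  # extra LLM call to rewrite the query


def estimated_latency_ms(rerank: bool = False,
                         query_expansion: bool = False,
                         multi_hop: bool = False) -> int:
    """Rough end-to-end latency estimate for a given set of enabled enhancements."""
    total = BASE_MS
    if rerank:
        total += RERANK_MS
    if query_expansion:
        total += QUERY_EXPANSION_MS
    if multi_hop:
        total *= 2  # assumption: a second hop repeats the whole single-hop pipeline
    return total
```

Under these assumptions, the baseline costs 500 ms, adding re-ranking and query expansion brings it to 900 ms, and enabling multi-hop on top of that reaches 1,800 ms.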
Case Study

Stripe's Internal Knowledge RAG System

Scenario

Stripe engineers spent significant time searching internal documentation, Confluence pages, code documentation, and Slack threads to find answers to questions about internal systems, APIs, and processes. Information was scattered across 50,000+ documents in multiple systems, and search quality was poor because keyword search missed semantically relevant results.

Solution

Stripe built an internal RAG-powered assistant that indexes all internal documentation, code comments, and curated Slack threads. The system uses semantic chunking (paragraph-level for docs, function-level for code), hybrid search (vector + BM25 with reciprocal rank fusion), and a cross-encoder re-ranker to select the top 5 chunks. The LLM generates answers with inline citations linking to source documents. A feedback mechanism (thumbs up/down) drives continuous improvement of the retrieval pipeline.

Outcome

Engineering time spent searching for internal information decreased by 35%. Answer accuracy (measured by user thumbs-up rate) reached 78%, compared to 45% for the previous keyword search. The system handles 10,000+ queries per day. The most impactful improvement was hybrid search with re-ranking, which increased retrieval recall@5 from 62% (vector-only) to 84%.

Common Mistakes
  • Using fixed-size chunking without considering document structure. Splitting a document every 500 tokens breaks paragraphs, code blocks, and tables at arbitrary points, producing chunks that are incoherent on their own. Use semantic chunking (split at section/paragraph boundaries) and preserve structural elements like code blocks, tables, and lists as atomic chunks.
  • Relying on vector search alone without keyword search. Semantic embeddings excel at finding conceptually similar content but often fail on exact matches (error codes, product names, API endpoints). Hybrid search (BM25 + vector) with reciprocal rank fusion catches both semantic similarity and exact keyword matches, typically improving retrieval quality by 15-30%.
  • Stuffing the entire context window with retrieved chunks. Including 20 chunks in the prompt overwhelms the LLM, increases latency and cost, and often degrades answer quality because the model struggles to find the relevant passage among noise. Retrieve broadly (top 20), re-rank with a cross-encoder, and include only the top 3-5 chunks.
  • No evaluation pipeline for retrieval quality. Teams often evaluate only the final LLM answer, ignoring whether the retrieval step found the right documents. If retrieval fails, no amount of prompt engineering will fix the answer. Build a retrieval evaluation set (question + relevant document pairs) and measure recall@K and MRR independently of the generation step.
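The "retrieve broadly, re-rank, keep only the best few" advice from the third mistake can be sketched as follows. The scoring function is passed in as a parameter because a real system would call a cross-encoder model there; `overlap_score` is only a toy stand-in based on shared words, and the whitespace word count is a rough proxy for tokens. All names and defaults are illustrative.

```python
from typing import Callable


def select_context(candidates: list[str],
                   score_fn: Callable[[str, str], float],
                   query: str,
                   top_n: int = 5,
                   token_budget: int = 2000) -> list[str]:
    """Re-rank candidates with score_fn(query, text); keep up to top_n that fit the token budget."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    selected, used = [], 0
    for text in ranked:
        if len(selected) == top_n:
            break
        cost = len(text.split())  # whitespace word count as a rough token proxy
        if used + cost > token_budget:
            continue  # skip chunks that would overflow the budget
        selected.append(text)
        used += cost
    return selected


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: number of distinct lowercase words shared."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))
```

Passing the 20 candidates from first-stage retrieval through `select_context` with `top_n=5` yields the short, high-relevance context the bullet recommends.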

Test Your Understanding

1. Why is hybrid search (vector + BM25) generally better than vector-only search for RAG?

2. What is the main advantage of RAG over fine-tuning for incorporating new knowledge into an LLM?