Retrieval-Augmented Generation (RAG) grounds LLM responses in external knowledge by retrieving relevant documents at query time and including them in the prompt context. It combines the fluency of generative models with the accuracy and recency of a searchable knowledge base, without the cost and latency of fine-tuning.
RAG was introduced in a 2020 paper by Facebook AI Research (Lewis et al.) as a way to give language models access to external knowledge without baking it into the model's weights. The core idea is simple: before the LLM generates a response, retrieve relevant documents from a knowledge base and include them in the prompt. This gives the model access to information that may be more recent, more accurate, or more domain-specific than what it learned during pretraining.
The RAG pipeline has three stages. First, offline indexing: documents are split into chunks (typically 200-1000 tokens), each chunk is embedded into a dense vector using an embedding model (OpenAI's text-embedding-3 family, Cohere embed-v3, or open-source models like BGE/E5), and the vectors are stored in a vector database (Pinecone, Weaviate, Qdrant, pgvector, or Elasticsearch with dense vector support). Second, online retrieval: the user's query is embedded with the same embedding model used at indexing time, so that query and chunk vectors share one vector space, and the vector database returns the top-K most similar chunks via approximate nearest neighbor (ANN) search. Third, generation: the retrieved chunks are formatted into the LLM's prompt (typically as a 'context' section before the user's question), and the LLM generates a response grounded in this context.
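The indexing and retrieval stages above can be sketched in a few lines. This is a toy illustration only: `embed` is a bag-of-words stand-in for a real embedding model, the "index" is a plain list scanned exhaustively rather than an ANN structure, and all function names and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse term-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Offline indexing: embed every chunk once and store the vectors."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Online retrieval: embed the query the same way, return the top-k chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


index = build_index([
    "to reset an api key open settings and click regenerate",
    "invoices are emailed on the first day of each month",
    "api key permissions can be scoped per project",
])
top_chunks = retrieve(index, "how do i reset my api key", k=2)
print(top_chunks)
```

A real system would swap `embed` for a call to an embedding model and the sorted scan for a vector database query; the retrieved chunks would then be placed in the LLM prompt for the generation stage.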
Chunking strategy is one of the most underestimated design decisions in RAG. Naive fixed-size chunking (split every 500 tokens) breaks documents at arbitrary points, potentially splitting a critical paragraph across two chunks. Semantic chunking (split at paragraph or section boundaries) preserves coherence but produces variable-size chunks. Hierarchical chunking (store both parent sections and child paragraphs, retrieve at the child level, expand to the parent for context) balances precision and coherence. Overlapping chunks (50-100 token overlap) reduce the risk that information sitting on a chunk boundary is cut in half, because the boundary region appears in both neighboring chunks. The optimal strategy depends on the document type: code needs function-level chunks, legal contracts need clause-level chunks, and research papers need section-level chunks.
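As a rough sketch of fixed-size chunking with overlap, the function below slides a window over a token list. It assumes whitespace-separated words as a stand-in for model tokens, and the name `chunk_with_overlap` and the tiny sizes are illustrative only.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the next window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]
chunks = chunk_with_overlap(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `size=4` and `overlap=1`, each chunk starts on the last token of the previous one, so a sentence that straddles a boundary is more likely to survive intact in at least one chunk.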
Retrieval quality is usually the bottleneck. Pure semantic search (vector similarity) excels at finding conceptually related content but misses exact keyword matches ('error code ERR_42'). Pure keyword search (BM25) finds exact matches but misses semantic relationships ('car' vs. 'automobile'). Hybrid search combines both: run BM25 and vector search in parallel, merge the two ranked lists with reciprocal rank fusion (RRF), and optionally re-score the merged candidates with a cross-encoder re-ranker. In production RAG systems, hybrid search with re-ranking is commonly reported to improve answer quality by 15-30% over vector-only search, though the gain depends on the corpus and the query mix. Additional retrieval enhancements include metadata filtering (restrict search to documents from a specific source or date range), query expansion (rewrite the user query into multiple search queries), and multi-hop retrieval (use initial results to formulate follow-up queries for complex questions).
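Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over every ranked list it appears in, so it needs only ranks, not comparable scores. A minimal sketch follows; the document IDs are made up, and k = 60 is the commonly used default constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "error code ERR_42".
bm25_ranking = ["doc_err42", "doc_codes", "doc_faq"]
vector_ranking = ["doc_troubleshoot", "doc_err42", "doc_codes"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document found by only one retriever still stays in the candidate list for a later re-ranking step.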
Customer Support RAG System
A SaaS company's customer support bot uses RAG to answer questions from a 5,000-page documentation corpus. When a user asks 'How do I reset my API key?', the query is embedded and the vector database retrieves the 3 most relevant documentation chunks (from the 'API Key Management' page). These chunks are injected into the LLM prompt: 'Based on the following documentation, answer the user question: [chunks] Question: How do I reset my API key?' The LLM generates a step-by-step answer grounded in the actual docs, with a link to the source page. Without RAG, the LLM might hallucinate a procedure that does not match the product's actual UI.
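The prompt assembly step in this example might look like the sketch below. The `Chunk` type, the numbering scheme, and the `docs.example.com` URL are assumptions added for illustration; only the instruction wording is taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source_url: str  # hypothetical link back to the documentation page


def build_prompt(chunks: list[Chunk], question: str) -> str:
    """Format retrieved chunks as a numbered context section ahead of the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.text}\n(source: {chunk.source_url})"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Based on the following documentation, answer the user question. "
        "Cite the numbered sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


retrieved = [
    Chunk("Open Settings, select API Keys, then click Regenerate.",
          "https://docs.example.com/api-key-management"),
    Chunk("Regenerating a key immediately revokes the old key.",
          "https://docs.example.com/api-key-management"),
]
prompt = build_prompt(retrieved, "How do I reset my API key?")
print(prompt)
```

Numbering the chunks gives the model a simple handle for citations, which is one way to produce the source link mentioned above.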
OpenAI (ChatGPT with Browse & File Search)
ChatGPT's file search and retrieval capabilities use RAG under the hood. When a user uploads documents or enables browsing, the system chunks the content, generates embeddings, and retrieves relevant passages to ground the LLM's responses. OpenAI's Assistants API exposes RAG as a first-class feature: developers upload files, and the API automatically handles chunking, embedding, vector storage, and retrieval with hybrid search.
Notion AI
Notion AI uses RAG to answer questions about a user's workspace. When a user asks 'What was decided in last week's product meeting?', the system retrieves relevant pages and blocks from the user's Notion workspace using a combination of semantic search and metadata filtering (page type, date, workspace). Retrieval respects Notion's permission model: the LLM only sees content the user has access to. The system uses a multi-stage retrieval pipeline: coarse vector search, then fine-grained re-ranking with a cross-encoder.
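Permission-aware retrieval in general can be sketched as a filter applied before ranking, so that restricted content never reaches the prompt. This is not Notion's implementation; the `Block` type, its fields, and the precomputed `score` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: str
    text: str
    allowed_users: set[str]
    score: float  # similarity score from an upstream retriever (assumed)


def visible_top_k(candidates: list[Block], user: str, k: int) -> list[Block]:
    """Drop blocks the user cannot read, then keep the k highest-scoring ones."""
    permitted = [block for block in candidates if user in block.allowed_users]
    return sorted(permitted, key=lambda block: block.score, reverse=True)[:k]


candidates = [
    Block("b1", "Product meeting: ship v2 in March.", {"ana", "raj"}, 0.91),
    Block("b2", "Board notes: confidential budget.", {"raj"}, 0.95),
    Block("b3", "Meeting template.", {"ana", "raj"}, 0.40),
]
for block in visible_top_k(candidates, user="ana", k=2):
    print(block.block_id, block.text)
```

Filtering before truncating to k matters: if the top-k cut came first, a user could end up with fewer results than requested even when enough readable content exists.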
Perplexity AI
Perplexity is an AI search engine built entirely on RAG. For each user query, it performs web search (retrieving and scraping relevant pages), chunks the retrieved content, re-ranks the chunks for relevance, and generates a cited answer. Perplexity's key innovation is its citation model: each sentence in the response is linked to its source chunk, enabling users to verify claims. The system uses query expansion (generating multiple search queries from the user's question) and multi-hop retrieval (using initial results to refine subsequent queries).
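Query expansion in general can be sketched as running several query variants and keeping each document's best score. This is not Perplexity's implementation: the variants are assumed to come from an upstream LLM call, and `search` is a hypothetical callable returning `(doc_id, score)` pairs.

```python
from typing import Callable

SearchFn = Callable[[str], list[tuple[str, float]]]


def expanded_search(queries: list[str], search: SearchFn, k: int) -> list[str]:
    """Run every query variant and merge results, keeping each document's best score."""
    best: dict[str, float] = {}
    for query in queries:
        for doc_id, score in search(query):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Highest score first; ties broken by ID for a deterministic order.
    return sorted(best, key=lambda doc_id: (-best[doc_id], doc_id))[:k]


# Canned results standing in for a real search backend.
FAKE_RESULTS = {
    "rag vs fine-tuning": [("page_a", 0.80), ("page_b", 0.60)],
    "when to fine-tune an llm": [("page_b", 0.75), ("page_c", 0.70)],
}


def fake_search(query: str) -> list[tuple[str, float]]:
    return FAKE_RESULTS.get(query, [])


merged = expanded_search(list(FAKE_RESULTS), fake_search, k=3)
print(merged)
```

Multi-hop retrieval would extend this loop by generating new query variants from the first round of results before searching again.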
| Aspect | Description |
|---|---|
| RAG vs. Fine-Tuning | RAG keeps knowledge external and updatable without retraining (add a document, it is immediately searchable). Fine-tuning bakes knowledge into model weights, requiring retraining for updates but achieving better performance on domain-specific language and reasoning patterns. For factual knowledge that changes frequently, use RAG. For domain-specific style, terminology, or reasoning, fine-tune. Many production systems use both. |
| Retrieval Precision vs. Recall | Retrieving more chunks (high recall) increases the chance of including the relevant passage but adds noise and costs more tokens. Retrieving fewer chunks (high precision) reduces noise but risks missing the answer. Re-ranking (retrieve many, then score with a cross-encoder to select the best) balances both, at the cost of an additional model inference step (~50-100ms). |
| Chunk Size: Small vs. Large | Smaller chunks (128-256 tokens) improve retrieval precision (each chunk is about one idea) but may lack surrounding context needed for the LLM to generate a complete answer. Larger chunks (512-1024 tokens) provide more context but reduce retrieval precision and consume more of the context window. Hierarchical chunking (retrieve small, expand to parent) is the best compromise. |
| Latency vs. Answer Quality | Each RAG enhancement (hybrid search, re-ranking, query expansion, multi-hop retrieval) improves answer quality but adds latency. A simple vector search + LLM call takes ~500ms. Adding re-ranking adds ~100ms, query expansion adds ~300ms (another LLM call), and multi-hop doubles total latency. Production systems must balance quality against user-facing latency targets. |
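
The latency figures in the last row can be combined into a rough budget calculator. The sketch below uses only the approximate numbers quoted in the table and assumes that multi-hop doubles the total of everything else; real latencies vary by model and infrastructure.

```python
def estimated_latency_ms(rerank: bool = False,
                         query_expansion: bool = False,
                         multi_hop: bool = False) -> int:
    """Rough end-to-end latency estimate built from the table's approximate figures."""
    total = 500  # simple vector search + LLM call
    if rerank:
        total += 100  # cross-encoder inference
    if query_expansion:
        total += 300  # one extra LLM call to rewrite the query
    if multi_hop:
        total *= 2  # assumption: a second full retrieval round doubles the total
    return total


print(estimated_latency_ms())
print(estimated_latency_ms(rerank=True))
print(estimated_latency_ms(rerank=True, query_expansion=True, multi_hop=True))
```

Comparing these estimates against a user-facing latency target makes it easier to decide which enhancements a given product can afford.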
Stripe's Internal Knowledge RAG System
Scenario
Stripe engineers spent significant time searching internal documentation, Confluence pages, code documentation, and Slack threads to find answers to questions about internal systems, APIs, and processes. Information was scattered across 50,000+ documents in multiple systems, and search quality was poor because keyword search missed semantically relevant results.
Solution
Stripe built an internal RAG-powered assistant that indexes all internal documentation, code comments, and curated Slack threads. The system uses semantic chunking (paragraph-level for docs, function-level for code), hybrid search (vector + BM25 with reciprocal rank fusion), and a cross-encoder re-ranker to select the top 5 chunks. The LLM generates answers with inline citations linking to source documents. A feedback mechanism (thumbs up/down) drives continuous improvement of the retrieval pipeline.
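The "retrieve many, then re-rank to the top 5" step can be sketched with the scoring model passed in as a function. This is a generic illustration, not the system described above: `token_overlap` is a toy stand-in for a cross-encoder, and all names and sample texts are hypothetical.

```python
from typing import Callable


def rerank(query: str,
           candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Score each (query, candidate) pair jointly and keep the top_n candidates."""
    scored = [(score_pair(query, candidate), candidate) for candidate in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_n]]


def token_overlap(query: str, candidate: str) -> float:
    """Toy relevance score: number of distinct words shared by query and candidate."""
    return float(len(set(query.lower().split()) & set(candidate.lower().split())))


candidates = [
    "billing cycle overview",
    "how to rotate an api key",
    "api rate limits",
    "key concepts",
]
top = rerank("rotate api key", candidates, token_overlap, top_n=2)
print(top)
```

A real cross-encoder reads the query and the chunk together in one forward pass, which is more accurate than comparing independent embeddings but costs one model inference per candidate.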
Outcome
Engineering time spent searching for internal information decreased by 35%. Answer accuracy (measured by user thumbs-up rate) reached 78%, compared to 45% for the previous keyword search. The system handles 10,000+ queries per day. The most impactful improvement was hybrid search with re-ranking, which increased retrieval recall@5 from 62% (vector-only) to 84%.
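A metric such as recall@5 can be computed as sketched below. The definition used here (fraction of the relevant documents that appear in the top k, averaged over queries) is an assumption, since the case study does not state its exact formula, and the sample data is invented.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear among the top-k retrieved."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a list of (retrieved, relevant) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["a", "b", "c", "d", "e", "f"], {"b", "f", "z"}),  # only "b" is in the top 5
    (["x", "y"], {"x"}),                                # the single relevant doc is found
]
print(round(mean_recall_at_k(runs, k=5), 3))
```

Tracking this number separately for vector-only and hybrid retrieval on the same labeled queries is what makes a comparison like the one above possible.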
1. Why is hybrid search (vector + BM25) generally better than vector-only search for RAG?
2. What is the main advantage of RAG over fine-tuning for incorporating new knowledge into an LLM?