A modern interview walkthrough for designing a Retrieval-Augmented Generation (RAG) search system. Covers document ingestion, chunking strategies, embedding models, vector databases, hybrid search, reranking, and LLM-generated answers with citations.
Retrieval-Augmented Generation (RAG) has become one of the most commonly asked system design interview questions since 2024, reflecting the rapid adoption of large language models in production systems. RAG combines information retrieval with LLM-powered generation to answer questions grounded in a specific knowledge base. Grounding reduces (but does not eliminate) the hallucinations of pure LLM-based answers, and citations to source documents let readers verify each claim.
The architecture splits into two pipelines: ingestion and query. The ingestion pipeline processes source documents in stages. First, documents are parsed from their source format (PDF, HTML, Markdown, database records) into plain text. Next, the text is split into chunks, contiguous segments of a few hundred tokens each. Chunking strategy is a critical design decision: fixed-size chunks (e.g., 512 tokens with a 50-token overlap) are simple and predictable, while semantic chunking (splitting at paragraph or section boundaries detected with NLP techniques) preserves the coherence of each chunk but produces variable-length segments. Each chunk is then passed through an embedding model (e.g., OpenAI text-embedding-3-small, or open-source models like E5-large or BGE-large) to produce a dense vector representation (typically 768 to 1536 dimensions). The embeddings are stored in a vector database alongside the original chunk text and metadata.
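The fixed-size strategy described above can be sketched as a sliding window over a token list. This is a minimal illustration, not a production chunker: the function name `chunk_tokens` is hypothetical, and a synthetic token list stands in for the output of a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic tokens stand in for real tokenizer output (an assumption).
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With 1200 tokens, a 512-token window, and a 50-token overlap, each window advances 462 tokens, so the last 50 tokens of one chunk reappear as the first 50 of the next.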
The query pipeline processes user questions in real time. The query text is embedded using the same embedding model used for documents, producing a query vector. This vector is used for approximate nearest neighbor (ANN) search against the vector database, retrieving the top-K most semantically similar chunks (typically K=10-20). ANN search uses algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to find approximate nearest neighbors quickly, typically within a few milliseconds per query even across millions of vectors, at the cost of occasionally missing a true nearest neighbor. The retrieved chunks are then optionally reranked using a cross-encoder model that scores each query-chunk pair more accurately than the embedding similarity alone, reducing the set to the top 3-5 most relevant chunks. Finally, the selected chunks are inserted into a prompt template along with the user's question, and an LLM generates a natural-language answer with citations referencing the source chunks.
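As a rough sketch of the retrieval and prompt-assembly steps, the example below uses exact (brute-force) cosine similarity in place of an ANN index such as HNSW, and skips the optional reranker. The helper names, the three-dimensional toy vectors, and the sample chunk texts are all made up for illustration; a real system would obtain vectors from an embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k):
    """Exact top-K search; `index` is a list of (chunk_id, vector, text)."""
    scored = [(cosine(query_vec, vec), chunk_id, text)
              for chunk_id, vec, text in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def build_prompt(question, hits):
    """Insert retrieved chunks into a prompt with numbered citations."""
    sources = "\n".join(
        f"[{n}] ({chunk_id}) {text}"
        for n, (_, chunk_id, text) in enumerate(hits, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy index: hand-written vectors and texts (illustrative only).
index = [
    ("doc1#0", [1.0, 0.0, 0.0], "Refunds are processed within 5 days."),
    ("doc2#3", [0.0, 1.0, 0.0], "Shipping is free over $50."),
    ("doc1#1", [0.9, 0.1, 0.0], "Refund requests need an order number."),
]
hits = top_k([1.0, 0.0, 0.0], index, k=2)
prompt = build_prompt("How long do refunds take?", hits)
print(prompt)
```

The resulting prompt string is what would be sent to the LLM; the numbered source labels give the model something concrete to cite.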
Hybrid search is an important advanced pattern. Pure vector search excels at semantic matching (finding conceptually similar content) but can miss exact keyword matches that a traditional BM25 index would catch. Hybrid search runs both a BM25 keyword search and a vector similarity search in parallel, then fuses the results using reciprocal rank fusion (RRF). RRF scores each result by summing 1/(k + rank) over the result lists in which it appears, where k is a smoothing constant (commonly 60), so results that rank highly in either list, and especially in both, rise to the top. Because it uses only ranks, RRF needs no calibration between BM25 scores and vector similarities. This approach handles both the case where the user asks a conceptual question (vector search dominates) and the case where they search for a specific error message or product name (keyword search dominates).
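Reciprocal rank fusion is small enough to sketch directly. The version below assumes each search returns an ordered list of chunk IDs and uses the commonly cited constant k=60; the chunk IDs and the function name are placeholders for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["c7", "c2", "c9"]    # keyword search results (toy data)
vector_ranking = ["c2", "c4", "c7"]  # vector search results (toy data)
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Here `c2` wins because it ranks near the top of both lists, while `c4` and `c9` each appear in only one list and fall behind.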
The Research Assistant Analogy
A RAG system works like a research assistant in a library. When you ask a question, the assistant does not try to answer from memory alone (which might be inaccurate or outdated). Instead, they walk to the shelves (vector database), find the most relevant books and pages (retrieval), photocopy the key passages (context), and then write a summary answer that cites each source (generation). If the assistant only used their memory, they might confidently state incorrect facts (hallucination). By grounding every claim in a specific source, the answer is verifiable and trustworthy.
Perplexity AI
Perplexity AI is a RAG-native search engine that answers user queries by searching the web, retrieving relevant pages, and generating a cited answer using an LLM. Their pipeline includes real-time web crawling, content extraction, chunk embedding, and reranking. Each answer includes inline citations that link to the source pages, allowing users to verify every claim. Perplexity demonstrates RAG at search-engine scale with sub-second query latency.
Notion AI
Notion AI implements workspace-scoped RAG that answers questions about a user's own Notion pages, databases, and wikis. The ingestion pipeline continuously indexes workspace content as it changes, using incremental embedding updates rather than re-embedding the entire workspace. Access control is enforced at retrieval time: the vector search only returns chunks from pages the querying user has permission to read, preventing information leakage.
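Permission filtering at retrieval time can be illustrated with a small in-memory sketch. This is not Notion's implementation: the `Chunk` type, the `allowed_users` field, and the dot-product scoring are assumptions chosen to show the idea that unauthorized chunks are excluded before ranking, so they can never reach the prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    vector: list
    allowed_users: set = field(default_factory=set)  # hypothetical ACL field


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_with_acl(query_vec, chunks, user_id, k=3):
    """Rank only the chunks that `user_id` is allowed to read."""
    visible = [c for c in chunks if user_id in c.allowed_users]
    visible.sort(key=lambda c: dot(query_vec, c.vector), reverse=True)
    return [c.chunk_id for c in visible[:k]]


# Toy chunks with hand-written vectors and ACLs (illustrative only).
chunks = [
    Chunk("roadmap#0", [0.9, 0.1], {"alice"}),
    Chunk("handbook#2", [0.7, 0.3], {"alice", "bob"}),
    Chunk("salaries#1", [1.0, 0.0], {"carol"}),
]
alice_hits = search_with_acl([1.0, 0.0], chunks, "alice")
bob_hits = search_with_acl([1.0, 0.0], chunks, "bob")
print(alice_hits, bob_hits)
```

Even though `salaries#1` is the closest match to the query, it is never returned to Alice or Bob. In a real vector database the same effect is usually achieved with a metadata filter applied as part of the search.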
Shopify Sidekick
Shopify Sidekick uses RAG to answer merchant questions about their store data, Shopify documentation, and e-commerce best practices. The system embeds both structured data (product catalogs, order history) and unstructured data (help articles, community forums). A metadata-aware retrieval layer filters chunks by merchant-specific context (their plan tier, enabled features, store category) before reranking, ensuring answers are relevant to the specific merchant's situation.
| Aspect | Description |
|---|---|
| Chunking Granularity | Smaller chunks (128-256 tokens) provide higher retrieval precision because each chunk covers a focused topic, but they lose surrounding context that may be needed for the LLM to generate a coherent answer. Larger chunks (512-1024 tokens) preserve more context but may include irrelevant content that dilutes the embedding and confuses the LLM. |
| Embedding Model: Proprietary vs Open-Source | Proprietary models (OpenAI, Cohere) offer high quality with simple API integration but create vendor lock-in and ongoing per-token costs. Open-source models (E5, BGE, GTE) can be self-hosted for zero marginal cost and full control but require GPU infrastructure for serving and may lag behind proprietary models in quality. |
| Vector DB: Managed vs Self-Hosted | Managed vector databases (Pinecone, Weaviate Cloud) offer operational simplicity and auto-scaling but have higher per-query costs at scale. Self-hosted options (pgvector, Milvus, Qdrant) provide cost efficiency and data sovereignty but require significant operational expertise for sharding, replication, and index tuning. |
| Latency vs Answer Quality | Adding a reranking step improves precision by 15-25% but adds 50-200ms to the query pipeline. Using a larger LLM (GPT-4 vs GPT-3.5) produces better-synthesized answers but increases generation latency from 1-2 seconds to 3-8 seconds. The right trade-off depends on whether the application prioritizes speed or accuracy. |
Notion AI's Incremental RAG Pipeline for Workspace Search
Scenario
Notion needed to provide AI-powered Q&A across user workspaces containing millions of pages of varied content (text documents, databases, embedded files). The initial approach re-embedded the entire workspace nightly, which was slow (hours for large workspaces), expensive (embedding costs), and stale (changes during the day were not searchable until the next morning). Users expected their most recent edits to be immediately searchable.
Solution
Notion implemented an incremental ingestion pipeline triggered by document change events. When a page is created or updated, only that page's chunks are re-embedded and upserted into the vector store. Deleted pages have their chunks removed. The system uses semantic chunking based on Notion's block structure (headings, paragraphs, list items) rather than fixed-size splitting, preserving the natural document structure. A metadata layer associates each chunk with the page's access control list (ACL), enabling permission-filtered retrieval. The query pipeline uses hybrid search (Notion's own keyword index + vector similarity) with reciprocal rank fusion, followed by a lightweight reranker that considers both semantic relevance and recency.
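The event-driven update described above might look roughly like the following sketch. It is an assumption-laden illustration rather than Notion's code: the `IncrementalIndex` class, its handler names, the blank-line chunker (standing in for block-based chunking), and the fake embedding function are all hypothetical.

```python
class IncrementalIndex:
    """In-memory stand-in for a vector store keyed by (page_id, chunk_no)."""

    def __init__(self, embed, chunker):
        self.embed = embed      # function: text -> vector
        self.chunker = chunker  # function: page text -> list of chunk texts
        self.vectors = {}       # (page_id, chunk_no) -> (vector, text)
        self.embed_calls = 0    # counts embedding work for illustration

    def on_page_changed(self, page_id, text):
        """Re-embed and upsert only the chunks of the changed page."""
        self.on_page_deleted(page_id)  # drop stale chunks first
        for n, chunk in enumerate(self.chunker(text)):
            self.vectors[(page_id, n)] = (self.embed(chunk), chunk)
            self.embed_calls += 1

    def on_page_deleted(self, page_id):
        """Remove every chunk that belongs to the page."""
        for key in [k for k in self.vectors if k[0] == page_id]:
            del self.vectors[key]


index = IncrementalIndex(
    embed=lambda text: [float(len(text))],  # fake embedding (assumption)
    chunker=lambda t: [b for b in t.split("\n\n") if b.strip()],
)
index.on_page_changed("p1", "Intro\n\nDetails")
index.on_page_changed("p2", "Notes")
calls_before = index.embed_calls
index.on_page_changed("p1", "Intro\n\nDetails\n\nAppendix")  # edit p1 only
index.on_page_deleted("p2")
print(index.embed_calls - calls_before, sorted(index.vectors))
```

Editing `p1` triggers three embedding calls for that page and none for `p2`, which is where the cost saving over a nightly full re-embed comes from. A finer-grained variant could hash each chunk and skip unchanged ones, at the price of more bookkeeping.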
Outcome
Incremental embedding reduced the median indexing delay from 12 hours to under 30 seconds for newly edited content. Embedding costs dropped by 85% because only changed content was re-embedded. The hybrid search approach improved retrieval relevance by 22% over pure vector search, particularly for queries containing specific terms (project names, code references) that exact keyword matching handles better. Permission-filtered retrieval ensured zero information leakage across user boundaries in shared workspaces.
1. What is the primary purpose of the reranking step in a RAG pipeline?
2. Why is hybrid search (BM25 + vector similarity) recommended over pure vector search in RAG systems?
3. What is the 'faithfulness' metric in the RAGAS evaluation framework?