Interview Walkthrough: RAG-Powered Search

A modern interview walkthrough for designing a Retrieval-Augmented Generation (RAG) search system. Covers document ingestion, chunking strategies, embedding models, vector databases, hybrid search, reranking, and LLM-generated answers with citations.

Overview

Retrieval-Augmented Generation (RAG) has become one of the most commonly asked system design interview questions since 2024, reflecting the rapid adoption of large language models in production systems. RAG combines information retrieval with LLM-powered generation to answer questions grounded in a specific knowledge base. Grounding the answer in retrieved passages and citing the source documents reduces, though it does not eliminate, the hallucination problem of pure LLM-based answers.

The architecture splits into two pipelines: ingestion and query. The ingestion pipeline processes source documents through a series of stages. First, documents are parsed from their source format (PDF, HTML, Markdown, database records) into plain text. Next, the text is split into chunks -- contiguous segments of a few hundred tokens each. Chunking strategy is a critical design decision: fixed-size chunks (e.g., 512 tokens with 50-token overlap) are simple and predictable, while semantic chunking (splitting at paragraph or section boundaries using NLP) preserves the coherence of each chunk but produces variable-length segments. Each chunk is then passed through an embedding model (e.g., OpenAI text-embedding-3-small, or open-source models like E5-large or BGE-large) to produce a dense vector representation (typically 768 to 1536 dimensions). The embeddings are stored in a vector database alongside the original chunk text and metadata.
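
The fixed-size strategy described above can be sketched in a few lines. This is a minimal illustration, not a production chunker: `chunk_tokens` is a hypothetical helper, and whitespace splitting stands in for the tokenizer of whichever embedding model is actually used.

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace tokens approximate model tokens for the demo.
    words = "the quick brown fox jumps over the lazy dog again".split()
    for chunk in chunk_tokens(words, size=4, overlap=1):
        print(" ".join(chunk))
```

Each window repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.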

The query pipeline processes user questions in real time. The query text is embedded using the same embedding model used for documents, producing a query vector. This vector is used for approximate nearest neighbor (ANN) search against the vector database, retrieving the top-K most semantically similar chunks (typically K=10-20). ANN search uses algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to find approximate nearest neighbors in roughly milliseconds, even across millions of vectors, by trading a small amount of recall for speed. The retrieved chunks are then optionally reranked using a cross-encoder model that scores each query-chunk pair more accurately than the embedding similarity alone, reducing the set to the top 3-5 most relevant chunks. Finally, the selected chunks are inserted into a prompt template along with the user's question, and an LLM generates a natural-language answer with citations referencing the source chunks.
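
The retrieve-then-prompt flow can be sketched as follows. This is an illustrative sketch under stated assumptions: `top_k` does an exact cosine scan as a stand-in for the ANN index a real vector database would use, the vectors are hand-written rather than produced by an embedding model, and the prompt wording in `build_prompt` is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector, text). Exact scan, standing in for ANN."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks):
    """Number each retrieved chunk so the LLM can cite it as [n]."""
    sources = "\n".join(
        f"[{n}] {text}" for n, (_, _, text) in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below and cite them as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


if __name__ == "__main__":
    # Assumption: toy 2-dimensional vectors instead of real embeddings.
    index = [
        ("a", [1.0, 0.0], "Refunds are issued within 5 days."),
        ("b", [0.0, 1.0], "Shipping takes 3 days."),
        ("c", [0.9, 0.1], "Refund requests need an order ID."),
    ]
    hits = top_k([1.0, 0.0], index, k=2)
    print(build_prompt("How do refunds work?", hits))
```

The numbered source list is what makes citations checkable: the answer's `[n]` markers map back to specific chunk IDs.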

Hybrid search is an important advanced pattern. Pure vector search excels at semantic matching (finding conceptually similar content) but can miss exact keyword matches that a traditional BM25 index would catch. Hybrid search runs both a BM25 keyword search and a vector similarity search in parallel, then fuses the results using reciprocal rank fusion (RRF). RRF scores each result as the sum of 1 / (k + rank) over the result lists it appears in, where k is a small smoothing constant, so results that rank highly in either list or both receive the most credit. Because it uses ranks rather than raw scores, RRF needs no normalization between BM25 scores and vector similarities. This approach handles both the case where the user asks a conceptual question (vector search dominates) and the case where they search for a specific error message or product name (keyword search dominates).
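
A minimal sketch of reciprocal rank fusion, assuming each search backend returns an ordered list of chunk IDs. The function name and the sample IDs are illustrative; `k=60` is the constant commonly used with RRF, not something the text above prescribes.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_hits = ["d1", "d2", "d3"]    # hypothetical keyword results
    vector_hits = ["d3", "d1", "d4"]  # hypothetical vector results
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```

In this example `d1` and `d3` appear in both lists and therefore outrank `d2` and `d4`, which each appear in only one.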

Key Points
  1. The ingestion pipeline (chunk -> embed -> store) runs offline and is the foundation of RAG quality. Poor chunking or an unsuitable embedding model will degrade retrieval accuracy regardless of how good the LLM is.
  2. Chunking strategy has the largest single impact on retrieval quality. Fixed-size chunks with overlap are robust but may split critical context across boundaries. Semantic chunking preserves paragraph and section coherence but produces uneven chunk sizes.
  3. Hybrid search (BM25 keyword + vector similarity fused via reciprocal rank fusion) consistently outperforms either search method alone because it captures both exact keyword matches and semantic similarity.
  4. Cross-encoder reranking dramatically improves precision. Bi-encoder embeddings are fast but coarse; a cross-encoder scores each query-document pair more accurately by attending to the interaction between them, typically improving top-3 precision by 15-25%.
  5. The LLM generation step must include citation generation: each claim in the answer should reference the specific chunk it was derived from, enabling the user to verify the answer and building trust in the system.
  6. Evaluation frameworks like RAGAS measure RAG quality across multiple dimensions: faithfulness (is the answer supported by retrieved context?), relevance (are retrieved chunks relevant?), and recall (are all relevant chunks retrieved?).
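
The reranking step from the key points can be sketched with a pluggable scoring function. This is an assumption-laden sketch: `token_overlap_score` is a toy stand-in so the example runs without a model, whereas a real system would pass a cross-encoder's pairwise score as `score_fn`.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-score (chunk_id, text) pairs against the query and keep the best top_n."""
    scored = [(score_fn(query, text), chunk_id, text) for chunk_id, text in candidates]
    # Stable sort: candidates with equal scores keep their first-stage order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk_id, text) for _, chunk_id, text in scored[:top_n]]


def token_overlap_score(query, text):
    """Toy stand-in for a cross-encoder: share of query tokens found in the chunk."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


if __name__ == "__main__":
    candidates = [  # hypothetical first-stage results
        ("c1", "billing and invoices overview"),
        ("c2", "how to reset your password"),
        ("c3", "password policy and reset rules"),
    ]
    print(rerank("reset password", candidates, token_overlap_score, top_n=2))
```

Keeping the scorer as a parameter makes the latency trade-off easy to test: the same pipeline can run with or without the expensive model.
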
Simple Example

The Research Assistant Analogy

A RAG system works like a research assistant in a library. When you ask a question, the assistant does not try to answer from memory alone (which might be inaccurate or outdated). Instead, they walk to the shelves (vector database), find the most relevant books and pages (retrieval), photocopy the key passages (context), and then write a summary answer that cites each source (generation). If the assistant only used their memory, they might confidently state incorrect facts (hallucination). By grounding every claim in a specific source, the answer is verifiable and trustworthy.

Real-World Examples

Perplexity AI

Perplexity AI is a RAG-native search engine that answers user queries by searching the web, retrieving relevant pages, and generating a cited answer using an LLM. Their pipeline includes real-time web crawling, content extraction, chunk embedding, and reranking. Each answer includes inline citations that link to the source pages, allowing users to verify every claim. Perplexity demonstrates RAG at search-engine scale with sub-second query latency.

Notion AI

Notion AI implements workspace-scoped RAG that answers questions about a user's own Notion pages, databases, and wikis. The ingestion pipeline continuously indexes workspace content as it changes, using incremental embedding updates rather than re-embedding the entire workspace. Access control is enforced at retrieval time: the vector search only returns chunks from pages the querying user has permission to read, preventing information leakage.
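
Permission-filtered retrieval of this kind can be illustrated generically. The sketch below is not Notion's implementation: the `allowed_groups` and `score` fields and the `search_with_acl` function are hypothetical, and a real vector database would usually apply the filter inside the index query rather than in application code.

```python
def search_with_acl(scored_chunks, user_groups, k=3):
    """Return the k best chunks the user may read, filtering before ranking."""
    groups = set(user_groups)
    visible = [c for c in scored_chunks if groups & set(c["allowed_groups"])]
    visible.sort(key=lambda c: c["score"], reverse=True)
    return visible[:k]


if __name__ == "__main__":
    chunks = [  # hypothetical chunks with similarity scores already computed
        {"id": "hr-1", "score": 0.95, "allowed_groups": ["hr"]},
        {"id": "eng-1", "score": 0.80, "allowed_groups": ["eng", "all"]},
        {"id": "pub-1", "score": 0.60, "allowed_groups": ["all"]},
    ]
    for chunk in search_with_acl(chunks, user_groups=["eng", "all"], k=2):
        print(chunk["id"], chunk["score"])
```

Filtering before the top-K cut matters: filtering afterwards could return fewer than K results, or none, when the best-scoring chunks are ones the user cannot see.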

Shopify Sidekick

Shopify Sidekick uses RAG to answer merchant questions about their store data, Shopify documentation, and e-commerce best practices. The system embeds both structured data (product catalogs, order history) and unstructured data (help articles, community forums). A metadata-aware retrieval layer filters chunks by merchant-specific context (their plan tier, enabled features, store category) before reranking, ensuring answers are relevant to the specific merchant's situation.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Chunking Granularity | Smaller chunks (128-256 tokens) provide higher retrieval precision because each chunk covers a focused topic, but they lose surrounding context that may be needed for the LLM to generate a coherent answer. Larger chunks (512-1024 tokens) preserve more context but may include irrelevant content that dilutes the embedding and confuses the LLM. |
| Embedding Model: Proprietary vs Open-Source | Proprietary models (OpenAI, Cohere) offer high quality with simple API integration but create vendor lock-in and ongoing per-token costs. Open-source models (E5, BGE, GTE) can be self-hosted for zero marginal cost and full control but require GPU infrastructure for serving and may lag behind proprietary models in quality. |
| Vector DB: Managed vs Self-Hosted | Managed vector databases (Pinecone, Weaviate Cloud) offer operational simplicity and auto-scaling but have higher per-query costs at scale. Self-hosted options (pgvector, Milvus, Qdrant) provide cost efficiency and data sovereignty but require significant operational expertise for sharding, replication, and index tuning. |
| Latency vs Answer Quality | Adding a reranking step improves precision by 15-25% but adds 50-200ms to the query pipeline. Using a larger LLM (GPT-4 vs GPT-3.5) produces better-synthesized answers but increases generation latency from 1-2 seconds to 3-8 seconds. The right trade-off depends on whether the application prioritizes speed or accuracy. |
Case Study

Notion AI's Incremental RAG Pipeline for Workspace Search

Scenario

Notion needed to provide AI-powered Q&A across user workspaces containing millions of pages of varied content (text documents, databases, embedded files). The initial approach re-embedded the entire workspace nightly, which was slow (hours for large workspaces), expensive (embedding costs), and stale (changes during the day were not searchable until the next morning). Users expected their most recent edits to be immediately searchable.

Solution

Notion implemented an incremental ingestion pipeline triggered by document change events. When a page is created or updated, only that page's chunks are re-embedded and upserted into the vector store. Deleted pages have their chunks removed. The system uses semantic chunking based on Notion's block structure (headings, paragraphs, list items) rather than fixed-size splitting, preserving the natural document structure. A metadata layer associates each chunk with the page's access control list (ACL), enabling permission-filtered retrieval. The query pipeline uses hybrid search (Notion's own keyword index + vector similarity) with reciprocal rank fusion, followed by a lightweight reranker that considers both semantic relevance and recency.
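
The event-driven upsert pattern can be sketched with an in-memory store. This is a generic illustration rather than Notion's code: `IncrementalIndex` and its methods are hypothetical, `embed_fn` stands in for a real embedding call, and skipping unchanged chunks by content hash is an extra assumption that goes one step beyond the per-page re-embedding described above.

```python
import hashlib


class IncrementalIndex:
    """In-memory stand-in for a vector store keyed by (page_id, chunk_no)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # assumption: returns a vector for a text
        self.store = {}
        self.embed_calls = 0      # counts how much embedding work was done

    def upsert_page(self, page_id, chunks):
        """Handle a create/update event: re-embed only chunks whose text changed."""
        for chunk_no, text in enumerate(chunks):
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            existing = self.store.get((page_id, chunk_no))
            if existing and existing["hash"] == digest:
                continue  # unchanged chunk, keep the stored vector
            self.store[(page_id, chunk_no)] = {
                "hash": digest,
                "vector": self.embed_fn(text),
                "text": text,
            }
            self.embed_calls += 1
        # Remove trailing chunks left over from a longer previous version.
        stale = [key for key in self.store if key[0] == page_id and key[1] >= len(chunks)]
        for key in stale:
            del self.store[key]

    def delete_page(self, page_id):
        """Handle a delete event: drop every chunk of the page."""
        for key in [key for key in self.store if key[0] == page_id]:
            del self.store[key]


if __name__ == "__main__":
    index = IncrementalIndex(embed_fn=lambda text: [float(len(text))])
    index.upsert_page("p1", ["intro", "details"])
    index.upsert_page("p1", ["intro", "new details"])  # only chunk 1 is re-embedded
    print(index.embed_calls, len(index.store))
```

Because positions key the chunks, an insertion near the top of a page shifts every later chunk and forces re-embedding; keying by a stable block ID would avoid that at the cost of more bookkeeping.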

Outcome

Incremental embedding reduced the median indexing delay from 12 hours to under 30 seconds for newly edited content. Embedding costs dropped by 85% because only changed content was re-embedded. The hybrid search approach improved retrieval relevance by 22% over pure vector search, particularly for queries containing specific terms (project names, code references) that exact keyword matching handles better. Permission-filtered retrieval ensured zero information leakage across user boundaries in shared workspaces.

Common Mistakes
  • Treating chunking as a trivial preprocessing step. The choice of chunk size, overlap, and splitting strategy has the single largest impact on retrieval quality. Test multiple chunking strategies empirically before committing to one.
  • Using only vector search without keyword search. Vector embeddings excel at semantic similarity but can miss exact keyword matches (error codes, product names, specific phrases). Hybrid search with BM25 + vector consistently outperforms either alone.
  • Skipping the reranking step to save latency. Bi-encoder embeddings produce coarse similarity scores. A cross-encoder reranker provides much more accurate relevance scores at the cost of 50-200ms, which is usually an acceptable trade-off for higher answer quality.
  • Not evaluating RAG quality systematically. Without metrics like faithfulness, relevance, and recall (RAGAS framework), you cannot tell whether changes to the pipeline improve or degrade answer quality. Establish an evaluation benchmark before iterating on the pipeline.
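
A starting point for a retrieval benchmark is a small hand-labeled set of questions with known relevant chunks. The sketch below computes plain precision@k and recall@k; it is a simplified, rank-based illustration and not the RAGAS API, and the chunk IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved chunk IDs that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def recall_at_k(retrieved, relevant, k):
    """Share of the labeled relevant chunk IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    retrieved = ["c1", "c4", "c2", "c9"]  # pipeline output for one question
    relevant = {"c1", "c2", "c3"}         # hand-labeled ground truth
    print(round(precision_at_k(retrieved, relevant, 3), 3))
    print(round(recall_at_k(retrieved, relevant, 3), 3))
```

Averaging these over the labeled set before and after a pipeline change (new chunk size, added reranker) shows whether retrieval actually improved. Faithfulness of the generated answer needs a separate check, since it depends on the LLM output rather than on ranking.
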
Test Your Understanding

1. What is the primary purpose of the reranking step in a RAG pipeline?

2. Why is hybrid search (BM25 + vector similarity) recommended over pure vector search in RAG systems?

3. What is the 'faithfulness' metric in the RAGAS evaluation framework?
