Search Engines (Elasticsearch, OpenSearch)

Search engines like Elasticsearch and OpenSearch use inverted indexes to provide full-text search, relevance scoring, faceted filtering, and near-real-time analytics over large document collections. Built on Apache Lucene, they power search experiences from code search to log analysis.

Overview

Search engines are specialized databases optimized for full-text search, relevance ranking, and analytical aggregations over large document collections. Unlike relational databases that store data in rows and retrieve it by primary key or indexed column values, search engines build inverted indexes -- data structures that map every unique term to the list of documents containing that term. This inverted index enables sub-second lookups across billions of documents for queries like 'find all documents containing the phrase distributed consensus algorithm,' something that would require a prohibitively slow full-table scan in a relational database.
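
The term-to-documents mapping described above can be sketched in a few lines. This is a toy illustration, not how Lucene stores its index: the whitespace tokenizer, the `build_inverted_index` and `phrase_search` helpers, and the sample documents are all assumptions made for the example. Positions are recorded so that a phrase query can check that the terms appear consecutively.

```python
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map each term to {doc_id: [positions]} so phrase queries can check adjacency."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_search(index: dict[str, dict[int, list[int]]], phrase: str) -> list[int]:
    """Return IDs of documents that contain the terms of `phrase` consecutively."""
    terms = tokenize(phrase)
    if not terms or any(term not in index for term in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at position p if term i appears at position p + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "Raft is a distributed consensus algorithm",
    2: "consensus in distributed systems needs an algorithm",
    3: "Paxos is another distributed consensus algorithm",
}
index = build_inverted_index(docs)
print(phrase_search(index, "distributed consensus algorithm"))  # [1, 3]
```

Document 2 contains all three terms but not next to each other, so only the positional check excludes it. No document text is scanned at query time; only the postings for the three query terms are read.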

Elasticsearch, the dominant search engine in the industry, is built on Apache Lucene -- a high-performance, full-featured text search library written in Java. Lucene provides the core indexing and search capabilities: inverted indexes, BM25 relevance scoring (a probabilistic model that ranks documents by term frequency, inverse document frequency, and field length), analyzers that tokenize text into searchable terms, and efficient query execution. Elasticsearch wraps Lucene with a distributed architecture: data is divided into shards (each shard is a Lucene index), shards are distributed across nodes in a cluster, and each shard can have replicas for fault tolerance and read scaling. This architecture enables Elasticsearch to scale horizontally to handle datasets far larger than a single machine's capacity.
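
To make the three BM25 ingredients concrete, here is a minimal sketch of the textbook BM25 formula. The `bm25_score` function and the tiny corpus are hypothetical; `k1=1.2` and `b=0.75` are the usual defaults, and Lucene's own implementation differs in details such as constant factors and how field lengths are encoded, so treat the numbers as illustrative only.

```python
import math


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query with textbook BM25."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        # Rare terms get a higher inverse document frequency.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score


corpus = [
    "distributed consensus with raft".split(),
    "a long survey of many distributed systems topics that mentions consensus once".split(),
    "cooking pasta at home".split(),
]
scores = [bm25_score(["consensus"], doc, corpus) for doc in corpus]
print(scores)
```

Both of the first two documents mention "consensus" once, but the shorter one scores higher because of field length normalization, and the third scores zero because it does not contain the term.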

The indexing pipeline in Elasticsearch transforms raw text into searchable terms through analyzers, which consist of character filters (e.g., stripping HTML tags), a tokenizer (splitting text into individual terms), and token filters (lowercasing, stemming, removing stop words, adding synonyms). The choice of analyzer determines what users can find: the standard analyzer splits text on word boundaries (dropping most punctuation) and lowercases it, while a language-specific analyzer also applies stemming (reducing 'running' and 'runs' to 'run') and removes language-specific stop words. Custom analyzers enable domain-specific search -- for example, a code search analyzer that preserves underscores and dots as part of tokens rather than splitting on them.
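
The three analyzer stages can be sketched as a chain of plain functions. Everything here is an assumption for illustration: the function names, the tiny stop-word list, and especially `naive_stem`, which is a toy suffix stripper and not the stemmer any real language analyzer uses.

```python
import re


def strip_html(text: str) -> str:
    """Character filter: replace HTML tags with spaces."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> list[str]:
    """Tokenizer: split into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: list[str]) -> list[str]:
    """Token filter: normalize case."""
    return [t.lower() for t in tokens]


STOP_WORDS = frozenset({"a", "an", "the", "is", "and", "of"})


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(tokens: list[str]) -> list[str]:
    """Token filter: toy stemmer that strips a few English suffixes."""
    stemmed = []
    for token in tokens:
        for suffix in ("ing", "es", "s"):
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                # Undo consonant doubling: "runn" -> "run".
                if len(token) >= 2 and token[-1] == token[-2]:
                    token = token[:-1]
                break
        stemmed.append(token)
    return stemmed


def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through character filters, one tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<p>The engine is <b>Running</b> searches</p>",
    char_filters=[strip_html],
    tokenizer=word_tokenizer,
    token_filters=[lowercase, remove_stop_words, naive_stem],
)
print(terms)  # ['engine', 'run', 'search']
```

The same chain has to be applied to query text at search time; otherwise a query for "Running" would never match the indexed term "run".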

Elasticsearch's near-real-time (NRT) indexing means that, by default, documents become searchable within about 1 second of being indexed (configurable via the refresh_interval setting). This is achieved by periodically turning the in-memory indexing buffer into a new, searchable Lucene segment without performing a full commit to disk. For log analysis and observability use cases, this near-real-time behavior means that errors and anomalies are searchable almost immediately after they occur. Elasticsearch also provides a powerful aggregations framework for analytics: terms aggregations (equivalent to GROUP BY), date histogram aggregations (time-series bucketing), percentile aggregations, and nested aggregations that enable multi-dimensional analysis without pre-computation. This combination of full-text search and near-real-time analytics is why Elasticsearch powers both user-facing search experiences and operational dashboards.
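
The refresh behavior can be modeled with a buffer and a list of searchable segments. This is a simplified sketch under stated assumptions: the `NearRealTimeIndex` class is hypothetical, time is passed in explicitly to keep the example deterministic, and real refreshes run on a background schedule rather than being triggered by reads and writes.

```python
class NearRealTimeIndex:
    """Toy model: documents sit in a buffer until a refresh makes them searchable."""

    def __init__(self, refresh_interval: float = 1.0):
        self.refresh_interval = refresh_interval
        self._buffer: list[str] = []          # indexed but not yet searchable
        self._segments: list[list[str]] = []  # searchable, immutable batches
        self._last_refresh = 0.0

    def index(self, doc: str, now: float) -> None:
        self._maybe_refresh(now)
        self._buffer.append(doc)

    def search(self, term: str, now: float) -> list[str]:
        self._maybe_refresh(now)
        # Only documents in segments are visible to searches.
        return [doc for seg in self._segments for doc in seg if term in doc.split()]

    def refresh(self, now: float) -> None:
        """Turn the current buffer into a new searchable segment."""
        if self._buffer:
            self._segments.append(self._buffer)
            self._buffer = []
        self._last_refresh = now

    def _maybe_refresh(self, now: float) -> None:
        if now - self._last_refresh >= self.refresh_interval:
            self.refresh(now)


idx = NearRealTimeIndex(refresh_interval=1.0)
idx.index("error timeout in checkout", now=0.2)
before = idx.search("timeout", now=0.5)  # still inside the refresh window
after = idx.search("timeout", now=1.3)   # a refresh has happened by now
print(before, after)
```

A shorter `refresh_interval` narrows the window in which `before` is empty, at the cost of creating more, smaller segments.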

Key Points
  1. Inverted indexes map every unique term to the list of documents containing it, so the cost of finding a term depends on the term dictionary rather than on scanning the corpus. A search for 'distributed' looks up one entry in the inverted index and reads its list of matching document IDs, rather than scanning every document.
  2. BM25 relevance scoring ranks documents by combining term frequency (how often the term appears in the document), inverse document frequency (how rare the term is across all documents), and field length normalization. More relevant documents appear first, unlike SQL LIKE, which returns unranked matches.
  3. Analyzers control how text is tokenized and normalized for indexing. The standard analyzer splits on word boundaries and lowercases; language analyzers apply stemming and stop-word removal; custom analyzers handle domain-specific text like source code, medical terminology, or product SKUs.
  4. Shards are the unit of distribution: each shard is an independent Lucene index that can be placed on any node. The number of primary shards cannot be changed in place after index creation (changing it means reindexing or using the split or shrink APIs), so shard count must be planned for maximum expected data size. Each primary shard can have replicas for fault tolerance and read throughput.
  5. Near-real-time indexing makes documents searchable within about 1 second by default by creating new Lucene segments from the in-memory buffer. The refresh_interval controls this trade-off -- shorter intervals mean fresher search results but higher indexing overhead.
  6. Aggregations provide SQL-like analytics (GROUP BY, COUNT, AVG, percentiles) directly on indexed data without a separate analytics database. Composite aggregations enable pagination through high-cardinality results, and pipeline aggregations compute derivatives and moving averages over time-series data.
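
To show what the aggregation types in the last point compute, here is a plain-Python sketch over a handful of log records. The helper names and the sample `logs` are assumptions made for the example; this mirrors the results of a terms aggregation, a per-minute date histogram, and a nested average, not the Elasticsearch query syntax.

```python
from collections import Counter, defaultdict
from datetime import datetime

logs = [
    {"ts": "2024-01-01T10:00:05", "service": "checkout", "latency_ms": 120},
    {"ts": "2024-01-01T10:00:40", "service": "checkout", "latency_ms": 180},
    {"ts": "2024-01-01T10:01:10", "service": "search", "latency_ms": 40},
    {"ts": "2024-01-01T10:01:30", "service": "checkout", "latency_ms": 300},
]


def terms_agg(docs, field):
    """Like SELECT field, COUNT(*) ... GROUP BY field."""
    return Counter(doc[field] for doc in docs)


def date_histogram_by_minute(docs, field):
    """Bucket documents by the minute their timestamp falls into."""
    buckets = Counter()
    for doc in docs:
        minute = datetime.fromisoformat(doc[field]).replace(second=0, microsecond=0)
        buckets[minute.isoformat()] += 1
    return buckets


def avg_by_term(docs, group_field, value_field):
    """Nested aggregation: average of value_field inside each group_field bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}


by_service = terms_agg(logs, "service")
per_minute = date_histogram_by_minute(logs, "ts")
avg_latency = avg_by_term(logs, "service", "latency_ms")
print(by_service, per_minute, avg_latency)
```

Note that every distinct value of the grouping field becomes a bucket held in memory, which is why the cardinality of that field matters.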
Simple Example

The Book Index Analogy

The index at the back of a textbook is an inverted index. Instead of reading every page to find where 'recursion' is discussed, you look up 'recursion' in the index and find 'pages 42, 87, 153.' The index maps terms to locations, just like Elasticsearch's inverted index maps terms to document IDs. Now imagine the index also ranks results by relevance -- 'recursion: main discussion p.42, brief mention p.87, footnote p.153.' That ranking is what BM25 scoring does: it puts the most relevant documents first based on how prominent and specific the term is in each document.

Real-World Examples

GitHub

GitHub has used Elasticsearch to power code search across hundreds of millions of repositories (its newer code search, launched in 2023, runs on a custom-built engine). When a developer searches for a function name or error message, the query runs against an inverted index built from source code files. A custom analyzer preserves code-specific tokens (like method names with dots and underscores) rather than splitting them, and uses language-specific tokenizers for different programming languages. The search cluster handles millions of queries per day with sub-second response times.

Wikipedia

Wikipedia uses Elasticsearch (via the CirrusSearch extension) for full-text search across 60+ million articles in 300+ languages. Each article is indexed with language-specific analyzers that handle stemming, diacritics, and script-specific tokenization. Search results are ranked by BM25 relevance combined with custom boosting factors like article popularity and recency. The system handles thousands of search queries per second while continuously re-indexing article updates.

Netflix

Netflix uses Elasticsearch for centralized log analysis and error investigation across their microservices architecture. Every log line, error trace, and metric event from thousands of microservices is indexed in Elasticsearch, enabling engineers to search for specific error patterns across the entire infrastructure in seconds. Kibana dashboards provide real-time visualization of error rates, latency distributions, and deployment health, with the aggregations framework computing percentiles and anomaly detection in real time.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Search Capability vs Data Freshness | Elasticsearch provides near-real-time search (1-second default refresh interval), not real-time. During the refresh window, newly indexed documents are not yet searchable. Reducing refresh_interval increases freshness but adds CPU and I/O overhead. For use cases requiring immediate searchability, consider using the refresh=true parameter on individual index requests (at significant performance cost). |
| Relevance Quality vs Index Complexity | Better search relevance requires more sophisticated analyzers, synonym dictionaries, custom boosting logic, and relevance tuning, all of which add index size, indexing latency, and maintenance burden. A standard analyzer with BM25 provides good-enough search for many use cases. Investing in custom relevance is worthwhile only when search quality directly impacts business metrics. |
| Horizontal Scalability vs Operational Complexity | Elasticsearch scales by adding shards and nodes, but shard management is complex: too few shards limit parallelism, too many create overhead (each shard consumes memory and file descriptors). The primary shard count is fixed at index creation and can only be changed by reindexing or by the split and shrink APIs, so capacity planning must anticipate growth. Index lifecycle management (ILM) automates rollover and deletion but adds configuration complexity. |
| Search Engine vs Primary Database | Elasticsearch is not designed as a primary database. It lacks ACID transactions, has weaker durability guarantees than a relational database (for example, with index.translog.durability set to async, an acknowledged write can be lost if a node fails before the translog is fsync'd), and does not support relational constraints. Use it as a secondary index alongside a primary data store, with a reindexing pipeline to rebuild the index if needed. |
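
The reason the primary shard count is effectively fixed follows from how a document is assigned to a shard: a hash of its routing key (the document ID by default) taken modulo the number of primary shards. The sketch below is a simplified model of that idea. The `shard_for` helper is hypothetical and uses MD5 only as a convenient, deterministic stand-in for the hash Elasticsearch actually uses.

```python
import hashlib


def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard as hash(routing_key) mod shard count (simplified model)."""
    digest = hashlib.md5(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(10_000)]

# If the shard count changed from 5 to 6 in place, most documents would be
# looked up on a different shard than the one they were written to.
moved = sum(shard_for(doc_id, 5) != shard_for(doc_id, 6) for doc_id in doc_ids)
print(f"{moved / len(doc_ids):.0%} of documents would map to a different shard")
```

Because the modulus is baked into where every existing document lives, changing it means rewriting the data, which is what reindexing does.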
Case Study

GitHub Code Search -- Searching 200 Billion Lines of Code

Scenario

GitHub needed to provide instant code search across hundreds of millions of repositories containing over 200 billion lines of code. Developers expect to search for function names, error messages, and code patterns across the entire public codebase and receive results in under a second. A relational database's LIKE query would take hours to scan this volume. The search system needed language-aware tokenization (treating 'getUserById' as 'get', 'user', 'by', 'id' for camelCase search) and real-time indexing of new code pushes.

Solution

GitHub built their code search on Elasticsearch with custom analyzers designed for source code. A code-specific tokenizer splits on camelCase, snake_case, and dot notation while preserving the original token. Multiple analyzer passes create both exact-match and fuzzy-match entries in the inverted index. Sharding is designed around repository size, with large repositories getting dedicated shards. A CDC (Change Data Capture) pipeline continuously indexes new commits within seconds of a git push, and bulk reindexing processes rebuild the entire index periodically to incorporate analyzer improvements.
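
A tokenizer of the kind described above can be sketched with two regular expressions. This is an illustrative guess at the technique, not GitHub's implementation: `sub_tokens`, `index_tokens`, and `matches` are hypothetical helpers, and the matching rule (every query sub-token must appear among the identifier's sub-tokens) is an assumption chosen to keep the example small.

```python
import re


def sub_tokens(identifier: str) -> list[str]:
    """Split on dots, underscores, and camelCase boundaries; lowercase the parts."""
    parts: list[str] = []
    for chunk in re.split(r"[._]+", identifier):
        # Acronyms ("HTTP"), capitalized or lowercase words ("Server", "get"), digits.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [part.lower() for part in parts]


def index_tokens(identifier: str) -> list[str]:
    """Tokens to index: the original identifier first, then its sub-tokens."""
    tokens = [identifier]
    for token in sub_tokens(identifier):
        if token not in tokens:
            tokens.append(token)
    return tokens


def matches(query: str, identifier: str) -> bool:
    """True if every sub-token of the query appears in the identifier."""
    available = set(sub_tokens(identifier))
    return all(token in available for token in sub_tokens(query))


print(index_tokens("getUserById"))          # ['getUserById', 'get', 'user', 'by', 'id']
print(index_tokens("HTTPServer.parseURL"))  # ['HTTPServer.parseURL', 'http', 'server', 'parse', 'url']
print(matches("getUser", "get_user_by_id"))  # True
```

Keeping the original identifier as its own token is what allows exact-match queries to rank above partial matches.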

Outcome

GitHub's code search returns results across 200+ billion lines of code in under 500ms for most queries. Language-aware tokenization dramatically improved search relevance -- searching for 'getUser' now matches 'getUserById' and 'get_user_by_id.' The near-real-time indexing pipeline ensures that newly pushed code is searchable within seconds. The system handles millions of search queries daily while continuously indexing hundreds of thousands of new commits per hour.

Common Mistakes
  • Using Elasticsearch as a primary database. Elasticsearch does not provide ACID transactions, has weaker durability than relational databases, and does not support primary key constraints or referential integrity. Always maintain a source-of-truth database and treat Elasticsearch as a secondary index that can be rebuilt.
  • Not planning shard count for growth. The number of primary shards is fixed at index creation. Starting with 1 shard and later exceeding the recommended shard size (roughly 50 GB) means reindexing all data or using the split index API, which requires making the index read-only first. Plan for 20-40 GB per shard and estimate your maximum expected data size.
  • Using default analyzers for domain-specific content. The standard analyzer works for natural language but produces poor results for source code, product SKUs, email addresses, and other structured text. Invest in custom analyzers that match your content type.
  • Running unbounded aggregations on high-cardinality fields. An aggregation like 'terms on user_id' across millions of unique users consumes enormous memory and can crash nodes. Use composite aggregations for pagination or pre-aggregate data before indexing.
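
The shard-planning advice above comes down to simple arithmetic. The sketch below assumes a 30 GB target (the midpoint of the 20-40 GB range mentioned earlier) and a linear growth projection; both function names and the sample numbers are hypothetical.

```python
import math


def projected_size_gb(current_gb: float, monthly_growth_gb: float, months: int) -> float:
    """Linear projection of index size over a planning horizon."""
    return current_gb + monthly_growth_gb * months


def primary_shards_needed(max_data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest shard count that keeps each shard at or under the target size."""
    if max_data_gb < 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be non-negative and the target must be positive")
    return max(1, math.ceil(max_data_gb / target_shard_gb))


size = projected_size_gb(current_gb=120, monthly_growth_gb=40, months=24)
print(size, primary_shards_needed(size))  # 1080 36
```

For time-based data such as logs, rolling over to a new index periodically is a common alternative to sizing one index for its entire lifetime.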
Test Your Understanding

1. What is the primary data structure that makes full-text search efficient in Elasticsearch?

2. Why should Elasticsearch NOT be used as a primary database?

3. What does the 'near-real-time' indexing behavior in Elasticsearch mean?
