Search Engines: How Full-Text Search Actually Works
Your e-commerce site has a search box. Users type “blue running shoes size 10.” Your database query is WHERE description LIKE '%blue running shoes%'. It returns nothing because the description says “navy athletic footwear.” Users leave. They go to a competitor whose search understands that “blue” and “navy” are related, that “running shoes” and “athletic footwear” are synonyms, and that “size 10” should filter results.
This is the gap between database queries and search engines. A database finds exact matches. A search engine finds relevant matches.
How search engines work: the inverted index
The core data structure of every search engine is the inverted index. Instead of storing documents and their contents, an inverted index stores terms and which documents contain them.
Forward index (what a database does):
- Document 1: “blue running shoes”
- Document 2: “navy athletic footwear”
- Document 3: “blue casual sneakers”
Inverted index (what a search engine does):
- “blue” -> [Document 1, Document 3]
- “running” -> [Document 1]
- “shoes” -> [Document 1]
- “navy” -> [Document 2]
- “athletic” -> [Document 2]
- “footwear” -> [Document 2]
- “casual” -> [Document 3]
- “sneakers” -> [Document 3]
To find documents containing “blue shoes”: look up “blue” (Documents 1, 3), look up “shoes” (Document 1), intersect: Document 1.
This is O(1) per term lookup, regardless of how many documents exist. A database LIKE '%blue%' scans every row.
graph TB subgraph indexing["Indexing Pipeline"] DOC["Raw document 'Blue Running Shoes for Men'"] TOK["Tokenizer ['Blue', 'Running', 'Shoes', 'for', 'Men']"] NORM["Normalizer ['blue', 'running', 'shoe', 'man']"] IDX["Inverted Index blue -> [doc1] running -> [doc1] shoe -> [doc1]"] DOC --> TOK --> NORM --> IDX end subgraph query["Query Pipeline"] Q["Query: 'blue shoes'"] QTOK["Tokenize + normalize ['blue', 'shoe']"] LOOKUP["Index lookup blue -> [doc1, doc3] shoe -> [doc1, doc5]"] INTERSECT["Intersect [doc1]"] SCORE["Score and rank doc1: 0.85"] Q --> QTOK --> LOOKUP --> INTERSECT --> SCORE end style IDX fill:#EEEDFE,stroke:#534AB7,color:#3C3489 style SCORE fill:#E1F5EE,stroke:#0F6E56,color:#085041
Text analysis: turning words into index terms
Before indexing, text goes through an analysis pipeline:
Tokenization - Split text into tokens. “Blue running shoes” becomes [“Blue”, “running”, “shoes”]. Handle punctuation, hyphens, special characters.
Lowercasing - “Blue” and “blue” should match. Convert all tokens to lowercase.
Stop word removal - Remove common words that add no search value: “the”, “a”, “is”, “for”. These would appear in almost every document and make the index huge without helping relevance.
Stemming - Reduce words to their root form. “running” -> “run”, “shoes” -> “shoe”, “athletic” -> “athlet”. Now “running” and “runner” match the same documents.
Synonyms - Expand terms to include synonyms. “navy” -> [“navy”, “blue”, “dark blue”]. “footwear” -> [“footwear”, “shoes”, “boots”].
The same analysis pipeline runs on both the indexed documents and the search query. This ensures consistent matching.
Relevance scoring: BM25
Finding documents that contain the search terms is the easy part. Ranking them by relevance is the hard part.
The standard algorithm is BM25 (Best Match 25), which considers:
Term frequency (TF) - How many times does the search term appear in the document? A document mentioning “shoes” 10 times is more relevant than one mentioning it once. But the relationship is sublinear - 10 mentions is not 10x more relevant than 1 mention.
Inverse document frequency (IDF) - How rare is the term across all documents? “shoes” appearing in 90% of documents is less informative than “orthopedic” appearing in 1% of documents. Rare terms are more discriminating.
Document length normalization - A short document with one mention of “shoes” is more relevant than a long document with one mention buried in thousands of words.
BM25 combines these factors into a score. Documents are ranked by score. The top N are returned.
Where it breaks or gets interesting
Phrase queries
The inverted index finds documents containing individual terms. But “New York” should match the phrase, not documents that contain “new” and “york” separately (like “new product from York, England”).
Phrase queries require storing position information in the index: not just which documents contain a term, but where in the document. “New York” as a phrase requires “new” at position N and “york” at position N+1 in the same document.
Fuzzy matching
Users make typos. “sheos” should match “shoes.” Fuzzy matching uses edit distance (Levenshtein distance) to find terms within N edits of the query term. Elasticsearch supports fuzziness: AUTO which automatically determines the allowed edit distance based on term length.
Faceted search
E-commerce search needs facets: “filter by brand,” “filter by price range,” “filter by color.” These are not full-text queries - they are structured filters on indexed fields. Search engines support this through field-level indexing and aggregations.
The freshness problem
New documents need to be indexed before they appear in search results. Indexing is not instant. Elasticsearch uses a refresh interval (default 1 second) to make new documents searchable. During the refresh interval, new documents are not visible.
For near-real-time requirements, reduce the refresh interval. For batch indexing, increase it to improve throughput.
Distributed search
A single search index has limits. Elasticsearch distributes the index across shards. Each shard is an independent Lucene index. Queries run on all shards in parallel and results are merged.
The challenge: relevance scores are computed per-shard. A term that is rare globally might be common in one shard, giving it a lower IDF score on that shard. This can cause relevance inconsistencies. Solutions: use a global IDF (query all shards for IDF statistics before scoring), or use enough data per shard that local statistics approximate global statistics.
graph LR subgraph distributed["Distributed Search"] Q2["Query: 'blue shoes'"] COORD["Coordinator node"] SH1["Shard 1 10M docs"] SH2["Shard 2 10M docs"] SH3["Shard 3 10M docs"] MERGE["Merge and rank top 10 results"] Q2 --> COORD COORD --> SH1 COORD --> SH2 COORD --> SH3 SH1 -->|"top 10 from shard"| MERGE SH2 -->|"top 10 from shard"| MERGE SH3 -->|"top 10 from shard"| MERGE end style COORD fill:#EEEDFE,stroke:#534AB7,color:#3C3489 style SH1 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style SH2 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style SH3 fill:#E1F5EE,stroke:#0F6E56,color:#085041 style MERGE fill:#FAEEDA,stroke:#854F0B,color:#633806
Real-world systems
Elasticsearch - The most widely used search engine. Built on Apache Lucene. Distributed, RESTful API, supports full-text search, structured queries, aggregations, and geospatial queries. Used by GitHub (code search), Wikipedia, Netflix, Uber.
Apache Solr - Also built on Lucene. More mature than Elasticsearch, more complex to configure. Used by many enterprise applications.
Typesense - Modern, fast, typo-tolerant search engine. Simpler to operate than Elasticsearch. Good for product search and autocomplete.
Meilisearch - Open-source, easy to set up, excellent for small to medium datasets. Instant search with typo tolerance.
Algolia - Hosted search-as-a-service. Extremely fast (sub-10ms). Good developer experience. Expensive at scale.
PostgreSQL full-text search - Built-in full-text search using GIN indexes on tsvector columns. Good for moderate search requirements without a separate search service.
How to apply it in practice
When to use a dedicated search engine
Use Elasticsearch or similar when:
- You need full-text search with relevance ranking
- You need faceted search (filters + aggregations)
- You need fuzzy matching and typo tolerance
- Your search index is large (millions of documents)
- You need near-real-time indexing
Use PostgreSQL full-text search when:
- Your dataset is small to medium (under 1 million documents)
- You want to avoid operational complexity of a separate service
- Your search requirements are basic (no complex relevance tuning)
The sync problem
Your primary data is in PostgreSQL. Your search index is in Elasticsearch. They must stay in sync. Options:
Dual write - Write to both on every change. Risk: partial failures leave them out of sync.
Change data capture (CDC) - Stream database changes (via Debezium reading PostgreSQL WAL) to Elasticsearch. Reliable, eventually consistent.
Periodic re-index - Rebuild the search index from the database on a schedule. Simple but introduces lag.
Event-driven - Publish events on data changes, consume them to update the search index. Decoupled but requires an event bus.
Relevance tuning
Default BM25 relevance is a starting point. Tune it for your use case:
- Boost fields - Title matches are more relevant than body matches. Boost the title field.
- Boost recent documents - For news or products, newer items should rank higher.
- Synonyms - Add domain-specific synonyms (product names, abbreviations).
- Stopwords - Add domain-specific stopwords (your product name should not be a stopword).
- A/B test - Measure click-through rate and conversion rate for different relevance configurations.
FAQ
Q: What is the difference between search and filtering?
Filtering is exact matching: WHERE price < 100 AND category = 'shoes'. It is deterministic - a document either matches or it does not. Search is relevance matching: find documents most relevant to “comfortable running shoes.” It is probabilistic - documents are ranked by relevance score. In practice, search engines combine both: full-text search for relevance ranking, plus structured filters for exact matching. Elasticsearch calls these “query” (relevance) and “filter” (exact match, cached).
Q: How does autocomplete work?
Autocomplete requires a different index than full-text search. Common approaches: edge n-gram tokenization (index “shoe” as “s”, “sh”, “sho”, “shoe” so any prefix matches), completion suggester (Elasticsearch has a dedicated data structure for this), or a separate prefix index. The key requirement is speed: autocomplete must return results in under 50ms. This often means a separate, smaller index optimized for prefix lookups, not the full search index.
Q: How do you handle search in multiple languages?
Each language needs its own analyzer: different tokenization rules, different stop words, different stemming algorithms. Elasticsearch supports language-specific analyzers (english, french, german, etc.). For multilingual content, either store the language of each document and use the appropriate analyzer, or use a language-agnostic analyzer (ICU tokenizer) that handles multiple languages. For search queries, detect the query language and use the appropriate analyzer.
Interview questions
Q1: Design the search feature for an e-commerce site with 10 million products. Users search by name, description, and filter by category, price, and brand.
Strong answer: Use Elasticsearch. Index each product as a document with fields: name (text, boosted), description (text), category (keyword, for filtering), brand (keyword, for filtering and facets), price (float, for range filtering and sorting), tags (keyword array). Use a custom analyzer with synonyms for the product domain. For the query: combine a multi-match query on name and description (relevance) with term filters on category and brand (exact match, cached). Use aggregations for facet counts (how many products in each category, brand). For autocomplete: a separate completion suggester index on product names. For relevance tuning: boost name matches over description matches, boost products with higher sales rank. Sync from PostgreSQL using CDC (Debezium) to keep the index up to date.
Q2: Your Elasticsearch cluster is slow. Queries that used to take 50ms now take 2 seconds. What do you investigate?
Strong answer: Start with the slow log and hot threads API to identify which queries are slow and what the cluster is doing. Check cluster health: are all shards assigned? Any red or yellow status? Check JVM heap usage: if heap is above 75%, GC pressure is causing pauses. Check shard count: too many small shards (over 20 per GB of heap) causes overhead. Check if queries are doing expensive operations: wildcard queries, script scoring, large aggregations. Check if the index has too many fields (mapping explosion). Check disk I/O: if the OS page cache is not warm (after a restart), queries are slow until data is cached. For immediate relief: increase JVM heap, reduce shard count by merging small shards, add replicas to distribute query load. Long-term: optimize queries, tune the data model, add more nodes.
Q3: How would you implement “did you mean” (spell correction) in a search engine?
Strong answer: Spell correction requires knowing what the correct terms are. Two approaches. First, dictionary-based: maintain a dictionary of valid terms (from your product catalog, user queries that returned results). When a query term is not in the dictionary, find the closest term by edit distance (Levenshtein). Elasticsearch’s suggest API does this. Second, frequency-based: collect all search queries and their result counts. If “sheos” returns 0 results but “shoes” returns 10,000, suggest “shoes.” This is how Google’s “did you mean” works - it is based on what other users searched for, not a dictionary. For implementation: use Elasticsearch’s phrase suggester for context-aware suggestions (it considers the surrounding words, not just the individual term). Show the suggestion only when the original query returns few or no results.