Search Engines: How Full-Text Search Actually Works


Your e-commerce site has a search box. Users type “blue running shoes size 10.” Your database query is WHERE description LIKE '%blue running shoes%'. It returns nothing because the description says “navy athletic footwear.” Users leave. They go to a competitor whose search understands that “blue” and “navy” are related, that “running shoes” and “athletic footwear” are synonyms, and that “size 10” should filter results.

This is the gap between database queries and search engines. A database finds exact matches. A search engine finds relevant matches.

How search engines work: the inverted index

The core data structure of every search engine is the inverted index. Instead of storing documents and their contents, an inverted index stores terms and which documents contain them.

Forward index (what a database does):

  • Document 1: “blue running shoes”
  • Document 2: “navy athletic footwear”
  • Document 3: “blue casual sneakers”

Inverted index (what a search engine does):

  • “blue” -> [Document 1, Document 3]
  • “running” -> [Document 1]
  • “shoes” -> [Document 1]
  • “navy” -> [Document 2]
  • “athletic” -> [Document 2]
  • “footwear” -> [Document 2]
  • “casual” -> [Document 3]
  • “sneakers” -> [Document 3]

To find documents containing “blue shoes”: look up “blue” (Documents 1, 3), look up “shoes” (Document 1), intersect: Document 1.

This is O(1) per term lookup, regardless of how many documents exist. A database LIKE '%blue%' scans every row.

graph TB
subgraph indexing["Indexing Pipeline"]
  DOC["Raw document
'Blue Running Shoes for Men'"]
  TOK["Tokenizer
['Blue', 'Running', 'Shoes', 'for', 'Men']"]
  NORM["Normalizer
['blue', 'running', 'shoe', 'man']"]
  IDX["Inverted Index
blue -> [doc1]
running -> [doc1]
shoe -> [doc1]"]
  DOC --> TOK --> NORM --> IDX
end

subgraph query["Query Pipeline"]
  Q["Query: 'blue shoes'"]
  QTOK["Tokenize + normalize
['blue', 'shoe']"]
  LOOKUP["Index lookup
blue -> [doc1, doc3]
shoe -> [doc1, doc5]"]
  INTERSECT["Intersect
[doc1]"]
  SCORE["Score and rank
doc1: 0.85"]
  Q --> QTOK --> LOOKUP --> INTERSECT --> SCORE
end

style IDX fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style SCORE fill:#E1F5EE,stroke:#0F6E56,color:#085041

Text analysis: turning words into index terms

Before indexing, text goes through an analysis pipeline:

Tokenization - Split text into tokens. “Blue running shoes” becomes [“Blue”, “running”, “shoes”]. Handle punctuation, hyphens, special characters.

Lowercasing - “Blue” and “blue” should match. Convert all tokens to lowercase.

Stop word removal - Remove common words that add no search value: “the”, “a”, “is”, “for”. These would appear in almost every document and make the index huge without helping relevance.

Stemming - Reduce words to their root form. “running” -> “run”, “shoes” -> “shoe”, “athletic” -> “athlet”. Now “running” and “runner” match the same documents.

Synonyms - Expand terms to include synonyms. “navy” -> [“navy”, “blue”, “dark blue”]. “footwear” -> [“footwear”, “shoes”, “boots”].

The same analysis pipeline runs on both the indexed documents and the search query. This ensures consistent matching.

Relevance scoring: BM25

Finding documents that contain the search terms is the easy part. Ranking them by relevance is the hard part.

The standard algorithm is BM25 (Best Match 25), which considers:

Term frequency (TF) - How many times does the search term appear in the document? A document mentioning “shoes” 10 times is more relevant than one mentioning it once. But the relationship is sublinear - 10 mentions is not 10x more relevant than 1 mention.

Inverse document frequency (IDF) - How rare is the term across all documents? “shoes” appearing in 90% of documents is less informative than “orthopedic” appearing in 1% of documents. Rare terms are more discriminating.

Document length normalization - A short document with one mention of “shoes” is more relevant than a long document with one mention buried in thousands of words.

BM25 combines these factors into a score. Documents are ranked by score. The top N are returned.

Where it breaks or gets interesting

Phrase queries

The inverted index finds documents containing individual terms. But “New York” should match the phrase, not documents that contain “new” and “york” separately (like “new product from York, England”).

Phrase queries require storing position information in the index: not just which documents contain a term, but where in the document. “New York” as a phrase requires “new” at position N and “york” at position N+1 in the same document.

Fuzzy matching

Users make typos. “sheos” should match “shoes.” Fuzzy matching uses edit distance (Levenshtein distance) to find terms within N edits of the query term. Elasticsearch supports fuzziness: AUTO which automatically determines the allowed edit distance based on term length.

E-commerce search needs facets: “filter by brand,” “filter by price range,” “filter by color.” These are not full-text queries - they are structured filters on indexed fields. Search engines support this through field-level indexing and aggregations.

The freshness problem

New documents need to be indexed before they appear in search results. Indexing is not instant. Elasticsearch uses a refresh interval (default 1 second) to make new documents searchable. During the refresh interval, new documents are not visible.

For near-real-time requirements, reduce the refresh interval. For batch indexing, increase it to improve throughput.

A single search index has limits. Elasticsearch distributes the index across shards. Each shard is an independent Lucene index. Queries run on all shards in parallel and results are merged.

The challenge: relevance scores are computed per-shard. A term that is rare globally might be common in one shard, giving it a lower IDF score on that shard. This can cause relevance inconsistencies. Solutions: use a global IDF (query all shards for IDF statistics before scoring), or use enough data per shard that local statistics approximate global statistics.

graph LR
subgraph distributed["Distributed Search"]
  Q2["Query: 'blue shoes'"]
  COORD["Coordinator node"]
  SH1["Shard 1
10M docs"]
  SH2["Shard 2
10M docs"]
  SH3["Shard 3
10M docs"]
  MERGE["Merge and rank
top 10 results"]
  Q2 --> COORD
  COORD --> SH1
  COORD --> SH2
  COORD --> SH3
  SH1 -->|"top 10 from shard"| MERGE
  SH2 -->|"top 10 from shard"| MERGE
  SH3 -->|"top 10 from shard"| MERGE
end

style COORD fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style SH1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style SH2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style SH3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style MERGE fill:#FAEEDA,stroke:#854F0B,color:#633806

Real-world systems

Elasticsearch - The most widely used search engine. Built on Apache Lucene. Distributed, RESTful API, supports full-text search, structured queries, aggregations, and geospatial queries. Used by GitHub (code search), Wikipedia, Netflix, Uber.

Apache Solr - Also built on Lucene. More mature than Elasticsearch, more complex to configure. Used by many enterprise applications.

Typesense - Modern, fast, typo-tolerant search engine. Simpler to operate than Elasticsearch. Good for product search and autocomplete.

Meilisearch - Open-source, easy to set up, excellent for small to medium datasets. Instant search with typo tolerance.

Algolia - Hosted search-as-a-service. Extremely fast (sub-10ms). Good developer experience. Expensive at scale.

PostgreSQL full-text search - Built-in full-text search using GIN indexes on tsvector columns. Good for moderate search requirements without a separate search service.

How to apply it in practice

When to use a dedicated search engine

Use Elasticsearch or similar when:

  • You need full-text search with relevance ranking
  • You need faceted search (filters + aggregations)
  • You need fuzzy matching and typo tolerance
  • Your search index is large (millions of documents)
  • You need near-real-time indexing

Use PostgreSQL full-text search when:

  • Your dataset is small to medium (under 1 million documents)
  • You want to avoid operational complexity of a separate service
  • Your search requirements are basic (no complex relevance tuning)

The sync problem

Your primary data is in PostgreSQL. Your search index is in Elasticsearch. They must stay in sync. Options:

Dual write - Write to both on every change. Risk: partial failures leave them out of sync.

Change data capture (CDC) - Stream database changes (via Debezium reading PostgreSQL WAL) to Elasticsearch. Reliable, eventually consistent.

Periodic re-index - Rebuild the search index from the database on a schedule. Simple but introduces lag.

Event-driven - Publish events on data changes, consume them to update the search index. Decoupled but requires an event bus.

Relevance tuning

Default BM25 relevance is a starting point. Tune it for your use case:

  • Boost fields - Title matches are more relevant than body matches. Boost the title field.
  • Boost recent documents - For news or products, newer items should rank higher.
  • Synonyms - Add domain-specific synonyms (product names, abbreviations).
  • Stopwords - Add domain-specific stopwords (your product name should not be a stopword).
  • A/B test - Measure click-through rate and conversion rate for different relevance configurations.

FAQ

Q: What is the difference between search and filtering?

Filtering is exact matching: WHERE price < 100 AND category = 'shoes'. It is deterministic - a document either matches or it does not. Search is relevance matching: find documents most relevant to “comfortable running shoes.” It is probabilistic - documents are ranked by relevance score. In practice, search engines combine both: full-text search for relevance ranking, plus structured filters for exact matching. Elasticsearch calls these “query” (relevance) and “filter” (exact match, cached).

Q: How does autocomplete work?

Autocomplete requires a different index than full-text search. Common approaches: edge n-gram tokenization (index “shoe” as “s”, “sh”, “sho”, “shoe” so any prefix matches), completion suggester (Elasticsearch has a dedicated data structure for this), or a separate prefix index. The key requirement is speed: autocomplete must return results in under 50ms. This often means a separate, smaller index optimized for prefix lookups, not the full search index.

Q: How do you handle search in multiple languages?

Each language needs its own analyzer: different tokenization rules, different stop words, different stemming algorithms. Elasticsearch supports language-specific analyzers (english, french, german, etc.). For multilingual content, either store the language of each document and use the appropriate analyzer, or use a language-agnostic analyzer (ICU tokenizer) that handles multiple languages. For search queries, detect the query language and use the appropriate analyzer.

Interview questions

Q1: Design the search feature for an e-commerce site with 10 million products. Users search by name, description, and filter by category, price, and brand.

Strong answer: Use Elasticsearch. Index each product as a document with fields: name (text, boosted), description (text), category (keyword, for filtering), brand (keyword, for filtering and facets), price (float, for range filtering and sorting), tags (keyword array). Use a custom analyzer with synonyms for the product domain. For the query: combine a multi-match query on name and description (relevance) with term filters on category and brand (exact match, cached). Use aggregations for facet counts (how many products in each category, brand). For autocomplete: a separate completion suggester index on product names. For relevance tuning: boost name matches over description matches, boost products with higher sales rank. Sync from PostgreSQL using CDC (Debezium) to keep the index up to date.

Q2: Your Elasticsearch cluster is slow. Queries that used to take 50ms now take 2 seconds. What do you investigate?

Strong answer: Start with the slow log and hot threads API to identify which queries are slow and what the cluster is doing. Check cluster health: are all shards assigned? Any red or yellow status? Check JVM heap usage: if heap is above 75%, GC pressure is causing pauses. Check shard count: too many small shards (over 20 per GB of heap) causes overhead. Check if queries are doing expensive operations: wildcard queries, script scoring, large aggregations. Check if the index has too many fields (mapping explosion). Check disk I/O: if the OS page cache is not warm (after a restart), queries are slow until data is cached. For immediate relief: increase JVM heap, reduce shard count by merging small shards, add replicas to distribute query load. Long-term: optimize queries, tune the data model, add more nodes.

Q3: How would you implement “did you mean” (spell correction) in a search engine?

Strong answer: Spell correction requires knowing what the correct terms are. Two approaches. First, dictionary-based: maintain a dictionary of valid terms (from your product catalog, user queries that returned results). When a query term is not in the dictionary, find the closest term by edit distance (Levenshtein). Elasticsearch’s suggest API does this. Second, frequency-based: collect all search queries and their result counts. If “sheos” returns 0 results but “shoes” returns 10,000, suggest “shoes.” This is how Google’s “did you mean” works - it is based on what other users searched for, not a dictionary. For implementation: use Elasticsearch’s phrase suggester for context-aware suggestions (it considers the surrounding words, not just the individual term). Show the suggestion only when the original query returns few or no results.