Vector Databases: The Storage Engine Behind Semantic Search


You embed 10 million product descriptions into 1536-dimensional vectors. Now a user searches for “comfortable work-from-home chair with lumbar support.” You embed the query, and you need to find the 10 most similar vectors out of 10 million. Brute force means computing cosine similarity against all 10 million vectors: 10 million dot products of 1536-dimensional vectors, or roughly 15 billion multiply-adds, per query. At 100 queries per second, you need 1 billion dot products per second. On a single CPU core running unoptimized code, one query can take on the order of 15 seconds. Your users are not waiting 15 seconds.
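To make the brute-force baseline concrete, here is a minimal sketch in pure Python. The corpus is a toy stand-in (1,000 random 16-dimensional vectors, an assumption for illustration), and the function names are hypothetical rather than any library's API.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine_similarity(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)


# Toy corpus: 1,000 vectors of 16 dims stand in for 10M vectors of 1536 dims.
rng = random.Random(0)
corpus = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(1000)]
query = corpus[42]
top = brute_force_top_k(query, corpus, k=10)  # list of (score, index) pairs

# Work per query at the article's scale: one multiply-add per dimension per vector.
multiply_adds_per_query = 10_000_000 * 1536  # 15,360,000,000
```

The cost grows linearly with both corpus size and dimensionality, which is the scaling that ANN indexes are designed to avoid.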

Vector databases solve this by trading a small amount of accuracy for massive speed improvements. Instead of checking every vector, they use approximate nearest neighbor (ANN) algorithms that check only a tiny fraction of vectors and still typically recover 95-99% of the true nearest neighbors (recall) - in 5-50 milliseconds instead of 15 seconds.

What a vector database actually is

A vector database is a specialized database optimized for storing, indexing, and querying high-dimensional vectors. Unlike traditional databases that index on exact values (B-trees for integers, inverted indexes for text), vector databases index on similarity - finding vectors that are “close” in high-dimensional space.

Core operations:

  • Upsert: Store a vector with an ID and optional metadata
  • Search: Given a query vector, find the k most similar stored vectors
  • Filter: Combine vector similarity with metadata filters (e.g., “similar vectors where category = ‘furniture’”)
  • Delete: Remove vectors by ID or filter
graph TD
  subgraph vdb["Vector Database"]
      IDX["ANN Index<br/>(HNSW, IVF, etc.)"]
      STORE["Vector Storage<br/>(raw vectors)"]
      META["Metadata Store<br/>(filters, attributes)"]
  end
  
  Q["Query Vector"] --> IDX
  IDX --> |"Candidate IDs"| STORE
  STORE --> |"Exact distances"| RES["Top-K Results"]
  META --> |"Pre/post filter"| RES

  style Q fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style IDX fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style RES fill:#FAEEDA,stroke:#854F0B,color:#633806
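The four core operations can be sketched with a tiny in-memory store. This is an illustration only: `TinyVectorStore` is a hypothetical class, and it scans every vector exactly where a real database would consult an ANN index first.

```python
import math


class TinyVectorStore:
    """Exact-search toy store; a real database would put an ANN index in front."""

    def __init__(self):
        self._items = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Insert a new vector or overwrite the one stored under item_id."""
        self._items[item_id] = (list(vector), dict(metadata or {}))

    def delete(self, item_id):
        self._items.pop(item_id, None)

    def search(self, query, k, where=None):
        """Return up to k (id, score) pairs, best first, optionally filtered."""
        where = where or {}
        hits = []
        for item_id, (vector, metadata) in self._items.items():
            if all(metadata.get(key) == value for key, value in where.items()):
                hits.append((item_id, _cosine(query, vector)))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:k]


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


store = TinyVectorStore()
store.upsert("chair-1", [0.9, 0.1, 0.0], {"category": "furniture"})
store.upsert("desk-1", [0.7, 0.3, 0.1], {"category": "furniture"})
store.upsert("mouse-1", [0.8, 0.2, 0.0], {"category": "electronics"})

results = store.search([1.0, 0.0, 0.0], k=2, where={"category": "furniture"})
store.delete("chair-1")
after_delete = store.search([1.0, 0.0, 0.0], k=2, where={"category": "furniture"})
```

Here the filter is applied before scoring, so "mouse-1" never competes for a slot even though it is close to the query.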

ANN indexing algorithms

The core technology that makes vector databases fast:

HNSW (Hierarchical Navigable Small World)

The most popular algorithm in production. Builds a multi-layer graph where:

  • Bottom layer: all vectors connected to their nearest neighbors
  • Higher layers: progressively sparser subsets of vectors
  • Search: start at the top layer, greedily navigate toward the query, then refine at lower layers

Think of it like navigating a city: start on the highway (top layer) to get to the right neighborhood, then use local streets (bottom layer) to find the exact address.

Performance: 95-99% recall at 1-10ms latency for millions of vectors. Memory-intensive (index lives in RAM).
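The highway-then-local-streets idea can be sketched as a greedy walk over two layers. This is a deliberately simplified illustration, not HNSW itself: the graphs are built by brute force, the top layer is an arbitrary every-20th-point sample, and the walk tracks a single candidate where real implementations keep a candidate list sized by `ef_search`. All names are hypothetical.

```python
import math
import random


def build_knn_graph(points, ids, m):
    """Link each id to its m nearest neighbours among ids (brute force, toy only)."""
    graph = {}
    for i in ids:
        others = sorted((j for j in ids if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        graph[i] = others[:m]
    return graph


def greedy_walk(points, graph, start, query):
    """Move to whichever neighbour is closer to the query until none is."""
    current = start
    while True:
        best = min([current] + graph[current],
                   key=lambda j: math.dist(points[j], query))
        if best == current:
            return current
        current = best


def layered_search(points, layers, entry, query):
    """Descend from the sparsest layer to the densest, reusing the best node."""
    current = entry
    for graph in layers:  # ordered top (sparse) to bottom (all points)
        current = greedy_walk(points, graph, current, query)
    return current


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
all_ids = list(range(len(points)))
top_ids = all_ids[::20]  # sparse "highway" layer
layers = [build_knn_graph(points, top_ids, m=4),
          build_knn_graph(points, all_ids, m=8)]

query = (0.5, 0.5)
found = layered_search(points, layers, entry=top_ids[0], query=query)
exact = min(all_ids, key=lambda j: math.dist(points[j], query))
```

A single-candidate walk can stop at a local minimum, so `found` is not guaranteed to equal `exact`. That gap is what the recall figures measure, and what a larger candidate list reduces.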

IVF (Inverted File Index)

Clusters vectors into partitions (Voronoi cells) using k-means. At search time, only checks vectors in the nearest clusters:

  1. Pre-compute: cluster all vectors into N partitions
  2. Search: find the closest partition centroids, then search only within those partitions
  3. nprobe parameter: how many partitions to check (more = better recall, slower)

Performance: Good recall with less memory than HNSW. Slower for high-recall requirements. Often combined with PQ compression.
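The cluster-then-probe steps can be sketched as follows. This is a minimal illustration with a plain k-means loop and 2-dimensional toy data. The function names are hypothetical, and `nprobe` mirrors the parameter described above.

```python
import math
import random


def nearest(centroids, vector, count=1):
    """Indices of the `count` centroids closest to `vector`."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(centroids[c], vector))
    return order[:count]


def build_ivf(vectors, n_partitions, iterations=10, seed=0):
    """Plain k-means, then bucket every vector id under its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_partitions)
    for _ in range(iterations):
        buckets = [[] for _ in centroids]
        for v in vectors:
            buckets[nearest(centroids, v)[0]].append(v)
        centroids = [
            tuple(sum(col) / len(b) for col in zip(*b)) if b else centroids[i]
            for i, b in enumerate(buckets)
        ]
    partitions = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        partitions[nearest(centroids, v)[0]].append(idx)
    return centroids, partitions


def ivf_search(vectors, centroids, partitions, query, k, nprobe):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    candidates = [i for c in nearest(centroids, query, nprobe)
                  for i in partitions[c]]
    candidates.sort(key=lambda i: math.dist(vectors[i], query))
    return candidates[:k], len(candidates)


rng = random.Random(7)
vectors = [(rng.random(), rng.random()) for _ in range(2000)]
centroids, partitions = build_ivf(vectors, n_partitions=20)

query = (0.3, 0.7)
approx_ids, scanned = ivf_search(vectors, centroids, partitions, query, k=5, nprobe=3)
full_ids, scanned_all = ivf_search(vectors, centroids, partitions, query, k=5, nprobe=20)
```

With `nprobe=3` only a fraction of the 2,000 vectors is scanned. With `nprobe=20` every partition is probed and the search becomes exact, which is the recall-versus-speed dial in its simplest form.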

PQ (Product Quantization)

Compresses vectors to reduce memory. Splits each 1536-dim vector into sub-vectors and quantizes each independently. A 1536-dim float32 vector (6KB) can be compressed to 192 bytes (32x compression).

Tradeoff: Faster search and lower memory, but reduced accuracy. Often used with IVF: IVF for coarse search, PQ for compressed distance computation.
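The compression arithmetic and the encode step can be sketched as below. One way to reach 192 bytes is 192 sub-vectors of 8 dimensions with one 8-bit code each. That split is an assumption, since the text does not say which configuration it has in mind. The codebooks here are random for brevity, whereas real ones are learned, typically with k-means.

```python
import math
import random


def pq_code_bytes(dims, sub_dims, bits_per_code=8):
    """Bytes per vector after PQ: one code per sub-vector."""
    n_subvectors = dims // sub_dims
    return n_subvectors * bits_per_code // 8


raw_bytes = 1536 * 4                          # float32: 6144 bytes
code_bytes = pq_code_bytes(1536, sub_dims=8)  # 192 codes of 1 byte each
compression = raw_bytes / code_bytes          # 32.0


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    sub_dims = len(vector) // len(codebooks)
    codes = []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * sub_dims:(s + 1) * sub_dims]
        codes.append(min(range(len(codebook)),
                         key=lambda c: math.dist(codebook[c], sub)))
    return codes


def pq_decode(codes, codebooks):
    """Approximate reconstruction: concatenate the chosen codewords."""
    return [x for s, c in enumerate(codes) for x in codebooks[s][c]]


# Toy setup: 8-dim vectors, 4 sub-vectors of 2 dims, 16 codewords each.
# Random codebooks are an assumption for illustration only.
rng = random.Random(3)
codebooks = [[[rng.uniform(-1, 1) for _ in range(2)] for _ in range(16)]
             for _ in range(4)]
vector = [rng.uniform(-1, 1) for _ in range(8)]
codes = pq_encode(vector, codebooks)
approx = pq_decode(codes, codebooks)
```

The decoded vector is only an approximation of the original, which is where the accuracy loss comes from.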

graph LR
  subgraph algorithms["Index Algorithm Comparison"]
      HNSW["HNSW<br/>Best recall<br/>High memory<br/>1-10ms latency<br/>Best for: <10M vectors"]
      IVF["IVF-PQ<br/>Good recall<br/>Low memory<br/>5-50ms latency<br/>Best for: >10M vectors"]
      FLAT["Flat (Brute Force)<br/>Perfect recall<br/>No index overhead<br/>Slow at scale<br/>Best for: <100K vectors"]
  end

  style HNSW fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style IVF fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style FLAT fill:#FAEEDA,stroke:#854F0B,color:#633806

Distance metrics

How “similarity” is measured:

Cosine similarity: Measures the angle between vectors. Ignores magnitude. Best for text embeddings where direction (meaning) matters more than magnitude. Range: -1 to 1.

Euclidean distance (L2): Straight-line distance. Affected by magnitude. Best when vector magnitude carries information you want to keep. Lower = more similar. For normalized vectors it produces the same ranking as cosine.

Dot product (Inner Product): Fast computation. Equivalent to cosine for normalized vectors. Best when embeddings are pre-normalized and you want maximum speed.

Rule of thumb: If your embedding model outputs normalized vectors (most modern models do), cosine and dot product give identical rankings. Use dot product for speed. If vectors are not normalized and magnitude is not meaningful, use cosine.
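The relationships between the three metrics can be checked numerically. This is a small sketch with hypothetical helper names. It relies on the identity that, for unit vectors, squared L2 distance equals 2 - 2 × cosine.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2(a, b):
    return math.dist(a, b)


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

# On unit vectors: dot == cosine, and squared L2 == 2 - 2 * cosine.
same_as_cosine = math.isclose(dot(a, b), cosine(a, b))
l2_identity = math.isclose(l2(a, b) ** 2, 2 - 2 * cosine(a, b))

# Without normalization, dot product rewards magnitude; cosine does not.
query = [1.0, 0.0]
long_vec, short_vec = [10.0, 0.0], [1.0, 0.0]
dot_prefers_long = dot(query, long_vec) > dot(query, short_vec)
cosine_sees_no_difference = math.isclose(cosine(query, long_vec),
                                         cosine(query, short_vec))
```

Because all three metrics rank unit vectors identically, the choice among them for normalized embeddings comes down to what the index computes fastest.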

Managed vs self-hosted options

Managed (serverless)

| Database | Strengths | Pricing model |
| --- | --- | --- |
| Pinecone | Simple API, serverless, auto-scaling | Per-query + storage |
| Weaviate Cloud | Hybrid search built-in, GraphQL API | Per-node |
| Qdrant Cloud | High performance, rich filtering | Per-node |
| Zilliz (Milvus Cloud) | Massive scale, GPU acceleration | Per-CU (compute unit) |

Self-hosted

| Database | Strengths | Best for |
| --- | --- | --- |
| pgvector | PostgreSQL extension, familiar tooling | Small-medium scale, existing Postgres |
| Milvus | Distributed, massive scale | 100M+ vectors |
| Qdrant | Rust-based, fast, rich filtering | Performance-critical workloads |
| Chroma | Simple, embedded, great for prototyping | Development and small datasets |
| FAISS | Library (not DB), Meta-developed | Custom integration, research |

Where vector databases break

Metadata filtering + vector search interaction

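Filtering narrows the candidate set before or after vector search. Pre-filtering (filter first, then search within the filtered set) returns correct results, but a very selective filter can force a slow scan of the matching vectors or leave the ANN graph too sparse to navigate well. Post-filtering (search all vectors, then filter results) wastes compute on results that will be filtered out, and can return fewer than k results when most of the nearest neighbors fail the filter.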

Most databases use a hybrid approach, but the interaction between filters and similarity search is a common source of unexpected behavior.
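A toy example shows how the two orderings can disagree. It is a sketch with hypothetical item ids and an exact search standing in for the ANN step. The point is only the order in which the filter is applied.

```python
import math

# (id, vector, category); toy data with hypothetical ids.
items = [
    ("lamp", (1.00, 0.00), "lighting"),
    ("bulb", (0.99, 0.05), "lighting"),
    ("torch", (0.98, 0.10), "lighting"),
    ("chair", (0.60, 0.80), "furniture"),
    ("desk", (0.50, 0.85), "furniture"),
]


def top_k(candidates, query, k):
    """Exact nearest neighbours by Euclidean distance."""
    return sorted(candidates, key=lambda item: math.dist(item[1], query))[:k]


def pre_filter_search(query, k, category):
    """Filter first, then search only the matching items."""
    matching = [item for item in items if item[2] == category]
    return [item[0] for item in top_k(matching, query, k)]


def post_filter_search(query, k, category):
    """Search everything, then drop results that fail the filter."""
    return [item[0] for item in top_k(items, query, k) if item[2] == category]


query = (1.0, 0.0)
pre = pre_filter_search(query, k=2, category="furniture")    # ["chair", "desk"]
post = post_filter_search(query, k=2, category="furniture")  # []
```

Post-filtering comes back empty here because both of the overall top-2 are in the wrong category. A common mitigation is to over-fetch (search for more than k, then filter), at the cost of extra work.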

The “freshness” problem

New vectors need to be indexed before they are searchable. Some databases index synchronously (immediate availability but slower writes). Others batch index (fast writes but search delay). For real-time applications, understand your database’s indexing latency.

Dimensionality vs cost

1536-dimensional vectors at scale are expensive to store and search:

  • 10M vectors × 1536 dims × 4 bytes (float32) ≈ 60 GB just for raw vectors
  • An HNSW index brings total memory to roughly 2-3x the raw vector size
  • Total: roughly 120-180 GB RAM for 10M vectors

Consider dimension reduction (Matryoshka embeddings, PCA) or quantization if cost is a concern.
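The memory arithmetic is easy to parameterize so you can plug in your own corpus size. This is a minimal sketch. The 2-3x multiplier is the budget quoted above rather than a measured figure, and the halved-dimension and 8-bit rows are illustrative assumptions.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Storage for the vectors alone, before any index structures."""
    return n_vectors * dims * bytes_per_value


GB = 10 ** 9
raw_gb = raw_vector_bytes(10_000_000, 1536) / GB       # float32: 61.44
low_gb, high_gb = 2 * raw_gb, 3 * raw_gb               # 2-3x HNSW budget
half_dims_gb = raw_vector_bytes(10_000_000, 768) / GB  # e.g. truncated to 768 dims
int8_gb = raw_vector_bytes(10_000_000, 1536, 1) / GB   # 8-bit scalar quantization
```

Halving the dimensions halves the raw footprint, and one byte per value cuts it to a quarter. Either change should be checked against recall on your own data.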

Multi-tenancy

If you serve multiple customers from one vector database, you need isolation. Options:

  • Namespace/partition per tenant (Pinecone namespaces, Qdrant collections)
  • Metadata filter per tenant (simpler but leakier)
  • Separate database per tenant (most isolated but most expensive)

Recall vs latency tradeoff

ANN algorithms have tunable parameters that trade recall for speed:

  • HNSW: ef_search (higher = better recall, slower)
  • IVF: nprobe (higher = more partitions checked, slower)
  • Quantization bits (more bits = better accuracy, more memory)

There is no free lunch. Measure your recall at your latency budget.

Real-world implementations

  • Notion AI - uses vector search over workspace content for contextual Q&A
  • Spotify - collaborative filtering and content-based recommendations using vector similarity
  • Shopify - product search and recommendations powered by embeddings in a vector store
  • Stripe - fraud detection using transaction embedding similarity
  • GitHub Copilot - code context retrieval from repository embeddings

How to apply in practice

Start with pgvector if you already use PostgreSQL and have <1M vectors. The operational simplicity of not adding another database to your stack is worth more than the performance difference at small scale.

Use a managed service for production workloads >1M vectors. The operational burden of scaling, replication, and backup for a vector database is significant. Let Pinecone/Qdrant/Weaviate handle it.

Always benchmark on your data. Published benchmarks use standard public or synthetic datasets that may not resemble yours. Your vectors, your query patterns, and your filter requirements may behave differently. Load test with realistic data before committing.

Index configuration matters more than database choice. A well-configured pgvector with the right index parameters can outperform a poorly configured Pinecone instance. Invest time in tuning ef_construction, ef_search, m (for HNSW), or nlist/nprobe (for IVF).

Plan for updates. Embeddings change when you update your embedding model. You need to re-embed and re-index your entire corpus. Design your pipeline so this is a button-press, not a migration project.

FAQ

Q: Can I just use PostgreSQL with pgvector instead of a specialized vector database?

Yes, for small to medium scale (<1-5M vectors). pgvector is production-ready, supports HNSW indexes, and integrates with your existing PostgreSQL infrastructure. You get transactions, joins with metadata tables, and familiar tooling. The tradeoff: at 10M+ vectors, specialized databases (Qdrant, Milvus) offer better query performance and more sophisticated index options. If you are already on Postgres and have <5M vectors, pgvector is the pragmatic choice.

Q: How do I evaluate recall/accuracy of my vector search?

Create a test set: for 100-200 queries, compute the true top-10 results using brute force (exact search). Then run the same queries against your ANN index and measure what percentage of the true top-10 appears in the ANN top-10. This is recall@10. Target: 95%+ for most RAG applications. Below 90%, you are missing too many relevant results and need to tune index parameters.
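The recall@k calculation itself is short. In this sketch the id lists are made-up placeholders. In practice `exact_runs` would come from brute-force search and `approx_runs` from your ANN index.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the true top-k that the ANN index also returned in its top-k."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth)


def mean_recall_at_k(exact_runs, approx_runs, k):
    """Average recall@k over a set of test queries."""
    scores = [recall_at_k(e, a, k) for e, a in zip(exact_runs, approx_runs)]
    return sum(scores) / len(scores)


# Two hypothetical queries: the first misses one true neighbour (id 10),
# the second returns all ten in a different order.
exact_runs = [list(range(1, 11)), list(range(11, 21))]
approx_runs = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 99],
               [20, 19, 18, 17, 16, 15, 14, 13, 12, 11]]
score = mean_recall_at_k(exact_runs, approx_runs, k=10)  # (0.9 + 1.0) / 2
```

Recall@k ignores order within the top-k, so the second query scores 1.0 even though its ranking is reversed.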

Q: What happens when I need to update my embedding model?

You need to re-embed everything and rebuild the index. This is a significant operation at scale (millions of embedding API calls). Mitigate by: keeping the original text stored alongside vectors (so re-embedding is just an API call, not re-processing), building pipeline automation for re-indexing, and supporting multiple index versions so you can cut over without downtime (blue-green deployment for your vector index).
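The blue-green cutover can be sketched as an alias that points at whichever index version is live. `IndexRegistry`, `embed_v1`, and `embed_v2` are hypothetical names for illustration. Many databases offer a comparable alias feature, but this sketch does not model any specific one.

```python
class IndexRegistry:
    """Maps a stable alias to whichever index version currently serves reads."""

    def __init__(self):
        self._indexes = {}  # version -> {doc_id: vector}
        self._live = None

    def build(self, version, documents, embed):
        """Re-embed stored text into a new index without touching the live one."""
        self._indexes[version] = {doc_id: embed(text)
                                  for doc_id, text in documents.items()}

    def cut_over(self, version):
        """Switch reads to `version`; return the previous version for rollback."""
        if version not in self._indexes:
            raise KeyError(f"index {version!r} has not been built")
        previous, self._live = self._live, version
        return previous

    def live_index(self):
        return self._indexes[self._live]


# Original text is kept alongside the vectors, so re-embedding needs no re-processing.
documents = {"d1": "ergonomic office chair", "d2": "standing desk"}


def embed_v1(text):  # stand-in for the old embedding model
    return [float(len(text))]


def embed_v2(text):  # stand-in for the new model, with a different dimensionality
    return [float(len(text)), float(text.count(" "))]


registry = IndexRegistry()
registry.build("v1", documents, embed_v1)
registry.cut_over("v1")

registry.build("v2", documents, embed_v2)             # v1 keeps serving reads
dims_during_rebuild = len(registry.live_index()["d1"])
previous = registry.cut_over("v2")                    # switch in one step
dims_after_cutover = len(registry.live_index()["d1"])
```

Keeping the previous version around until the new one is validated makes rollback a second `cut_over` call rather than another full rebuild.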

Interview questions

Q: Design the vector search infrastructure for a RAG system serving 1000 queries/second over 50 million document chunks.

At 50M vectors and 1000 QPS: (1) Choose a distributed solution - a Milvus or Qdrant cluster with sharding across multiple nodes. (2) Use an IVF-HNSW or IVF-PQ index for the scale - pure HNSW needs ~300GB RAM for the raw 50M 1536-dim float32 vectors alone, before any graph overhead, which requires expensive machines. With PQ compression (8x), that drops to ~40GB per replica. (3) Add read replicas for QPS - at 1000 QPS with ~10ms per query, about 10 queries are in flight at any moment, so you need at least 10 query nodes if each serves one query at a time (fewer if nodes run queries in parallel), plus headroom for spikes. (4) Separate the write path: batch indexing with periodic merges, not real-time. (5) Caching: cache frequently-queried results (popular questions) at the application layer. (6) Monitor: track p99 latency, recall (sample and compare to brute force periodically), and index freshness.
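The sizing numbers in this answer follow from two small calculations, sketched below. The in-flight estimate uses Little's law (arrival rate × time in system). The per-node parallelism parameter is a hypothetical knob, set to 1 to reproduce the "at least 10 nodes" figure.

```python
import math


def queries_in_flight(qps, latency_ms):
    """Little's law: average concurrent queries = arrival rate x time in system."""
    return qps * latency_ms / 1000


def replicas_needed(qps, latency_ms, parallel_queries_per_node=1):
    """Nodes required to absorb the average concurrency (no headroom included)."""
    return math.ceil(queries_in_flight(qps, latency_ms) / parallel_queries_per_node)


raw_gb = 50_000_000 * 1536 * 4 / 10 ** 9  # float32 vectors only: 307.2
pq_gb = raw_gb / 8                        # with 8x PQ compression: 38.4

nodes_serial = replicas_needed(qps=1000, latency_ms=10)  # 10
nodes_parallel = replicas_needed(qps=1000, latency_ms=10,
                                 parallel_queries_per_node=4)  # 3
```

These are averages. A real plan would size for peak traffic and p99 latency rather than the mean.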

Q: Your vector search returns 10 results, but only 3 are actually relevant to the query. The other 7 are semantically similar but not useful. How do you improve precision?

Multi-stage approach: (1) Reranking - use a cross-encoder model to re-score the top-20 vector results for actual query-document relevance, return top-5. (2) Metadata filtering - add filters that eliminate obviously irrelevant results (wrong document type, outdated content, wrong product category). (3) Embedding model improvement - if the model consistently returns irrelevant results for certain query types, it may need domain-specific fine-tuning. (4) Hybrid search - combine vector similarity with BM25 keyword matching. Documents that score well on both are more likely to be truly relevant. (5) Query understanding - expand or rewrite the query to better capture intent before embedding. Measure precision@5 before and after each change to quantify improvement.
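For the hybrid search step, one common way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The answer above does not name a fusion method, so RRF is an assumption here. The document ids are placeholders, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Score each document by the sum of 1 / (k + rank) over all rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["doc-a", "doc-b", "doc-c"]  # from the vector index, best first
bm25_hits = ["doc-c", "doc-a", "doc-e"]    # from keyword search, best first
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["doc-a", "doc-c", "doc-b", "doc-e"]
```

Documents that appear in both lists rise to the top, which matches the intuition that agreement between the two retrievers signals relevance. RRF uses only ranks, so the two score scales never need to be reconciled.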

Q: Compare using a vector database vs building similarity search on top of Redis or Elasticsearch. When would you choose each?

Redis with vector extensions (RediSearch): good for small-medium datasets where you already use Redis, need sub-millisecond latency, and want to combine vector search with other Redis operations. Limited scaling options. Elasticsearch with dense_vector: good when you need hybrid search (text BM25 + vector) in a single system, already use ES for search, and your team knows ES operations. Heavier operationally, but versatile. Dedicated vector database: best when vector search is your primary use case, you need optimal performance at scale (>5M vectors), you need advanced ANN algorithms and tuning, or you are building a RAG system where retrieval quality is the bottleneck. The key decision factor: if vector search is a feature within a larger system, use existing infrastructure (Redis, ES). If it is the core capability, use a purpose-built vector database.