Embeddings & Vector Spaces: How Machines Understand Meaning

You build a search feature for your documentation site. Users type “how to fix authentication errors” and expect to find the article titled “Troubleshooting Login Failures.” Keyword search fails - none of those words overlap. You could manually tag every article with synonyms, but you have 10,000 articles and the vocabulary keeps growing.

Then you embed both the query and every article into a vector space. Suddenly “fix authentication errors” and “troubleshooting login failures” land close together in 1536-dimensional space because they mean similar things, even though they share zero words. Your search works. No synonym dictionaries. No manual tagging. Just math that captures meaning.

Embeddings are the bridge between human language and machine computation. If you are building anything with AI - search, RAG, recommendations, clustering, classification - you are building on embeddings whether you realize it or not.

What an embedding actually is

An embedding is a dense vector (a list of floating-point numbers) that represents the semantic meaning of a piece of text. The key properties:

Fixed dimensionality: Regardless of whether you embed a single word or a 500-word paragraph, you get the same number of dimensions (e.g., 1536 for OpenAI’s text-embedding-3-small, 768 for many open-source models)
Semantic proximity: Texts with similar meanings produce vectors that are close together in the vector space
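Compositionality: Relationships between concepts are encoded geometrically. The classic example comes from word embeddings, where the vector for “king - man + woman” lands near “queen”; the effect is approximate rather than exact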

The numbers themselves are not interpretable by humans. Dimension 847 does not mean “formality” or “topic.” But collectively, the 1536 dimensions encode rich semantic information that emerges from training on billions of text pairs.
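To make the “king - man + woman” arithmetic concrete, here is a toy sketch. The 3-dimensional vectors are hand-picked, hypothetical values, not the output of any real model; real embeddings have hundreds of dimensions, no interpretable axes, and the analogy only holds approximately.

```python
import math

# Hypothetical, hand-picked toy vectors. The axes loosely stand for
# (royalty, maleness, femaleness); real models have no such readable axes.
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pizza": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", TOY_VECTORS)
print(result)  # queen
```

Excluding the three input words from the candidates is the usual convention for this kind of analogy test; without it, the nearest vector is often one of the inputs.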

graph TD
  subgraph inputs["Different Inputs"]
      A["'How to reset my password'"]
      B["'Password recovery steps'"]
      C["'Best pizza in NYC'"]
  end
  subgraph vectors["Vector Space (1536 dims)"]
      VA["[0.23, -0.45, 0.12, ..., 0.67]"]
      VB["[0.21, -0.43, 0.14, ..., 0.65]"]
      VC["[-0.56, 0.78, -0.33, ..., -0.12]"]
  end
  subgraph distance["Similarity"]
      D["A ↔ B: 0.94 (very similar)"]
      E["A ↔ C: 0.12 (unrelated)"]
  end

  A --> VA
  B --> VB
  C --> VC

  style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style B fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style C fill:#F1EFE8,stroke:#888780,color:#444441
  style D fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style E fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

How embeddings are created

Word embeddings (the foundation)

The original insight came from Word2Vec (2013): train a neural network to predict a word from its context (or vice versa), and the hidden layer weights become useful representations. Words that appear in similar contexts get similar vectors.
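As a rough illustration of what “predict a word from its context (or vice versa)” means as training data, the sketch below generates skip-gram style (center, context) pairs from a sentence. The function name and the window size are assumptions for illustration; it shows only the data preparation, not the neural network.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words that keep showing up with the same context words receive similar gradient updates during training, which is why they end up with similar vectors.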

Sentence and document embeddings

Modern embedding models go beyond words. They use transformer architectures (similar to LLMs but optimized differently) to encode entire passages into single vectors:

Input text is tokenized and passed through transformer layers
Pooling combines all token representations into one vector (mean pooling, CLS token, or attention-weighted)
Training objective is contrastive: push semantically similar pairs closer together, push dissimilar pairs apart
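The pooling step can be sketched in a few lines. This is a minimal mean-pooling example in plain Python, assuming the transformer has already produced one vector per token and an attention mask that marks padding with 0; the tiny 2-dimensional vectors are made up for readability.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]

# Three token vectors; the last one is padding and must not affect the result.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 3.0]
```

Whatever the input length, the output has the same number of dimensions as a single token vector, which is where the fixed dimensionality of embeddings comes from.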

The training data matters enormously. Models trained on question-answer pairs excel at Q&A retrieval. Models trained on paraphrase pairs excel at semantic similarity. This is why choosing the right embedding model for your use case matters.

The contrastive training loop

Positive pair: ("What causes high latency?", "Network congestion increases response times")
Negative pair: ("What causes high latency?", "The best pizza recipe uses fresh mozzarella")

Goal: minimize distance between positive pairs, maximize distance between negative pairs
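One common way to turn that goal into a number is an InfoNCE-style loss. The sketch below is an assumption about the general shape of such a loss, not the exact objective of any particular model; the 2-dimensional vectors and the temperature value are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / t) / (exp(s_pos / t) + sum_i exp(s_neg_i / t)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor = [1.0, 0.0]                      # stands in for the query embedding
negatives = [[0.0, 1.0], [-1.0, 0.2]]    # stand in for unrelated texts

# A positive that already sits near the anchor gives a loss close to zero...
loss_aligned = info_nce_loss(anchor, [0.9, 0.1], negatives)
# ...while a positive that sits near a negative gives a larger loss.
loss_misaligned = info_nce_loss(anchor, [0.1, 0.9], negatives)
print(loss_aligned, loss_misaligned)
```

Training adjusts the model so that this loss shrinks across many such pairs, which pulls positives toward their anchors and pushes negatives away.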

Measuring similarity

Two vectors are “similar” if they point in roughly the same direction. The standard measures:

Cosine similarity: The cosine of the angle between two vectors. Ranges from -1 (opposite) to 1 (identical direction). Ignores magnitude, focuses on direction. This is the default for most embedding use cases.

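Dot product: Faster to compute because it skips the division by the vector lengths, but the result depends on magnitude as well as direction. When vectors are normalized to unit length, the dot product equals cosine similarity.

Euclidean distance: Straight-line distance between vector endpoints. It is sensitive to magnitude, so it is less common for unnormalized embeddings; for unit-length vectors it produces the same ranking as cosine similarity.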
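The three measures are short enough to write out. This sketch uses small hand-made vectors to show how each measure reacts to magnitude, and checks the identity that ties them together for unit-length vectors: squared Euclidean distance equals 2 - 2 * cosine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice the magnitude
c = [4.0, -3.0]   # perpendicular to a

print(cosine_similarity(a, b))   # 1.0: direction is identical
print(dot(a, b))                 # 50.0: grows with magnitude
print(euclidean_distance(a, b))  # 5.0: nonzero even though direction matches

# After normalization all three measures agree on who is closest.
ua, ub, uc = normalize(a), normalize(b), normalize(c)
print(dot(ua, ub), euclidean_distance(ua, ub))
print(euclidean_distance(ua, uc) ** 2, 2 - 2 * cosine_similarity(a, c))
```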

graph LR
  subgraph cosine["Cosine Similarity"]
      direction TB
      CS1["cos(A, B) = 0.95<br/>Almost identical meaning"]
      CS2["cos(A, C) = 0.02<br/>Completely unrelated"]
      CS3["cos(A, D) = -0.8<br/>Opposite meaning"]
  end

  style CS1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style CS2 fill:#F1EFE8,stroke:#888780,color:#444441
  style CS3 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Where embeddings break or get interesting

The curse of dimensionality

In high-dimensional spaces, distances between points tend to concentrate: the nearest and farthest neighbors of a point end up at similar distances, so similarity scores fall into a narrow band. In 1536 dimensions, the gap between “very similar” (cosine 0.92) and “somewhat related” (cosine 0.78) can still be meaningful, but the absolute numbers are harder to interpret than you might expect. You cannot set a universal similarity threshold - it depends on the model, the domain, and the data distribution.
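A small simulation shows the concentration effect. It assumes uniformly random points, which real embeddings are not, so treat it as an illustration of the geometry rather than a statement about any model. The function name and parameters are made up for this sketch.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim_contrast = relative_contrast(dim=2)
high_dim_contrast = relative_contrast(dim=1000)
print(low_dim_contrast, high_dim_contrast)
```

In 2 dimensions the farthest point is many times farther away than the nearest one; in 1000 dimensions the two distances differ by only a small fraction. That narrowing is why thresholds have to be calibrated per model and per dataset.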

Domain mismatch

An embedding model trained on web text often underperforms on medical literature, legal documents, or codebases without domain-specific fine-tuning. The model encodes the relationships it saw during training. If your domain uses words differently (“bug” in software vs entomology), general-purpose embeddings may conflate the two senses.

Asymmetric search

When a user types a short query (“how to deploy kubernetes”) and you are matching against long documents, the query and document have very different structures. Some embedding models handle this asymmetry well (trained on query-document pairs), while others assume both inputs are similar in length and style. E5 models expect a “query: ” or “passage: ” prefix on each input, and BGE models recommend prepending an instruction to queries for retrieval.
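For E5-style models, handling the asymmetry comes down to tagging each input with its role before embedding. The helper below is a hypothetical convenience function; the two prefix strings follow the convention documented for E5 models, and other model families use different conventions, so check the model card for the one you use.

```python
# Prefix convention documented for E5 models; other models differ.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

def prepare_for_e5(text, is_query):
    """Prepend the role prefix that tells the model which side of the search this is."""
    prefix = E5_QUERY_PREFIX if is_query else E5_PASSAGE_PREFIX
    return prefix + text.strip()

query_input = prepare_for_e5("how to deploy kubernetes", is_query=True)
doc_input = prepare_for_e5("  This guide walks through a cluster rollout.  ", is_query=False)
print(query_input)
print(doc_input)
```

The prefixed strings are what you pass to the embedding model, at ingestion for documents and at runtime for queries. Mixing up the roles, or omitting the prefix on one side, tends to degrade retrieval quality.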

Embedding drift

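If your documents change over time (product updates, new features), embeddings computed months ago no longer describe the current content, so queries retrieve stale or mismatched results. Switching to a different embedding model has a similar effect, because vectors produced by different models are not comparable. You need a re-embedding strategy for living document collections.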
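One possible re-embedding strategy is to store a content hash and the model name next to each vector, and re-embed only what changed. This is a sketch under that assumption; the function names and the dictionary layout are hypothetical.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_needing_reembedding(documents, stored_hashes, model_name, stored_model_name):
    """documents: {doc_id: text}; stored_hashes: {doc_id: hash at last embedding}."""
    if model_name != stored_model_name:
        # Vectors from different models are not comparable: redo everything.
        return sorted(documents)
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    )

documents = {"d1": "hello", "d2": "world v2", "d3": "brand new page"}
stored_hashes = {"d1": content_hash("hello"), "d2": content_hash("world")}

changed = docs_needing_reembedding(documents, stored_hashes, "model-a", "model-a")
after_model_swap = docs_needing_reembedding(documents, stored_hashes, "model-b", "model-a")
print(changed)           # ['d2', 'd3']
print(after_model_swap)  # ['d1', 'd2', 'd3']
```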

Real-world systems that use embeddings

OpenAI text-embedding-3 - 1536 dimensions (small) or 3072 dimensions (large), widely used in applications built on the OpenAI API
Cohere Embed v3 - trained with compression-aware objectives, so vectors hold up well under int8 and binary quantization; Matryoshka embeddings (variable dimensionality) arrived with Embed v4
Google Gecko/Gemini Embeddings - text embedding models available through Vertex AI and the Gemini API; Google offers a separate multimodal embedding model for text, images, and video
Pinecone, Weaviate, Qdrant, Milvus - vector databases optimized for storing embeddings at large scale and querying them with low-latency approximate nearest neighbor search
Spotify - embeds songs and user preferences in the same vector space for recommendations
Airbnb - embeds listings and search queries to match beyond keyword overlap

How to apply embeddings in practice

Choosing an embedding model

Factor	Consideration
Dimension count	Higher = more expressive but more storage/compute. 768-1536 is the sweet spot
Domain match	Pick models trained on data similar to yours
Sequence length	Most models cap at 512-8192 tokens per input
Latency	Smaller models (384 dims) for real-time, larger (3072 dims) for batch
Multilingual	Multilingual models if your content spans languages
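The storage side of the dimension trade-off is simple arithmetic. The sketch below assumes 4-byte float32 values and counts only the raw vectors; index structures and metadata add overhead on top, and the amount varies by database.

```python
def raw_vector_storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (10**9 bytes), excluding index overhead and metadata."""
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    size = raw_vector_storage_gb(10_000_000, dims)
    print(f"{dims} dims: {size:.2f} GB for 10M vectors")
```

For 10 million vectors this works out to about 15 GB at 384 dimensions, 61 GB at 1536, and 123 GB at 3072, so each doubling of dimensions doubles the memory you need to keep the vectors hot.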

The embedding pipeline

Chunk your content - Break documents into meaningful segments (paragraphs, sections, or semantic chunks)
Embed at ingestion - Compute embeddings once and store them alongside metadata
Embed queries at runtime - Convert the user’s query to a vector using the same model
Nearest neighbor search - Find the k closest stored vectors (ANN algorithms like HNSW)
Post-process - Rerank, filter by metadata, apply business logic
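The first four steps can be sketched end to end in plain Python. Two loud assumptions: toy_embed is a hypothetical stand-in for a real embedding model (a hashed bag-of-words that captures word overlap only, not meaning), and the search is an exact brute-force scan rather than an ANN index such as HNSW. Post-processing is left out.

```python
import math
import re
import zlib

DIM = 256  # toy dimensionality, chosen only for this sketch

def toy_embed(text):
    """Hypothetical stand-in for a real model: hashed bag-of-words, unit length."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]

def chunk(document):
    """Step 1: split on blank lines, one chunk per paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def build_index(documents):
    """Step 2: embed every chunk once and keep metadata next to the vector."""
    index = []
    for doc_id, text in documents.items():
        for i, piece in enumerate(chunk(text)):
            index.append({"doc_id": doc_id, "chunk": i, "text": piece,
                          "vector": toy_embed(piece)})
    return index

def search(index, query, k=3):
    """Steps 3-4: embed the query with the same model, then exact top-k by dot product."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, row["vector"])), row) for row in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

docs = {
    "auth": "Reset your password from the login page.\n\nContact support if login still fails.",
    "food": "The best pizza uses fresh mozzarella.",
}
index = build_index(docs)
hits = search(index, "password reset on the login page", k=2)
for score, row in hits:
    print(round(score, 3), row["doc_id"], row["chunk"], row["text"])
```

Swapping toy_embed for a real model and the linear scan for a vector database changes the quality and the scale, but not the shape of the pipeline: the same embedding function must be used at ingestion and at query time.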

Matryoshka embeddings

A technique where the model is trained so that the first N dimensions of the full embedding are still useful on their own. You can store the full 1536-dimension vector for high-accuracy search, but use only the first 256 dimensions (re-normalized to unit length) for fast pre-filtering. OpenAI’s text-embedding-3 models support this through the dimensions parameter of the embeddings API.
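A minimal sketch of the truncate-then-rerank idea follows. It is only meaningful for vectors from a model trained with a Matryoshka objective; the 4-dimensional vectors here are hypothetical, and a real system would precompute the truncated vectors instead of truncating on every query.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def truncate_and_renormalize(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    head = vector[:dims]
    length = math.sqrt(dot(head, head))
    if length == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / length for x in head]

def two_stage_search(query, corpus, short_dims, shortlist_size, k):
    """Shortlist with truncated vectors, then rerank the shortlist with full vectors."""
    q_short = truncate_and_renormalize(query, short_dims)
    shortlist = sorted(
        corpus,
        key=lambda item: dot(q_short, truncate_and_renormalize(item[1], short_dims)),
        reverse=True,
    )[:shortlist_size]
    return sorted(shortlist, key=lambda item: dot(query, item[1]), reverse=True)[:k]

query = [0.6, 0.8, 0.0, 0.0]
corpus = [
    ("a", [0.6, 0.8, 0.0, 0.0]),
    ("b", [0.8, 0.6, 0.0, 0.0]),
    ("c", [-0.6, 0.0, 0.8, 0.0]),
]
top = two_stage_search(query, corpus, short_dims=2, shortlist_size=2, k=1)
print(top[0][0])  # a
```

Re-normalizing after truncation matters because the first N values of a unit vector are no longer unit length on their own.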

Normalization matters

Normalize your vectors to unit length before storing them if you plan to score with dot product as a stand-in for cosine similarity. Cosine similarity itself divides out magnitude, and some embedding APIs and vector databases normalize for you, but if you are building custom similarity search on raw dot products, forgetting to normalize leads to magnitude-biased results where vectors with larger norms score higher regardless of relevance.
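The bias is easy to reproduce with made-up numbers. In this sketch the vector that points almost exactly the same way as the query loses to a vector pointing elsewhere, purely because the second one is longer; normalizing both fixes the ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

query = [1.0, 0.0]
candidates = {
    "relevant": [0.9, 0.1],         # nearly the same direction, small magnitude
    "off_topic_large": [3.0, 4.0],  # different direction, large magnitude
}

raw_scores = {name: dot(query, vec) for name, vec in candidates.items()}
unit_scores = {name: dot(normalize(query), normalize(vec)) for name, vec in candidates.items()}

print(max(raw_scores, key=raw_scores.get))    # off_topic_large
print(max(unit_scores, key=unit_scores.get))  # relevant
```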

FAQ

Q: Can I use LLM embeddings from the last hidden layer instead of a dedicated embedding model?

You can, but dedicated embedding models generally outperform raw LLM hidden states for retrieval tasks. LLMs are optimized for generation (predicting the next token), not for producing representations where similar texts cluster together. Dedicated models are trained with contrastive objectives specifically designed for similarity, and some of them (E5-Mistral, for example) are themselves LLMs fine-tuned this way. Use raw LLM hidden states only when you need them for a specialized task and cannot find a suitable embedding model.

Q: How many dimensions do I actually need?

For most production use cases, 768-1536 dimensions are sufficient. Going to 3072 typically provides marginal accuracy gains (roughly 1-3% on benchmarks) at double the storage cost of 1536. Going below 384 often hurts quality noticeably. The Matryoshka approach lets you hedge: store full vectors, search with truncated ones for speed, and rerank with full vectors for accuracy.

Q: Do embeddings understand negation? Will “not good” and “good” be far apart?

This is a known weakness. Many embedding models struggle with negation because the vector is dominated by the words the two texts share. “This movie is not good” and “this movie is good” may end up closer than you want because both contain “movie” and “good.” Some newer models (E5-Mistral, Cohere v3) are reported to handle negation better, but it is worth testing on your specific data. For critical negation handling, consider using a reranker on top of vector search.

Interview questions

Q: You need to build a semantic search system for 10 million product descriptions. Walk me through the architecture.

Strong answers cover: choosing an embedding model appropriate for e-commerce text, chunking strategy for product descriptions (likely whole-description embeddings since they are short), selecting a vector database that handles 10M vectors with low latency (Pinecone, Qdrant, or Milvus, with an ANN index such as HNSW), the ingestion pipeline (batch embed, upsert to vector DB), the query path (embed query, ANN search for top-50, rerank to top-10, return with metadata), and index configuration (dimension count, distance metric, replica count for availability).

Q: Your embedding-based search returns semantically relevant results but users complain they are “too broad.” How do you improve precision without sacrificing recall?

Hybrid approach: combine vector similarity with keyword matching (BM25) and metadata filters. Use a reranker (cross-encoder) on the top-50 vector results to re-score with more expensive but accurate pairwise relevance. Add user feedback signals to learn domain-specific relevance. Consider fine-tuning the embedding model on your domain’s query-document pairs if you have labeled data.
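One simple way to combine a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF). The answer above does not name a fusion method, so treat RRF as one reasonable option rather than the prescribed one; the document ids are made up, and k=60 is the constant commonly used with this technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_ranking = ["doc_a", "doc_b", "doc_c"]  # from embedding similarity
bm25_ranking = ["doc_b", "doc_d", "doc_a"]    # from keyword matching
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF works on ranks rather than raw scores, so you do not have to put cosine similarities and BM25 scores on a common scale. Documents that both retrievers agree on rise to the top, which is what tightens precision.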

Q: Explain the tradeoff between embedding dimensionality and system performance. When would you choose 384 vs 1536 vs 3072 dimensions?

384 dimensions: real-time applications with millions of vectors where latency and storage costs dominate (e.g., autocomplete, live recommendations). 1536 dimensions: the default for most production RAG and search systems - good balance of quality and efficiency. 3072 dimensions: high-stakes applications where retrieval accuracy is critical and you can afford the 2x storage and slight latency increase (legal search, medical literature). Always benchmark on your specific data - the quality gap between 768 and 1536 might be negligible for your domain.