Embeddings & Vector Spaces: How Machines Understand Meaning
You build a search feature for your documentation site. Users type “how to fix authentication errors” and expect to find the article titled “Troubleshooting Login Failures.” Keyword search fails - none of those words overlap. You could manually tag every article with synonyms, but you have 10,000 articles and the vocabulary keeps growing.
Then you embed both the query and every article into a vector space. Suddenly “fix authentication errors” and “troubleshooting login failures” land close together in 1536-dimensional space because they mean similar things, even though they share zero words. Your search works. No synonym dictionaries. No manual tagging. Just math that captures meaning.
Embeddings are the bridge between human language and machine computation. If you are building anything with AI - search, RAG, recommendations, clustering, classification - you are building on embeddings whether you realize it or not.
What an embedding actually is
An embedding is a dense vector (a list of floating-point numbers) that represents the semantic meaning of a piece of text. The key properties:
- Fixed dimensionality: Regardless of whether you embed a single word or a 500-word paragraph, you get the same number of dimensions (e.g., 1536 for OpenAI’s text-embedding-3-small, 768 for many open-source models)
- Semantic proximity: Texts with similar meanings produce vectors that are close together in the vector space
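- Compositionality: Relationships are encoded geometrically. In the classic word-embedding example, the vector for “king - man + woman” lands near “queen.” The effect is approximate and is best documented for word-level models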
The numbers themselves are not interpretable by humans. Dimension 847 does not mean “formality” or “topic.” But collectively, the 1536 dimensions encode rich semantic information that emerges from training on billions of text pairs.
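To make the “king - man + woman” arithmetic concrete, here is a minimal sketch on hand-made 3-dimensional vectors. The vectors and the `analogy` helper are invented for illustration. They do not come from a real model, and real embeddings do not have neatly labeled axes like these.

```python
import math

# Hand-made toy vectors (NOT from a real model).
# The axes loosely stand for: royalty, maleness, femaleness.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pizza": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vocab):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vocab[w]))

result = analogy("king", "man", "woman", toy)
print(result)  # queen
```

Excluding the input words from the candidates is the usual convention for this kind of analogy test. Without it, the nearest vector is often one of the inputs.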
graph TD
subgraph inputs["Different Inputs"]
A["'How to reset my password'"]
B["'Password recovery steps'"]
C["'Best pizza in NYC'"]
end
subgraph vectors["Vector Space (1536 dims)"]
VA["[0.23, -0.45, 0.12, ..., 0.67]"]
VB["[0.21, -0.43, 0.14, ..., 0.65]"]
VC["[-0.56, 0.78, -0.33, ..., -0.12]"]
end
subgraph distance["Similarity"]
D["A ↔ B: 0.94 (very similar)"]
E["A ↔ C: 0.12 (unrelated)"]
end
A --> VA
B --> VB
C --> VC
style A fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style B fill:#EEEDFE,stroke:#534AB7,color:#3C3489
style C fill:#F1EFE8,stroke:#888780,color:#444441
style D fill:#E1F5EE,stroke:#0F6E56,color:#085041
style E fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
How embeddings are created
Word embeddings (the foundation)
The original insight came from Word2Vec (2013): train a neural network to predict a word from its context (or vice versa), and the hidden layer weights become useful representations. Words that appear in similar contexts get similar vectors.
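As a rough illustration of the “vice versa” direction (the skip-gram variant, which predicts context words from a center word), the sketch below generates the (center, context) training pairs such a model would see. The function name and window size are illustrative choices, and the actual network training is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs for every token and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words that keep showing up with the same neighbors end up with similar pair sets, which is why their learned vectors drift toward each other.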
Sentence and document embeddings
Modern embedding models go beyond words. They use transformer architectures (similar to LLMs but optimized differently) to encode entire passages into single vectors:
- Input text is tokenized and passed through transformer layers
- Pooling combines all token representations into one vector (mean pooling, CLS token, or attention-weighted)
- Training objective is contrastive: push semantically similar pairs closer together, push dissimilar pairs apart
The training data matters enormously. Models trained on question-answer pairs excel at Q&A retrieval. Models trained on paraphrase pairs excel at semantic similarity. This is why choosing the right embedding model for your use case matters.
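The pooling step mentioned above can be sketched in a few lines. This is a minimal mean-pooling example on toy token vectors, assuming the usual attention-mask convention (1 for real tokens, 0 for padding). The function name and the data are illustrative.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dims = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dims)]

# Three real tokens plus one padding token, each a 4-d toy vector.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
    [9.0, 9.0, 9.0, 9.0],  # padding, must not influence the result
]
mask = [1, 1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 2.0, 1.0, 2.0]
```

Whatever the input length, the output has the same number of dimensions as a single token vector, which is where the fixed dimensionality comes from.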
The contrastive training loop
Positive pair: ("What causes high latency?", "Network congestion increases response times")
Negative pair: ("What causes high latency?", "The best pizza recipe uses fresh mozzarella")
Goal: minimize distance between positive pairs, maximize distance between negative pairs
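One way to turn that goal into a number is a triplet margin loss on cosine distance, sketched below. This is only one of several contrastive objectives (many models use in-batch softmax variants instead), and the margin of 0.2 and the 2-d vectors are arbitrary illustrative values.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [1.0, 0.0]    # e.g. the latency question
related = [0.9, 0.1]   # points almost the same way as the anchor
unrelated = [0.0, 1.0] # orthogonal to the anchor

well_trained = triplet_loss(anchor, positive=related, negative=unrelated)
badly_trained = triplet_loss(anchor, positive=unrelated, negative=related)
print(well_trained, badly_trained)
```

The first loss is zero because the embedding space already separates the pair correctly. The second is large, and during training its gradient would pull the positive closer and push the negative away.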
Measuring similarity
Two vectors are “similar” if they point in roughly the same direction. The standard measures:
Cosine similarity: The cosine of the angle between two vectors. Ranges from -1 (opposite) to 1 (identical direction). Ignores magnitude, focuses on direction. This is the default for most embedding use cases.
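Dot product: Cheaper to compute because it skips the normalization step, but affected by vector magnitude. When vectors are normalized to unit length, it equals cosine similarity.
Euclidean distance: Straight-line distance between vector endpoints. Less common for embeddings because it is affected by magnitude. On unit-length vectors it produces the same ranking as cosine similarity.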
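The three measures are short enough to write out directly. The sketch below uses plain Python lists and two toy vectors that point in the same direction but differ in length, to show which measures react to magnitude.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0  (direction only)
print(dot(a, b))                 # 50.0 (grows with magnitude)
print(euclidean_distance(a, b))  # 5.0  (nonzero despite identical direction)
```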
graph LR
subgraph cosine["Cosine Similarity"]
direction TB
CS1["cos(A, B) = 0.95
Almost identical meaning"]
CS2["cos(A, C) = 0.02
Completely unrelated"]
CS3["cos(A, D) = -0.8
Opposite meaning"]
end
style CS1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style CS2 fill:#F1EFE8,stroke:#888780,color:#444441
style CS3 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
Where embeddings break or get interesting
The curse of dimensionality
In high-dimensional spaces, distances between points tend to concentrate: the nearest and farthest neighbors of a point end up at nearly the same distance. For embeddings, this compresses similarity scores into a narrow band. In 1536 dimensions, the gap between “very similar” (cosine 0.92) and “somewhat related” (cosine 0.78) is still meaningful, but the absolute numbers are harder to interpret than you might expect. You cannot set a universal similarity threshold - it depends on the model, the domain, and the data distribution.
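You can observe the concentration effect with a small experiment. The sketch below uses uniformly random points and Euclidean distance, which is a worst case. Real embeddings are structured, so the effect is milder, but the trend is the same. The function name, point count, and seed are arbitrary choices.

```python
import math
import random

def relative_contrast(dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low_dim = relative_contrast(2)
high_dim = relative_contrast(512)
print(f"2 dims:   contrast = {low_dim:.2f}")
print(f"512 dims: contrast = {high_dim:.2f}")
```

In 2 dimensions the farthest point is many times farther away than the nearest one. In 512 dimensions the two are within a small fraction of each other, which is why raw scores need calibration against your own data.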
Domain mismatch
An embedding model trained on web text will not perform well on medical literature, legal documents, or codebases without domain-specific fine-tuning. The model encodes relationships it saw during training. If your domain uses words differently (“bug” in software vs entomology), general-purpose embeddings will conflate them.
Asymmetric search
When a user types a short query (“how to deploy kubernetes”) and you are matching against long documents, the query and document have very different structures. Some embedding models handle this asymmetry well (trained on query-document pairs), while others assume both inputs are similar in length and style. Models like E5 and BGE handle this with input prefixes or instructions that mark a text as a query or a document.
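As an example of the prefix convention, the E5 model family documents “query: ” and “passage: ” as the strings to prepend. The helper names below are made up, and other models use different strings (or none at all), so check the model card before reusing these.

```python
# Prefix convention documented for the E5 model family.
# Other models differ, so treat these constants as E5-specific.
E5_QUERY_PREFIX = "query: "
E5_PASSAGE_PREFIX = "passage: "

def prepare_query(text):
    """Format a user query before sending it to the embedding model."""
    return E5_QUERY_PREFIX + text.strip()

def prepare_passage(text):
    """Format a document chunk before sending it to the embedding model."""
    return E5_PASSAGE_PREFIX + text.strip()

print(prepare_query("  how to deploy kubernetes "))
# query: how to deploy kubernetes
print(prepare_passage("Kubernetes deployments are described in YAML manifests."))
# passage: Kubernetes deployments are described in YAML manifests.
```

Forgetting the prefix does not raise an error. It usually shows up as quietly worse retrieval quality, so it is worth a unit test in your ingestion code.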
Embedding drift
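If your documents change over time (product updates, new features), embeddings computed months ago still describe the old text, so current queries get matched against stale content. Switching or upgrading the embedding model has a similar effect, because vectors produced by different models are not comparable. You need a re-embedding strategy for living document collections.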
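One simple re-embedding strategy is to store a content hash next to each vector and re-embed only what changed. The sketch below assumes you keep a mapping from document id to the hash that was current at embedding time. The function names and sample documents are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reembedding(current_docs, stored_hashes):
    """Return (ids to embed or re-embed, ids to delete from the index)."""
    to_embed = [
        doc_id for doc_id, text in current_docs.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
    to_delete = [doc_id for doc_id in stored_hashes if doc_id not in current_docs]
    return sorted(to_embed), sorted(to_delete)

# Hashes recorded when the index was last built.
stored = {
    "a": content_hash("Install with pip."),
    "b": content_hash("Old pricing page."),
    "c": content_hash("Removed feature."),
}
# The collection as it looks today.
docs = {
    "a": "Install with pip.",
    "b": "New pricing page.",
    "d": "Brand new article.",
}

to_embed, to_delete = plan_reembedding(docs, stored)
print(to_embed)   # ['b', 'd']
print(to_delete)  # ['c']
```

A model upgrade is the one case this does not cover: the text is unchanged but every vector must be recomputed, so storing the model name alongside the hash is a sensible addition.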
Real-world systems that use embeddings
- OpenAI text-embedding-3 - 1536 dimensions (small) or 3072 dimensions (large) by default, widely used in GPT-based retrieval and RAG applications
- Cohere Embed v3 - trained with compression-aware objectives so vectors hold up under int8 and binary quantization; Matryoshka embeddings (variable dimensionality) arrived with the later Embed v4
- Google Gecko/Gemini Embeddings - text embedding models; Google also offers multimodal embeddings on Vertex AI that work across text, images, and video
- Pinecone, Weaviate, Qdrant, Milvus - vector databases optimized for storing and querying millions to billions of embeddings with low query latency (typically milliseconds, depending on index type and hardware)
- Spotify - embeds songs and user preferences in the same vector space for recommendations
- Airbnb - embeds listings and search queries to match beyond keyword overlap
How to apply embeddings in practice
Choosing an embedding model
| Factor | Consideration |
|---|---|
| Dimension count | Higher = more expressive but more storage/compute. 768-1536 is the sweet spot |
| Domain match | Pick models trained on data similar to yours |
| Sequence length | Most models cap at 512-8192 tokens per input |
| Latency | Smaller models (384 dims) for real-time, larger (3072 dims) for batch |
| Multilingual | Multilingual models if your content spans languages |
The embedding pipeline
- Chunk your content - Break documents into meaningful segments (paragraphs, sections, or semantic chunks)
- Embed at ingestion - Compute embeddings once and store them alongside metadata
- Embed queries at runtime - Convert the user’s query to a vector using the same model
- Nearest neighbor search - Find the k closest stored vectors (ANN algorithms like HNSW)
- Post-process - Rerank, filter by metadata, apply business logic
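The five steps above can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` is a hashed bag-of-words function that captures only word overlap (not meaning), the brute-force scan replaces an ANN index such as HNSW, and all function names are invented for this example.

```python
import math
import zlib

DIMS = 256  # toy size chosen for the example

def toy_embed(text):
    """Stand-in for a real embedding model: hashed bag-of-words, unit length."""
    vec = [0.0] * DIMS
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIMS] += 1.0
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]

def chunk(document):
    """Step 1: split on blank lines so each paragraph becomes one chunk."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def build_index(documents):
    """Step 2: embed every chunk once and keep metadata next to the vector."""
    index = []
    for doc_id, text in documents.items():
        for i, piece in enumerate(chunk(text)):
            index.append({"doc_id": doc_id, "chunk": i, "text": piece,
                          "vector": toy_embed(piece)})
    return index

def search(index, query, k=3, allowed_docs=None):
    """Steps 3-5: embed the query, apply a metadata filter, return the k best rows."""
    q = toy_embed(query)
    rows = [r for r in index if allowed_docs is None or r["doc_id"] in allowed_docs]
    scored = [(sum(a * b for a, b in zip(q, r["vector"])), r) for r in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

docs = {
    "auth": "Reset your password from the login page.\n\nTokens expire after one hour.",
    "food": "Fresh mozzarella makes the best pizza.",
}
index = build_index(docs)
top = search(index, "reset password", k=1)
print(top[0][1]["doc_id"], top[0][1]["text"])
```

In a real system you would swap `toy_embed` for a call to your embedding model and the list scan for a vector database query. The shape of the pipeline stays the same.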
Matryoshka embeddings
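Matryoshka embeddings are a recent technique in which the model is trained so that the first N dimensions of the full embedding are still useful on their own. You can store the full 1536-dimension vector for high-accuracy search, but use only the first 256 dimensions for fast pre-filtering. OpenAI’s text-embedding-3 models support this natively through the dimensions parameter of the embeddings API. If you truncate vectors yourself, re-normalize them to unit length afterward.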
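Here is a minimal sketch of that two-stage idea, assuming the vectors come from a model trained Matryoshka-style (truncating an ordinary model's output generally works much worse). The function names and the 4-d toy vectors are illustrative. A real system would precompute and store the truncated vectors rather than truncating at query time.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = vec[:dims]
    length = math.sqrt(sum(x * x for x in head))
    return [x / length for x in head] if length else head

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, vectors, short_dims, shortlist, k):
    """Shortlist with truncated vectors, then re-score the shortlist with full vectors."""
    q_short = truncate_and_normalize(query, short_dims)
    coarse = sorted(
        range(len(vectors)),
        key=lambda i: dot(q_short, truncate_and_normalize(vectors[i], short_dims)),
        reverse=True,
    )[:shortlist]
    return sorted(coarse, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]

full = lambda v: truncate_and_normalize(v, len(v))  # unit-length full vector
query = full([1.0, 0.0, 1.0, 0.0])
vectors = [
    full([1.0, 0.0, 1.0, 0.0]),   # 0: same as the query
    full([1.0, 0.0, -1.0, 0.0]),  # 1: matches on the first 2 dims only
    full([0.0, 1.0, 0.0, 1.0]),   # 2: unrelated
]

best = two_stage_search(query, vectors, short_dims=2, shortlist=2, k=1)
print(best)  # [0]
```

The truncated pass cannot tell vectors 0 and 1 apart, so both reach the shortlist. The full-vector pass then separates them, which is the accuracy the extra dimensions buy.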
Normalization matters
Normalize your vectors to unit length before storing if you plan to score with the dot product, which is how many systems compute cosine similarity cheaply. True cosine similarity ignores magnitude by definition, and many vector databases normalize automatically. But if you are building custom similarity search on raw dot products, forgetting to normalize leads to magnitude-biased results where vectors with larger norms score higher regardless of relevance.
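A short demonstration of the magnitude bias, using invented 2-d vectors: on raw dot products the off-topic vector wins purely because it is longer, and normalizing fixes the ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    length = math.sqrt(dot(vec, vec))
    return [x / length for x in vec]

query = [1.0, 1.0]
on_topic = [1.0, 1.0]         # same direction as the query
off_topic_long = [10.0, 0.0]  # different direction, much larger magnitude

raw_on = dot(query, on_topic)         # 2.0
raw_off = dot(query, off_topic_long)  # 10.0, so the off-topic vector ranks first

unit_on = dot(normalize(query), normalize(on_topic))         # about 1.0
unit_off = dot(normalize(query), normalize(off_topic_long))  # about 0.71

print(raw_on, raw_off)
print(round(unit_on, 2), round(unit_off, 2))
```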
FAQ
Q: Can I use LLM embeddings from the last hidden layer instead of a dedicated embedding model?
You can, but dedicated embedding models outperform LLM hidden states for retrieval tasks. LLMs are optimized for generation (predicting the next token), not for producing representations where similar texts cluster together. Dedicated models are trained with contrastive objectives specifically designed for similarity. Use LLM embeddings only when you need them for a specialized task and cannot find a suitable embedding model.
Q: How many dimensions do I actually need?
For most production use cases, 768-1536 dimensions are sufficient. Going to 3072 typically provides marginal accuracy gains (on the order of a few percent on retrieval benchmarks) at double the storage cost. Going below 384 often hurts quality noticeably. The Matryoshka approach lets you hedge: store full vectors, search with truncated ones for speed, and re-rank with full vectors for accuracy.
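The storage side of this tradeoff is simple arithmetic. The sketch below assumes uncompressed float32 values (4 bytes each) and counts raw vectors only. Index structures, metadata, and replicas add overhead, and quantization can shrink the numbers considerably.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage only; indexes and metadata are not included."""
    return num_vectors * dims * bytes_per_value

for dims in (384, 768, 1536, 3072):
    gib = raw_vector_bytes(10_000_000, dims) / 2**30
    print(f"{dims:>4} dims, 10M vectors: {gib:6.1f} GiB")
```

Storage grows linearly with dimension count, so 3072 dimensions cost exactly twice what 1536 do and eight times what 384 do.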
Q: Do embeddings understand negation? Will “not good” and “good” be far apart?
This is a known weakness. Many embedding models struggle with negation because they are biased toward the overlap of component words. “This movie is not good” and “this movie is good” may end up closer than you want because both contain “movie” and “good.” Some newer models (such as E5-Mistral and Cohere v3) are reported to handle negation better, but it is worth testing on your specific data. For critical negation handling, consider using a reranker on top of vector search.
Interview questions
Q: You need to build a semantic search system for 10 million product descriptions. Walk me through the architecture.
Strong answers cover: choosing an embedding model appropriate for e-commerce text, chunking strategy for product descriptions (likely whole-description embeddings since they are short), selecting a vector database that handles 10M vectors with low latency (Pinecone, Qdrant, or Milvus with HNSW indexing), the ingestion pipeline (batch embed, upsert to vector DB), query path (embed query, ANN search for top-50, rerank to top-10, return with metadata), and index configuration (dimension count, distance metric, replica count for availability).
Q: Your embedding-based search returns semantically relevant results but users complain they are “too broad.” How do you improve precision without sacrificing recall?
Hybrid approach: combine vector similarity with keyword matching (BM25) and metadata filters. Use a reranker (cross-encoder) on the top-50 vector results to re-score with more expensive but accurate pairwise relevance. Add user feedback signals to learn domain-specific relevance. Consider fine-tuning the embedding model on your domain’s query-document pairs if you have labeled data.
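A common way to combine the vector and BM25 result lists is reciprocal rank fusion (RRF), sketched below. RRF is named here as one option, not the only way to build a hybrid. The document ids are invented, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # ranked by embedding similarity
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # ranked by keyword match

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF uses only ranks, so you do not need to reconcile cosine scores with BM25 scores, which live on different scales. Documents that both retrievers agree on rise to the top, and the fused list is a natural input for the cross-encoder reranker.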
Q: Explain the tradeoff between embedding dimensionality and system performance. When would you choose 384 vs 1536 vs 3072 dimensions?
384 dimensions: real-time applications with millions of vectors where latency and storage costs dominate (e.g., autocomplete, live recommendations). 1536 dimensions: the default for most production RAG and search systems - good balance of quality and efficiency. 3072 dimensions: high-stakes applications where retrieval accuracy is critical and you can afford the 2x storage and slight latency increase (legal search, medical literature). Always benchmark on your specific data - the quality gap between 768 and 1536 might be negligible for your domain.