Embedding Model Selection: Choosing the Right Model for Your Use Case
You build a RAG system for your legal tech startup using OpenAI’s text-embedding-ada-002 because it is the default everyone uses. Retrieval works fine for general questions. Then lawyers start asking about specific case citations, regulatory provisions, and contract clause interpretations. Retrieval quality drops. The embedding model was trained on general web text - it does not understand that “Section 230(c)(1)” and “platform immunity provision” refer to the same concept, or that “force majeure” and “unforeseeable circumstances clause” are semantically related in legal context.
You switch to a legal-domain embedding model. Recall jumps 15%. The model understands legal vocabulary, citation formats, and the relationships between legal concepts because it was trained on legal text. Same architecture, same dimensionality, same retrieval pipeline - but dramatically better results because the model’s training data matches your domain.
Embedding model selection is not about picking the “best” model on a leaderboard. It is about matching model capabilities to your specific requirements: domain, languages, query types, latency constraints, and cost.
What makes embedding models different
All embedding models convert text to vectors, but they differ in:
Training data - A model trained on scientific papers understands academic vocabulary. A model trained on e-commerce understands product descriptions. Training data determines what semantic relationships the model captures.
Training objective - Models trained for semantic similarity (detecting paraphrases) behave differently from models trained for retrieval (matching questions to answers). The objective shapes what “similar” means.
Dimensionality - Common sizes are 384, 768, 1024, 1536, and 3072 dimensions. Higher is not always better - it depends on your data complexity and resource constraints.
Sequence length - How much text the model can embed at once. 512 tokens is common; newer models support 8192+. If your chunks exceed the model’s max length, many libraries truncate them silently, while some APIs reject the request with an error. Either way, the overflow never reaches the vector.
Language support - Some models work only for English. Others are multilingual. Multilingual models often sacrifice some English performance for breadth.
```mermaid
graph TD
    subgraph factors["Selection Factors"]
        F1["Domain Match<br/>Is training data similar to yours?"]
        F2["Task Type<br/>Retrieval vs similarity vs clustering?"]
        F3["Language<br/>English-only vs multilingual?"]
        F4["Dimensions<br/>Accuracy vs cost tradeoff?"]
        F5["Max Tokens<br/>Will your chunks fit?"]
        F6["Latency<br/>Real-time vs batch?"]
    end
    style F1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style F2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style F3 fill:#FAEEDA,stroke:#854F0B,color:#633806
    style F4 fill:#F1EFE8,stroke:#888780,color:#444441
    style F5 fill:#F1EFE8,stroke:#888780,color:#444441
    style F6 fill:#F1EFE8,stroke:#888780,color:#444441
```
Current landscape of embedding models
Commercial APIs
| Model | Dimensions | Max tokens | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 (or less via Matryoshka) | 8191 | Versatile, good default, variable dimensions |
| OpenAI text-embedding-3-small | 1536 | 8191 | Cheaper, still strong performance |
| Cohere embed-v3 | 1024 | 512 | Strong multilingual, compression-aware |
| Voyage AI voyage-large-2 | 1536 | 16000 | Code-aware, long context |
| Google text-embedding-004 | 768 | 2048 | Competitive quality, integrated with Vertex |
Open-source models (self-hosted)
| Model | Dimensions | Max tokens | Strengths |
|---|---|---|---|
| BGE-large-en-v1.5 | 1024 | 512 | Top MTEB scores, English |
| E5-mistral-7b-instruct | 4096 | 32768 | Very long context, instruction-following |
| GTE-Qwen2 | 1536 | 8192 | Strong multilingual |
| nomic-embed-text | 768 | 8192 | Long context, open weights |
| jina-embeddings-v3 | 1024 | 8192 | Task-specific adapters, multilingual |
Domain-specific models
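- Legal: legal-bert, SaulLM
- Medical: PubMedBERT, BioGPT
- Code: Voyage-code-2, CodeBERT, StarEncoder
- Financial: FinBERT, BloombergGPT (not publicly released)

Most of these are domain-pretrained language models rather than purpose-built retrieval embedders, so expect to add a pooling step and usually some fine-tuning before using them for retrieval.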
How to evaluate embedding models
Step 1: Define your retrieval task
What are users searching for? What should they find?
- Question → Answer: User asks a question, retrieval finds the answer passage
- Query → Document: User describes a need, retrieval finds relevant documents
- Document → Document: Find documents similar to a given document
- Short → Long: Short queries matching long passages (asymmetric)
Step 2: Build an evaluation dataset
Create 100-200 (query, relevant_document) pairs from your actual data:
```python
eval_set = [
    {"query": "What's the refund policy for annual plans?",
     "relevant_doc_ids": ["policy-doc-42", "faq-refunds-3"]},
    {"query": "How to configure SSO with Okta?",
     "relevant_doc_ids": ["sso-guide-7"]},
    # ... 100+ pairs
]
```
Step 3: Measure retrieval metrics
For each model candidate:
- Embed all documents in your corpus
- For each eval query, retrieve top-k results
- Compute metrics:
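```python
from statistics import mean


def evaluate_model(model, eval_set, corpus, k=10):
    """Compute Recall@k and MRR for one candidate model.

    `model.embed` and `search` are placeholders for your embedding
    client and your vector search function.
    """
    recall_at_k = []
    reciprocal_ranks = []
    for item in eval_set:
        relevant = set(item["relevant_doc_ids"])
        query_embedding = model.embed(item["query"])
        results = search(query_embedding, corpus, top_k=k)
        result_ids = [r.id for r in results]
        # Recall@k: what fraction of relevant docs appear in the top k?
        hits = len(set(result_ids) & relevant)
        recall_at_k.append(hits / len(relevant))
        # MRR: reciprocal rank of the first relevant doc, 0 if none was found.
        # Misses must count as 0, otherwise MRR is inflated.
        rr = 0.0
        for rank, rid in enumerate(result_ids, 1):
            if rid in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return {f"recall@{k}": mean(recall_at_k), "mrr": mean(reciprocal_ranks)}
```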
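The `search` call in the evaluation function is left undefined. For an eval corpus of a few thousand documents, an exhaustive cosine-similarity scan is usually enough. The following is a minimal sketch under assumptions that are not in the original: the corpus is a dict mapping document id to vector, and `SearchResult` is a hypothetical result type exposing the `.id` attribute the evaluation code reads.

```python
import math
from dataclasses import dataclass


@dataclass
class SearchResult:
    id: str       # hypothetical result type; only `.id` is needed by the eval loop
    score: float


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_embedding, corpus, top_k=10):
    """Score every document against the query and return the top_k matches.

    Assumes `corpus` is a dict of doc id -> embedding vector.
    """
    scored = [
        SearchResult(id=doc_id, score=cosine_similarity(query_embedding, vector))
        for doc_id, vector in corpus.items()
    ]
    scored.sort(key=lambda r: r.score, reverse=True)
    return scored[:top_k]
```

A brute-force scan also removes approximate-nearest-neighbor error from the comparison, so differences between candidates reflect the embedding models rather than index tuning.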
Step 4: Factor in operational costs
A 2-point recall gain at 10x the cost might not be worth it:
Model A: recall@10 = 0.89, cost = $0.02/1K tokens, latency = 50ms
Model B: recall@10 = 0.91, cost = $0.20/1K tokens, latency = 200ms
Model C: recall@10 = 0.87, cost = no per-token fee (self-hosted; you pay for infrastructure instead), latency = 20ms
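One way to make this tradeoff explicit is to treat latency and cost as hard budgets and pick the highest-recall model that fits. The sketch below uses the three illustrative models above; the `Candidate` type, the `pick_model` helper, and the budget values are assumptions for illustration, and Model C's per-token cost is entered as 0.0 even though self-hosting carries infrastructure cost.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    recall_at_10: float
    cost_per_1k_tokens: float  # USD per 1K tokens; 0.0 means no per-token fee
    latency_ms: float


def pick_model(candidates, max_latency_ms, max_cost_per_1k_tokens):
    """Return the highest-recall candidate within both budgets, or None."""
    eligible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and c.cost_per_1k_tokens <= max_cost_per_1k_tokens
    ]
    return max(eligible, key=lambda c: c.recall_at_10, default=None)


candidates = [
    Candidate("Model A", 0.89, 0.02, 50),
    Candidate("Model B", 0.91, 0.20, 200),
    Candidate("Model C", 0.87, 0.00, 20),
]

# With a 100 ms latency budget and a $0.05/1K token budget, Model B is excluded.
choice = pick_model(candidates, max_latency_ms=100, max_cost_per_1k_tokens=0.05)
print(choice.name if choice else "no model fits the budgets")
```

Hard budgets are a simplification. If a small recall gain is worth real money to you, a weighted score may fit better than a filter.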
```mermaid
graph LR
    subgraph eval["Evaluation Process"]
        E1["Define task type"]
        E2["Build eval dataset<br/>(100+ query-doc pairs)"]
        E3["Benchmark candidates<br/>(recall, MRR, latency)"]
        E4["Factor costs<br/>(API vs self-hosted)"]
        E5["Select winner"]
    end
    E1 --> E2 --> E3 --> E4 --> E5
    style E1 fill:#F1EFE8,stroke:#888780,color:#444441
    style E2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style E3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style E4 fill:#FAEEDA,stroke:#854F0B,color:#633806
    style E5 fill:#E1F5EE,stroke:#0F6E56,color:#085041
```
Where model selection gets interesting
Instruction-tuned embedding models
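Many newer models take a task hint that tells the model what type of embedding to produce. The E5 family uses text prefixes such as `query:` and `passage:`, E5-instruct variants take a natural-language instruction in front of the query, and Jina v3 selects a task-specific adapter through a task parameter. The example below uses E5-style prefixes:
```python
# Same model, different behavior based on prefix
# (`model.embed` is a placeholder for your embedding client)
query_embedding = model.embed("query: What causes high latency?")
doc_embedding = model.embed("passage: Network congestion increases response times")
```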
This handles the asymmetric search problem (short query vs long document) without needing two separate models.
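Forgetting the prefix on one side is an easy mistake that quietly degrades retrieval, so it can help to route all text through a single helper. This is a small sketch that assumes the `query: ` / `passage: ` convention shown above; the `PREFIXES` table and `with_prefix` name are hypothetical, and the correct strings for any given model should be taken from its model card.

```python
# Assumed prefix convention; other models use different strings or none at all.
PREFIXES = {"query": "query: ", "passage": "passage: "}


def with_prefix(text, role):
    """Prepend the role-specific prefix expected by a prefix-based embedding model."""
    if role not in PREFIXES:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(PREFIXES)}")
    return PREFIXES[role] + text.strip()


print(with_prefix("What causes high latency?", "query"))
print(with_prefix("Network congestion increases response times", "passage"))
```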
Matryoshka representations
Models in OpenAI’s text-embedding-3 family produce vectors where the first N dimensions, once re-normalized to unit length, are a valid (lower-quality) embedding. You can store the full 3072-dimension vector but search using only the first 256 dimensions for speed, then re-rank with full dimensions:
```python
# `search` and `rerank_by_full_similarity` are placeholders for your own functions.
# Quick search with truncated vectors (re-normalize the slice if you use dot product)
coarse_results = search(query[:256], index_256dim, top_k=100)
# Re-rank the candidates with the full-length query vector
final_results = rerank_by_full_similarity(query, coarse_results, top_k=10)
```
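The two placeholder calls can be filled in with a few lines of plain Python. The following is a minimal sketch, not a production index: it assumes `docs` is a dict of document id to full-length unit vector, and it truncates document vectors on the fly, whereas a real system would precompute and index the truncated vectors. The function names are illustrative.

```python
import math


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query, docs, coarse_dims, coarse_k, final_k):
    """Coarse pass on truncated vectors, then re-rank the survivors on full vectors.

    Assumes `docs` maps doc id -> full-length unit vector.
    """
    q_coarse = truncate_and_normalize(query, coarse_dims)
    coarse = sorted(
        docs,
        key=lambda d: dot(q_coarse, truncate_and_normalize(docs[d], coarse_dims)),
        reverse=True,
    )[:coarse_k]
    q_full = truncate_and_normalize(query, len(query))
    return sorted(coarse, key=lambda d: dot(q_full, docs[d]), reverse=True)[:final_k]
```

The re-normalization step matters: a truncated slice of a unit vector is no longer unit length, so dot-product scores across documents are not comparable until each slice is rescaled.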
Fine-tuning embeddings
If off-the-shelf models underperform on your domain, fine-tune with your own (query, positive, negative) triplets:
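```python
training_data = [
    {"query": "payment failed error",
     "positive": "Troubleshooting transaction failures...",
     "negative": "Setting up payment methods..."},
]
```
Even 500-1000 high-quality triplets can noticeably improve domain-specific recall. Gains of 10-20% are plausible when the base model is a poor fit for your domain, but measure the effect on your own eval set.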
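To see what a triplet teaches the model, it helps to look at the loss it is trained against. The article does not say which objective to use; the sketch below shows one common choice, a triplet margin loss over cosine distance, computed on already-embedded vectors. The margin of 0.2 is an arbitrary illustrative value.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(query_vec, positive_vec, negative_vec, margin=0.2):
    """max(0, d(query, positive) - d(query, negative) + margin).

    The loss is zero once the positive is closer to the query than the
    negative by at least `margin`; otherwise training pushes them apart.
    """
    return max(
        0.0,
        cosine_distance(query_vec, positive_vec)
        - cosine_distance(query_vec, negative_vec)
        + margin,
    )
```

This is why negatives that look similar to the positive (like the payment-setup passage above) are valuable: easy negatives already sit outside the margin and contribute zero loss.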
Multilingual considerations
For multilingual RAG, you need a model that maps equivalent concepts in different languages to similar vectors. Test this specifically: embed a question in English and its translation in French, and check whether both retrieve the same documents. Models like Cohere’s multilingual embed-v3 and GTE-Qwen2 are designed for this. English-only models generally perform poorly on non-English queries.
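The translation test can be turned into a number by comparing the top-k results for each query and its translation. The sketch below averages the Jaccard overlap across pairs; the `retrieve(query, k)` callable is an assumption standing in for your own retrieval pipeline, and Jaccard is just one reasonable overlap measure.

```python
def cross_lingual_overlap(retrieve, query_pairs, k=10):
    """Mean Jaccard overlap of top-k doc ids for each (query, translation) pair.

    `retrieve(query, k)` is an assumed callable returning a list of doc ids.
    A score of 1.0 means every translated query retrieved the same documents.
    """
    if not query_pairs:
        raise ValueError("query_pairs must not be empty")
    overlaps = []
    for original, translated in query_pairs:
        a = set(retrieve(original, k))
        b = set(retrieve(translated, k))
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return sum(overlaps) / len(overlaps)
```

Run it per language pair rather than pooling all languages, so a weak language does not hide behind strong ones.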
Real-world model choices
- Pinecone - recommends text-embedding-3-small for cost-effective general use, offers built-in inference for common models
- Anthropic - does not offer its own embedding model; its documentation points developers to Voyage AI as an embeddings provider
- LangChain - supports 20+ embedding providers through a unified interface, making model switching easy
- Hugging Face MTEB leaderboard - the standard benchmark for comparing embedding models across tasks
How to apply in practice
Start with text-embedding-3-small for prototyping. It is cheap, has long context (8K tokens), and performs well across most domains. Only switch when your eval shows specific weaknesses.
Always evaluate on YOUR data. MTEB leaderboard rankings use academic benchmarks. A model ranked #1 on MTEB might be #5 for your specific domain and query patterns. Build your own eval set.
Match sequence length to chunk size. If your chunks are 800 tokens and your model’s max is 512, you are silently truncating every chunk. Either reduce chunk size or pick a longer-context model.
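A quick audit of chunk lengths catches this before it reaches production. The sketch below is hedged in one important way: `rough_token_count` is a crude whitespace-based stand-in, and real tokenizers usually produce more tokens than words, so pass your model's actual tokenizer-based counter as `count_tokens` for a reliable result.

```python
def find_oversized_chunks(chunks, max_tokens, count_tokens):
    """Return (index, token_count) for every chunk longer than max_tokens.

    `count_tokens` should be a function based on the embedding model's tokenizer.
    """
    oversized = []
    for i, chunk in enumerate(chunks):
        n = count_tokens(chunk)
        if n > max_tokens:
            oversized.append((i, n))
    return oversized


def rough_token_count(text):
    """Crude stand-in: counts whitespace-separated words, which undercounts tokens."""
    return len(text.split())


chunks = ["short chunk", "word " * 800]
for index, count in find_oversized_chunks(chunks, max_tokens=512, count_tokens=rough_token_count):
    print(f"chunk {index} has about {count} tokens and would be cut off at 512")
```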
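Consider the embed-once-search-many pattern. You embed documents once during ingestion but embed queries at every request. Queries and documents must be embedded by the same model so that their vectors share one space, which means query-time latency constrains the model you pick for both sides. Ingestion can absorb slow, batched embedding jobs; the query path cannot. Measure single-query embedding latency before committing to a large model, and reduce dimensions if vector search itself is the bottleneck.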
Plan for model upgrades. When a better model comes out (happens every few months), you need to re-embed your entire corpus. Store original text alongside vectors so re-embedding is a pipeline run, not a research project.
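In practice this means each stored record carries its source text and the name of the model that produced its vector. The sketch below shows one possible record shape and a re-embedding pass; `StoredChunk`, `reembed`, and `embed_fn` are hypothetical names, and `embed_fn` stands in for a call to whichever new model you adopt.

```python
from dataclasses import dataclass


@dataclass
class StoredChunk:
    doc_id: str
    text: str          # original text kept so the chunk can be re-embedded later
    vector: list
    model_name: str    # which model produced `vector`


def reembed(chunks, embed_fn, new_model_name):
    """Return records embedded with the new model, leaving the inputs untouched.

    Chunks already tagged with `new_model_name` are passed through unchanged,
    so an interrupted run can be resumed without repeating work.
    """
    updated = []
    for chunk in chunks:
        if chunk.model_name == new_model_name:
            updated.append(chunk)
            continue
        updated.append(
            StoredChunk(chunk.doc_id, chunk.text, embed_fn(chunk.text), new_model_name)
        )
    return updated
```

Vectors from different models are not comparable, so write the new vectors to a separate index and switch queries over only once the whole corpus has been re-embedded.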
FAQ
Q: Should I use the same model for query and document embedding?
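Usually yes - query and document vectors must live in the same vector space to be comparable. Models that support asymmetric search with prefixes (E5, Jina) still use the same base model with different prompts. Rerankers take a different approach: a cross-encoder scores each query-document pair jointly instead of embedding the two sides separately. That is more accurate but too expensive to run over a whole corpus, so it is applied as a second stage to the top candidates. For the primary retrieval stage, same-model embedding is standard and works well.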
Q: How much does embedding model quality actually matter vs other RAG components?
There is no universal breakdown, but as a rough rule of thumb for a mature RAG pipeline, the embedding model accounts for perhaps 30-40% of retrieval quality, chunking strategy for about 30%, reranking for 15-20%, and query processing (rewriting, expansion) for 10-15%. Switching from a mediocre embedding model to a domain-appropriate one might lift recall from 0.75 to 0.85. But fixing chunking could lift it from 0.75 to 0.82 without changing the model. Optimize all components, not just the model.
Q: Self-hosted vs API embedding - when does self-hosting make sense?
Self-host when: (1) data cannot leave your infrastructure (compliance, security), (2) you need >10M embedding calls/month (cost crossover point), (3) you need <10ms latency (API calls add network overhead), or (4) you plan to fine-tune. Use APIs when: prototyping, moderate volume (<1M calls/month), you want zero infrastructure management, or you need the latest models without managing GPU instances.
Interview questions
Q: You are building a multilingual customer support RAG system (English, Spanish, Japanese, German). How do you select and evaluate an embedding model?
Requirements: must map equivalent queries across 4 languages to similar vectors. Evaluation approach: create a cross-lingual eval set - same question in all 4 languages should retrieve the same document. Candidates: Cohere embed-v3 (strong multilingual), GTE-Qwen2, E5-multilingual. Test each on: (1) same-language retrieval (English query → English doc), (2) cross-language retrieval (Spanish query → English doc), (3) language-specific retrieval (Japanese query → Japanese doc). Weight metrics by traffic distribution. Consider: documents might need translation at indexing time if the model struggles with cross-language matching. Fallback: language-specific indexes with query language detection for routing.
Q: Your RAG system uses OpenAI text-embedding-3-large (3072 dims). Storage costs are growing as you scale to 50M vectors. How do you reduce costs without significantly hurting quality?
Matryoshka approach: truncate stored vectors from 3072 to 1024 or 768 dimensions and re-normalize them. OpenAI’s model is trained for this - the first N dimensions are a valid embedding, and the API’s dimensions parameter can return shortened vectors directly. Expected quality loss: typically a few percent of recall, to be confirmed on your own eval set. Storage savings: 67% at 1024 dimensions, 75% at 768. Implementation: (1) benchmark truncated dimensions on your eval set to find the sweet spot (often 1024 for minimal quality loss), (2) if you still need full-precision ranking, use two-stage search - coarse search over the truncated index, then rerank the top candidates with full vectors kept in cheaper storage outside the index, (3) consider quantization on top of truncation (int8 instead of float32 is another 4x savings). Alternative: switch to a smaller model entirely (text-embedding-3-small at 1536 dims also costs less per token). Always measure recall impact before committing to cost optimizations.
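The storage arithmetic behind those percentages is worth working through once. This sketch counts raw vector bytes only, in decimal gigabytes; index overhead, metadata, and replication are deliberately ignored and will add to the real figure.

```python
def index_size_gb(num_vectors, dims, bytes_per_value):
    """Raw vector storage in decimal GB, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value / 1e9


NUM_VECTORS = 50_000_000

full_gb = index_size_gb(NUM_VECTORS, 3072, 4)       # float32, full size  -> 614.4 GB
truncated_gb = index_size_gb(NUM_VECTORS, 1024, 4)  # float32, truncated  -> 204.8 GB
quantized_gb = index_size_gb(NUM_VECTORS, 1024, 1)  # int8, truncated     -> 51.2 GB

print(f"full: {full_gb:.1f} GB")
print(f"truncated to 1024: {truncated_gb:.1f} GB ({1 - truncated_gb / full_gb:.0%} saved)")
print(f"truncated + int8: {quantized_gb:.1f} GB ({1 - quantized_gb / full_gb:.0%} saved)")
```

Truncation and quantization multiply: 3x from dimensions times 4x from precision gives a 12x reduction overall, which is why the two are usually evaluated together.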
Q: Compare embedding-based retrieval with BM25 keyword search. When does each win, and how would you combine them?
Embeddings win when: queries and documents use different vocabulary for the same concept, meaning matters more than exact wording, queries are natural-language questions rather than keyword lists. BM25 wins when: exact terms matter (product SKUs, error codes, proper nouns), the domain has specific jargon that general embedding models do not capture, documents are keyword-dense (API references, logs). Combine via hybrid search: run both, merge with reciprocal rank fusion (RRF). Weight BM25 more heavily for structured/technical content and embeddings more heavily for natural-language content. A reranker on top of the merged results then refines the final ordering. In practice, hybrid search often outperforms either method alone, but confirm that on your own eval set.
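Reciprocal rank fusion needs only the ranked id lists from each retriever, not their scores, which is what makes it easy to combine BM25 with embeddings. A minimal sketch follows; `k=60` is the constant conventionally used with RRF, and the optional per-retriever `weights` are an extension added here to illustrate the weighting idea, not part of the basic formula.

```python
def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse ranked lists of doc ids; each list adds weight / (k + rank) per doc.

    `rankings` is a list of ranked id lists, best first. Returns the fused
    ranking, best first.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("weights must have one entry per ranking")
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "d"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['b', 'c', 'a', 'd']
```

Documents found by both retrievers rise to the top even if neither ranked them first, which is the behavior you want from a hybrid.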