Reranking: The Secret Weapon for RAG Precision
Your RAG system retrieves the top 10 chunks for a user’s question. Chunks 1 and 2 are marginally relevant - they mention the right topic but do not answer the question. Chunk 7, buried in the results, contains the exact answer. The LLM reads all 10 chunks but gives the most attention to chunks 1 and 2 (they come first). The response is vague and generic because the model latched onto the wrong context.
You add a reranker between retrieval and generation. It re-scores the 10 chunks by actually reading each one alongside the query. Chunk 7 jumps to position 1. Chunks 1 and 2 drop to positions 6 and 8. Now the LLM sees the best context first. The response is precise and directly answers the question.
Reranking often improves answer quality by 10-15% with minimal architectural change, though the gain depends on how well first-stage retrieval already performs. It is the single highest-ROI upgrade most RAG systems can make after basic retrieval is working.
What reranking actually is
Reranking is a second-stage retrieval step that takes a set of candidate documents (from first-stage retrieval) and re-scores them for relevance to the query using a more expensive but more accurate model.
The key difference from first-stage retrieval:
First-stage (bi-encoder): Embeds query and documents independently, compares via dot product. Fast (can search millions of vectors in milliseconds) but imprecise (independent encoding misses query-document interactions).
Reranker (cross-encoder): Processes query and document together in a single forward pass, allowing full attention between all query tokens and all document tokens. Slow (must run once per candidate) but highly accurate (sees all interactions).
graph TD
    subgraph stage1["Stage 1: Retrieval (Fast, Broad)"]
        Q1["Query"] --> EMB["Bi-Encoder<br/>(independent embedding)"]
        EMB --> ANN["ANN Search<br/>10M docs → top 50"]
    end
    subgraph stage2["Stage 2: Reranking (Slow, Precise)"]
        Q2["Query + Doc pairs"] --> CE["Cross-Encoder<br/>(joint encoding)"]
        CE --> SCORE["Re-score<br/>top 50 → top 5"]
    end
    ANN --> Q2
    style EMB fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style ANN fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style CE fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style SCORE fill:#EEEDFE,stroke:#534AB7,color:#3C3489
How cross-encoders work
A bi-encoder embeds query and document separately - they never “see” each other. A cross-encoder concatenates them and processes them together:
Bi-encoder:
query_vec = encode("What causes memory leaks?")
doc_vec = encode("Circular references prevent garbage collection...")
score = dot_product(query_vec, doc_vec)
Cross-encoder:
score = model("[CLS] What causes memory leaks? [SEP] Circular references prevent garbage collection... [SEP]")
The cross-encoder’s full attention mechanism lets it:
- Attend from “memory leaks” in the query to “garbage collection” in the document
- Understand negation: “This is NOT about memory leaks” gets a low score
- Weigh how directly the document answers the specific question (not just topical similarity)
This is why cross-encoders are more accurate: they have complete information about the query-document relationship. But they cannot be pre-computed (you need the query to run them), which is why they are only used on a small candidate set.
The retrieve-then-rerank pipeline
def retrieve_and_rerank(query, top_k=5, num_candidates=50):
    # embed, vector_search, and reranker are placeholders for your own components
    # Stage 1: Fast retrieval (milliseconds)
    query_embedding = embed(query)
    candidates = vector_search(query_embedding, top_k=num_candidates)
    # Stage 2: Precise reranking (hundreds of ms)
    pairs = [(query, doc.text) for doc in candidates]
    scores = reranker.score(pairs)
    # Sort by reranker score, highest first
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in reranked[:top_k]]
Candidate count: Retrieve 20-100 candidates in stage 1. More candidates give the reranker more to work with, but cost more to re-score. The sweet spot depends on your first-stage recall and reranker throughput.
Final k: Return 3-5 reranked results to the LLM. The reranker ensures these are the truly best matches.
graph LR
    subgraph compare["Bi-Encoder vs Cross-Encoder"]
        direction TB
        BI["Bi-Encoder<br/>• Independent encoding<br/>• Pre-computable<br/>• Fast (millions/sec)<br/>• Lower accuracy"]
        CR["Cross-Encoder<br/>• Joint encoding<br/>• Needs query at runtime<br/>• Slow (100-1000/sec)<br/>• Higher accuracy"]
    end
    style BI fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style CR fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Available reranker models
Commercial APIs
| Model | Latency | Strengths |
|---|---|---|
| Cohere Rerank v3 | ~100ms for 25 docs | Best general-purpose, multilingual |
| Voyage Rerank | ~80ms for 25 docs | Strong for code and technical content |
| Jina Reranker v2 | ~60ms for 25 docs | Fast, good quality/cost ratio |
Open-source (self-hosted)
| Model | Size | Strengths |
|---|---|---|
| bge-reranker-v2-m3 | 568M params | Strong multilingual, MTEB top scores |
| ms-marco-MiniLM-L-12 | 33M params | Fast, good for English |
| cross-encoder/ms-marco-electra | 110M params | Balanced quality/speed |
| mixedbread-ai/mxbai-rerank | 137M params | Competitive with commercial |
LLM-based reranking
Use an LLM itself as a reranker by asking it to judge relevance:
prompt = f"""
Rate the relevance of this document to the query on a scale of 0-10.
Query: {query}
Document: {document}
Relevance score:
"""
score = float(llm.generate(prompt))
More expensive but can be more accurate for complex relevance judgments. Use for high-stakes applications where a 33M parameter reranker is not sufficient.
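Calling `float()` directly on model output fails as soon as the LLM adds any words around the number. The sketch below shows one way to parse the reply more defensively. The function name `parse_relevance_score` and the choice to clamp to the 0-10 scale are assumptions for illustration, not part of any particular LLM client.

```python
import re
from typing import Optional

# Matches the first integer or decimal number in the model's reply.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_relevance_score(raw_output: str, low: float = 0.0, high: float = 10.0) -> Optional[float]:
    """Extract a relevance score from free-form LLM output.

    Returns None when no number is present, so the caller can decide
    whether to retry or fall back to the first-stage score.
    """
    match = _NUMBER.search(raw_output)
    if match is None:
        return None
    value = float(match.group())
    # Clamp to the scale requested in the prompt.
    return max(low, min(high, value))
```

Because this takes the first number it finds, a reply that restates the scale ("on a scale of 0-10...") would be misread, so it still helps to instruct the model to answer with the number only.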
Where reranking breaks or gets interesting
Latency budget
Reranking adds 50-200ms per request (depending on model size and candidate count). For real-time applications with strict latency SLAs (for example, <200ms for the retrieval stage), you need to budget carefully. Typical per-stage figures:
- First-stage retrieval: 10-30ms
- Reranking 20 candidates: 50-150ms
- LLM generation: 500-2000ms (dominates anyway)
Since LLM generation typically dominates latency, the reranking overhead is usually acceptable.
Cost at scale
Reranking N candidates per query means scoring N query-document pairs; batching improves throughput, but compute still scales linearly with N. At 100 queries/second with 50 candidates each, that is 5,000 reranker inferences/second. Self-hosting a reranker model is often more cost-effective than API calls at this scale.
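The arithmetic above can be turned into a rough capacity estimate. This is a back-of-the-envelope sketch: `reranker_replicas_needed` is a hypothetical helper, and the per-replica throughput is an assumed input that you would measure for your own model and hardware.

```python
import math


def reranker_replicas_needed(queries_per_second: float,
                             candidates_per_query: int,
                             pairs_per_second_per_replica: float) -> int:
    """Estimate how many reranker replicas are needed to keep up with load."""
    # Every candidate is one query-document pair to score.
    pairs_per_second = queries_per_second * candidates_per_query
    return math.ceil(pairs_per_second / pairs_per_second_per_replica)


# 100 queries/second x 50 candidates = 5,000 pairs/second.
# 1,000 pairs/second per replica is an assumed figure, not a benchmark.
replicas = reranker_replicas_needed(100, 50, 1000)
```

The estimate ignores burst traffic and batching effects, so treat the result as a lower bound when sizing a deployment.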
The candidate pool problem
If the relevant document is not in your initial candidates (because first-stage retrieval missed it), reranking cannot help - it only re-orders what you already have. Reranking improves precision, not recall. If recall is your problem, fix your first-stage retrieval.
Score calibration
Cross-encoder scores are relative, not absolute. A score of 0.8 does not mean “80% relevant” - it means “more relevant than anything scoring below 0.8 for this specific query.” A fixed threshold applied across queries is therefore unreliable unless you have validated it on labeled data for your model. What you can do: use scores for ranking and relative confidence, not absolute judgments.
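One way to express "relative confidence" is to normalize the scores within a single query's candidate set. The sketch below uses a softmax for this; that choice is an assumption for illustration (min-max scaling or rank-based weights would also work), and `relative_confidence` is a hypothetical helper name.

```python
import math
from typing import List


def relative_confidence(scores: List[float]) -> List[float]:
    """Softmax over one query's reranker scores.

    The output says how much of the candidate set's total weight each
    document holds. It is comparable within a query, not across queries.
    """
    if not scores:
        return []
    top = max(scores)
    # Subtracting the max keeps exp() numerically stable.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Shifting every raw score by the same constant leaves the result unchanged, which matches the point above: only the differences between candidates for the same query carry information.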
Passage length sensitivity
Some rerankers are biased toward longer passages (more tokens = more potential matches). If your chunks vary significantly in length, this bias can cause longer chunks to rank higher regardless of relevance. Mitigate by normalizing for length or using models trained to be length-invariant.
Advanced reranking patterns
Multi-stage reranking
For very large candidate sets, use cascading rerankers of increasing quality:
Stage 1: ANN retrieval → 1000 candidates
Stage 2: Fast reranker (33M params) → top 50
Stage 3: Strong reranker (568M params) → top 10
Stage 4: LLM-as-judge → top 3 (optional, for high-stakes)
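The cascade can be sketched as a loop over (scorer, keep) pairs, where each stage re-scores only the survivors of the previous one. Everything here is illustrative: `cascade_rerank` is a hypothetical helper, and the two toy scorers stand in for a fast and a strong reranker rather than calling real models.

```python
from typing import Callable, List, Sequence, Tuple

# (query, document text) -> relevance score; higher means more relevant.
Scorer = Callable[[str, str], float]


def cascade_rerank(query: str,
                   docs: Sequence[str],
                   stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Apply each stage's scorer to the survivors and keep its top `keep`."""
    survivors = list(docs)
    for scorer, keep in stages:
        survivors = sorted(survivors, key=lambda d: scorer(query, d), reverse=True)[:keep]
    return survivors


# Toy stand-ins for a cheap and an expensive reranker (not real models).
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def contains_phrase(query: str, doc: str) -> float:
    return 1.0 if query.lower() in doc.lower() else 0.0


docs = ["python memory", "what causes memory leaks", "cooking pasta", "memory leaks in python"]
top = cascade_rerank("memory leaks", docs, [(word_overlap, 3), (contains_phrase, 1)])
```

Ordering stages from cheapest to most expensive means the costly scorer only ever sees a small set, which is the point of the cascade.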
Listwise reranking
Instead of scoring each document independently, some approaches score documents relative to each other:
prompt = """
Given the query: "{query}"
Rank these documents from most to least relevant:
[1] {doc_1}
[2] {doc_2}
[3] {doc_3}
...
"""
# Model outputs: [3, 1, 2] (ranking order)
Listwise reranking can capture relative differences better but is more expensive and harder to parallelize.
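Listwise output needs validation before use, because a model may skip, repeat, or invent document labels. The following is a minimal sketch of such a parser; `parse_listwise_ranking` is a hypothetical helper, and appending omitted documents in their original order is one possible fallback policy among several.

```python
import re
from typing import List


def parse_listwise_ranking(raw_output: str, num_docs: int) -> List[int]:
    """Turn output such as "[3, 1, 2]" into zero-based document indices.

    Out-of-range and repeated labels are ignored; documents the model
    left out are appended in their original order.
    """
    order: List[int] = []
    for token in re.findall(r"\d+", raw_output):
        index = int(token) - 1  # labels in the prompt are 1-based
        if 0 <= index < num_docs and index not in order:
            order.append(index)
    missing = [i for i in range(num_docs) if i not in order]
    return order + missing
```

The result is always a full permutation of the input positions, so downstream code can index into the candidate list without extra checks.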
Diversity-aware reranking
Standard reranking might surface 5 documents that all say the same thing (highest relevance, but redundant). Maximal Marginal Relevance (MMR) balances relevance with diversity:
def mmr_rerank(query, candidates, lambda_param=0.7, k=5):
    # similarity() is a placeholder, e.g. cosine similarity between embeddings
    selected = []
    remaining = list(candidates)
    while len(selected) < k and remaining:
        best_score = float("-inf")
        best_doc = None
        for doc in remaining:
            relevance = similarity(query, doc)
            # Redundancy: closest match among the documents already selected
            redundancy = max((similarity(doc, s) for s in selected), default=0)
            mmr_score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if mmr_score > best_score:
                best_score = mmr_score
                best_doc = doc
        selected.append(best_doc)
        remaining.remove(best_doc)
    return selected
Real-world reranking usage
- Perplexity - reranks web search results before feeding to the generation model
- Cohere - provides Rerank as a standalone API, used in hundreds of production RAG systems
- Pinecone - integrated reranking in their RAG pipeline offerings
- Google Search - uses BERT-based reranking (since 2019) to re-score initial keyword results
- Amazon Product Search - multi-stage ranking from millions of products to the displayed results
How to apply in practice
Add reranking as your first RAG quality upgrade. If you have basic retrieval working and want to improve answer quality, adding a reranker typically gives the biggest single improvement. It is a drop-in addition - no changes to chunking, embedding, or generation needed.
Retrieve broadly, rerank precisely. Get top-50 from first stage, rerank to top-5. The over-retrieval ensures the reranker has good candidates to work with. Under-retrieving (top-5 then reranking top-5) lets the reranker reorder those five but never surface a better chunk from further down the list.
Benchmark reranking impact on YOUR data. On well-tuned retrieval systems, reranking might only add 3-5%. On poorly-tuned systems, it might add 15-20%. Measure before and after to justify the latency/cost.
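A before/after comparison needs a ranking metric computed on a labeled eval set. The sketch below uses mean reciprocal rank (MRR), which rewards placing the first relevant chunk near the top. MRR is one possible choice here, not something this article prescribes, and the eval-set shape (a ranked list of IDs plus a set of relevant IDs per query) is an assumption.

```python
from typing import Sequence, Set, Tuple

# One entry per eval query: (ranked document IDs, IDs labeled relevant).
Run = Tuple[Sequence[str], Set[str]]


def mean_reciprocal_rank(runs: Sequence[Run]) -> float:
    """Average of 1/rank of the first relevant document (0 if none is found)."""
    if not runs:
        return 0.0
    total = 0.0
    for ranked_ids, relevant_ids in runs:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(runs)


# Toy data: the same two queries before and after reranking.
before = [(["a", "b", "c"], {"c"}), (["x", "y"], {"x"})]
after = [(["c", "a", "b"], {"c"}), (["x", "y"], {"x"})]
gain = mean_reciprocal_rank(after) - mean_reciprocal_rank(before)
```

A retrieval metric like this is a proxy; the answer-quality numbers quoted above would still need an end-to-end evaluation of generated responses.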
Consider skipping reranking for simple queries. If the first-stage retrieval returns a top result with very high similarity (>0.95), the reranker is unlikely to change the ranking. Route simple queries past the reranker to save latency.
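The routing rule can be as small as a single comparison. This sketch assumes first-stage scores are similarities where higher is better; `should_rerank` is a hypothetical helper, and the 0.95 default simply mirrors the figure above and should be tuned for your embedding model.

```python
from typing import Sequence


def should_rerank(first_stage_scores: Sequence[float], skip_threshold: float = 0.95) -> bool:
    """Return False when the top first-stage hit is already a near-exact match."""
    if not first_stage_scores:
        return False  # nothing was retrieved, so there is nothing to rerank
    return max(first_stage_scores) <= skip_threshold
```

Similarity ranges differ between embedding models, so it is worth checking on an eval set that queries routed past the reranker really do keep the same top result.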
FAQ
Q: If cross-encoders are more accurate, why not use them as the primary retrieval method?
Because they require the query at inference time and cannot pre-compute document representations. To search 10M documents, you would need 10M forward passes per query. At 100ms per pass, that is 1,000,000 seconds, or about 11.6 days of sequential compute per query. Bi-encoders pre-compute document embeddings, so search is just a vector comparison (microseconds each). The two-stage approach gives you the speed of bi-encoders and the accuracy of cross-encoders.
Q: How many candidates should I retrieve for the reranker?
Depends on your first-stage recall. If first-stage recall@50 is 95% (the relevant doc is in the top 50 results 95% of the time), retrieving 50 candidates is appropriate. If recall@50 is only 80%, you need to retrieve more (100+) or fix first-stage retrieval. Practical range: 20-100 candidates. Cost scales linearly with candidate count, so find the minimum that achieves acceptable recall.
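Finding the minimum candidate count can be done by sweeping k over a labeled eval set. The sketch below uses the same definition of recall as this answer (the fraction of queries with a relevant document in the top k). The function names and the eval-set shape are assumptions for illustration.

```python
from typing import Optional, Sequence, Set, Tuple

# One entry per eval query: (first-stage ranked IDs, IDs labeled relevant).
Run = Tuple[Sequence[str], Set[str]]


def hit_rate_at_k(runs: Sequence[Run], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not runs:
        return 0.0
    hits = sum(1 for ranked, relevant in runs if relevant & set(ranked[:k]))
    return hits / len(runs)


def smallest_k_for_recall(runs: Sequence[Run],
                          target: float,
                          candidate_ks: Sequence[int] = (20, 50, 100)) -> Optional[int]:
    """Smallest k in candidate_ks that reaches the target, or None if none does."""
    for k in sorted(candidate_ks):
        if hit_rate_at_k(runs, k) >= target:
            return k
    return None
```

A `None` result is the signal described above: retrieving more candidates will not reach the target, so the fix belongs in first-stage retrieval.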
Q: Can I use a reranker without a vector database? Just BM25 + reranker?
Yes, and this is actually a strong baseline. BM25 retrieves candidates cheaply (no embedding needed), and the reranker provides the semantic understanding. For many use cases, BM25 + cross-encoder reranker performs comparably to vector search + reranker, with simpler infrastructure. The limitation: BM25 cannot find documents with zero vocabulary overlap with the query, even if the reranker would have scored them highly. Hybrid (BM25 + vector) + reranker is the gold standard.
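For the hybrid setup, the BM25 and vector result lists have to be merged into one candidate pool before reranking. This article does not specify a merge method; the sketch below uses Reciprocal Rank Fusion (RRF), a common choice that needs only ranks, not comparable scores. The constant `k=60` is the value conventionally used with RRF, and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked ID lists; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy ranked lists standing in for BM25 and vector search results.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

The fused list then becomes the candidate set passed to the reranker, so documents found by only one of the two retrievers still get a chance to be re-scored.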
Interview questions
Q: Your RAG system retrieves 10 chunks and feeds them to the LLM, but the answers are often based on chunks ranked 5-10 rather than 1-3. The top chunks are topically related but do not actually answer the question. How do you fix this?
Classic case where first-stage retrieval (bi-encoder) provides topic-level matching but not answer-level precision. Solution: add a cross-encoder reranker between retrieval and generation. Retrieve 30-50 candidates (broader pool), rerank them specifically for “does this chunk answer this question” (not just topical similarity), then feed only the top 3-5 reranked chunks to the LLM. Expected improvement: 10-20% better answer accuracy. Implementation: start with Cohere Rerank API for fast integration, measure impact on your eval set, then consider self-hosting for cost optimization at scale.
Q: Design the retrieval pipeline for a legal research tool where precision matters more than recall (lawyers need the most relevant cases, not all potentially relevant cases).
Precision-focused pipeline: (1) First stage: hybrid search (vector + BM25) over case law database, retrieve top-100 candidates. (2) Second stage: strong cross-encoder reranker (bge-reranker-v2-m3 or Cohere Rerank), re-score all 100 candidates, keep top-10. (3) Third stage: LLM-as-judge to evaluate the top-10 for specific relevance to the legal question (“Does this case establish precedent for the specific issue?”), keep top-3. (4) Diversity: apply MMR to ensure the 3 results are not all from the same case or jurisdiction. (5) Present with confidence scores and highlighting of the relevant passages within each case. The multi-stage approach progressively increases precision at each step while remaining computationally feasible.
Q: Compare the cost-effectiveness of improving your embedding model vs adding a reranker. When would you invest in each?
Embedding model improvement (fine-tuning or switching models): improves recall - gets better candidates into the pool. Worth investing when: your first-stage recall is below 85%, your domain is specialized and general models underperform, or you have training data for fine-tuning. Cost: one-time fine-tuning + re-embedding corpus. Reranker addition: improves precision - better ordering of existing candidates. Worth investing when: first-stage recall is already good (>85%) but the top results are not optimally ordered, or you need to compress 10 retrieved chunks to the best 3 for context window efficiency. Cost: per-query inference cost. Decision framework: if your eval shows the relevant document IS in top-20 but NOT in top-3, add a reranker. If the relevant document is NOT in top-20 at all, fix your embedding/retrieval first.
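The decision framework at the end of this answer can be run over an eval set to see which failure mode dominates. This is a minimal sketch: the function names, the labels it returns, and the eval-set shape are assumptions, while the top-20 and top-3 cutoffs come from the answer above.

```python
from collections import Counter
from typing import Sequence, Set, Tuple


def diagnose_query(ranked_ids: Sequence[str], relevant_ids: Set[str],
                   pool_k: int = 20, final_k: int = 3) -> str:
    """Classify one query by where its first relevant document ranks."""
    ranks = [i for i, doc_id in enumerate(ranked_ids) if doc_id in relevant_ids]
    if not ranks or ranks[0] >= pool_k:
        return "fix_retrieval"   # relevant document is not in the candidate pool
    if ranks[0] >= final_k:
        return "add_reranker"    # in the pool, but ranked too low to reach the LLM
    return "ok"


def diagnose_eval_set(runs: Sequence[Tuple[Sequence[str], Set[str]]],
                      pool_k: int = 20, final_k: int = 3) -> Counter:
    """Count how many eval queries fall into each category."""
    return Counter(diagnose_query(ranked, relevant, pool_k, final_k)
                   for ranked, relevant in runs)
```

If most failures land in `add_reranker`, reordering will help; if they land in `fix_retrieval`, the investment belongs in the embedding model or first-stage search.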