Fine-Tuning vs RAG: When to Teach the Model vs When to Show It

A fintech startup wants their LLM to answer questions about their proprietary trading algorithms. They spend 8 weeks fine-tuning GPT-4 on their internal documentation. The model learns the vocabulary and can discuss their systems fluently. Then a major regulatory change happens. Their documentation updates. The fine-tuned model still gives the old answers because the knowledge is frozen in its weights. They need to fine-tune again - another 2 weeks and $15,000 in compute.

Their competitor built a RAG system instead. Same documentation, same model, same questions. When regulations changed, they updated 3 documents in their knowledge base. Ten minutes later, the system gives correct answers about the new regulations. No retraining. No cost spike. Just updated context.

The fintech startup did not make a “wrong” choice in some absolute sense - fine-tuning is powerful and has legitimate use cases. But they chose it for a problem (knowledge that changes frequently) where RAG is strictly superior. Understanding when to use which approach - and when to combine them - is one of the most consequential architecture decisions in AI engineering.

What each approach actually does

RAG (Retrieval-Augmented Generation): At query time, retrieves relevant documents from an external knowledge base and injects them into the model’s context. The model’s weights are unchanged - it receives the knowledge as input tokens.

Fine-tuning: Modifies the model’s weights by training on domain-specific data. The knowledge becomes part of the model itself, encoded in parameters. The model “learns” the information.

The fundamental distinction: RAG externalizes knowledge (stored in a database, accessed at runtime). Fine-tuning internalizes knowledge (stored in weights, always available without retrieval).

graph TD
  subgraph rag["RAG Approach"]
      R1["Knowledge lives in documents"]
      R2["Retrieved at query time"]
      R3["Model uses context to answer"]
      R4["Update: change documents"]
  end
  subgraph ft["Fine-Tuning Approach"]
      F1["Knowledge lives in model weights"]
      F2["Always available (no retrieval)"]
      F3["Model 'knows' the information"]
      F4["Update: retrain the model"]
  end

  style R1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style R4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style F1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style F4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

When to use RAG

RAG is the right choice when:

Knowledge changes frequently. Product documentation, pricing, policies, team information - anything that updates weekly or monthly. RAG lets you update the knowledge base without touching the model.

You need attribution and citations. RAG naturally provides source documents for every answer. Users can verify claims by clicking through to the original source. Fine-tuned models cannot tell you where their knowledge came from.

The knowledge base is large. Millions of documents, entire codebases, years of support tickets. You cannot fine-tune all of this into model weights practically, but you can index it for retrieval.

Accuracy and freshness are critical. In domains like healthcare, legal, or finance where outdated or incorrect information has real consequences, the verifiability of RAG (you can trace every answer to a source document) is essential.

You need to respect access controls. Different users should see different information based on their permissions. RAG can filter retrieval by user permissions. Fine-tuned knowledge is accessible to everyone who uses the model.

Fast iteration is important. RAG lets you improve answers by improving documents - no model training required. Technical writers can improve AI quality directly.

When to use fine-tuning

Fine-tuning is the right choice when:

You need to change the model’s behavior/style, not its knowledge. Making the model respond in a specific tone, format, or communication style. Teaching it your company’s brand voice. Making it follow specific output patterns consistently.

Domain-specific vocabulary and reasoning. Medical terminology, legal argumentation patterns, financial modeling conventions - where the model needs to understand how to reason in your domain, not just what facts exist.

You need to reduce prompt size and cost. A fine-tuned model that “knows” your domain needs shorter prompts (no few-shot examples, no lengthy system instructions). At high volume, this saves significant token costs.

Latency matters and retrieval adds too much. Fine-tuned models respond without the overhead of retrieval (50-200ms). For real-time applications where every millisecond counts, eliminating the retrieval step helps.

Consistent behavior on a narrow task. Classification, extraction, or generation tasks where the model should behave identically every time for a well-defined input type. Fine-tuning produces more consistent outputs than prompt-based approaches.

graph LR
  subgraph decision["Decision Framework"]
      Q1["Does the knowledge change often?"]
      Q2["Do you need citations?"]
      Q3["Is it behavior/style, not facts?"]
      Q4["Is latency critical?"]
      Q5["Do you have 1000+ training examples?"]
  end
  subgraph answers["Recommendation"]
      A1["Yes to Q1 or Q2 → RAG"]
      A2["Yes to Q3, Q4, Q5 → Fine-tune"]
      A3["Yes to both → Combine"]
  end

  style A1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style A2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style A3 fill:#FAEEDA,stroke:#854F0B,color:#633806

The combined approach: fine-tuning + RAG

The best production systems often use both:

Fine-tune for behavior - teach the model your output format, reasoning style, domain vocabulary, and tone. A fine-tuned model that understands medical terminology processes retrieved medical documents better than a general model.

RAG for knowledge - provide specific facts, current information, and citable sources at inference time. The fine-tuned model reads and synthesizes the retrieved context more effectively because it understands the domain.

Fine-tuned model (knows HOW to discuss your domain)
  + RAG context (provides WHAT to discuss for this specific query)
  = Best of both worlds

Example: A customer support fine-tuned model knows your company’s communication style, escalation procedures, and technical vocabulary. RAG provides the specific knowledge base articles relevant to each customer’s question.

Cost comparison

RAG costs

Component	One-time	Per-query
Embedding documents	$50-500 (depends on corpus size)	$0
Vector database	$0 (setup)	$50-500/month (hosting)
Embedding queries	$0	$0.0001 per query
Extra context tokens	$0	$0.003-0.01 per query
Total at 100K queries/month	$50-500	$350-1500/month

Fine-tuning costs

Component	One-time	Per-query
Training data preparation	20-40 hours labor	$0
Fine-tuning compute	$500-50,000	$0
Model hosting (if custom)	$0	$1000-5000/month
Retraining (monthly)	$500-5000/month	$0
Total at 100K queries/month	$5,000-50,000	$1000-5000/month

Fine-tuning has higher upfront cost but potentially lower per-query cost (shorter prompts, no retrieval overhead). RAG has lower upfront cost but ongoing per-query costs for retrieval and extra context tokens.

Where each approach breaks

RAG failure modes

Retrieval misses - the answer exists but retrieval does not find it
Context window limits - cannot fit all relevant information
Hallucination despite context - model ignores retrieved docs and invents answers
Stale embeddings - documents change but embeddings are not updated
Latency overhead - retrieval adds 50-200ms per request

Fine-tuning failure modes

Knowledge staleness - model answers reflect training data, not current state
Catastrophic forgetting - fine-tuning on narrow data degrades general capabilities
Hallucinated specifics - model “remembers” training data imprecisely
No attribution - cannot trace answers to specific sources
Expensive iteration - each improvement requires retraining

Real-world approaches

ChatGPT - base model is fine-tuned for conversation + RAG via browsing tool for current information
GitHub Copilot - fine-tuned for code completion + RAG over the user’s repository for project-specific context
Perplexity - base model with RAG over web search results (no fine-tuning)
Bloomberg GPT - heavily fine-tuned on financial data for domain expertise
Notion AI - general model + RAG over workspace content (no fine-tuning)
Medical AI assistants - fine-tuned on medical literature + RAG over drug databases and clinical guidelines

How to apply in practice

Default to RAG. For most applications, RAG is the faster, cheaper, and more maintainable starting point. You can ship a RAG system in days. Fine-tuning takes weeks.

Fine-tune only when you have clear evidence that RAG is insufficient. Specific signals: the model consistently struggles with your domain vocabulary, outputs are formatted inconsistently despite detailed prompts, or prompt engineering has hit diminishing returns.

Prepare fine-tuning data from RAG failures. Use your RAG system’s logs to identify queries where the model gives bad answers despite having good retrieved context. These are exactly the examples that fine-tuning should address - cases where the model needs to learn better reasoning patterns, not more facts.

Never fine-tune for knowledge alone. If the only problem is “the model does not know X,” RAG is always the answer. Fine-tuning for pure knowledge injection is expensive, fragile, and requires retraining when anything changes.

Budget for retraining. If you fine-tune, plan for quarterly or monthly retraining cycles. Your domain evolves, user needs change, and model drift accumulates. A fine-tuned model without a retraining pipeline is a depreciating asset.

FAQ

Q: Can I fine-tune a model to be better at using RAG context?

Yes, and this is one of the most effective uses of fine-tuning. Train the model on (query, context, ideal_answer) triples where it learns to synthesize retrieved context effectively, cite sources correctly, and say “I don’t know” when the context does not contain the answer. This combines the strengths of both approaches: RAG provides the knowledge, fine-tuning teaches the model how to use it optimally.

Q: How much training data do I need for fine-tuning?

Depends on the task scope. For narrow behavior changes (output format, tone): 50-200 examples. For domain adaptation (medical, legal): 500-2000 examples. For teaching complex reasoning patterns: 2000-10000 examples. Quality matters more than quantity - 200 carefully curated examples outperform 2000 noisy ones. OpenAI recommends a minimum of 50 examples to see improvement, with diminishing returns past 500-1000 for most tasks.

Q: My RAG system works well 90% of the time but fails on complex multi-hop questions that require combining information from multiple documents. Would fine-tuning help?

Possibly, but try these RAG improvements first: (1) multi-hop retrieval (iterative retrieval based on intermediate reasoning), (2) chain-of-thought prompting that explicitly guides the model through multi-document synthesis, (3) agentic retrieval where the model decides what else to search for. If these do not help, fine-tuning on multi-hop Q&A examples (showing the model how to combine information from multiple sources) can improve synthesis capability without changing the retrieval approach.

Interview questions

Q: A healthcare company wants to build an AI assistant that helps doctors with diagnosis suggestions based on patient symptoms and medical literature. Should they use RAG, fine-tuning, or both? Justify your architecture.

Both. Fine-tune for: medical reasoning patterns (how to weigh symptoms, when to consider differential diagnoses, appropriate hedging/disclaimers). This teaches the model HOW to think medically. RAG for: current drug interactions, latest clinical guidelines, rare disease information, hospital-specific protocols. This provides WHAT to base recommendations on. Critical requirements: (1) RAG must include citation to specific medical literature (doctors need to verify), (2) fine-tuning must not reduce the model’s ability to say “insufficient evidence” (catastrophic forgetting risk), (3) RAG knowledge base must be updated as guidelines change (monthly for some specialties), (4) the system must clearly indicate confidence level. Fine-tuning alone fails because medical knowledge updates constantly. RAG alone fails because general models lack the reasoning patterns needed for clinical decision-making.

Q: You are building a customer support bot. You have 50,000 resolved support tickets and a 200-page product documentation wiki. Design the approach.

Start with RAG over the documentation wiki (the authoritative knowledge source). Chunk documentation into FAQ-like segments, embed, and build retrieval. For the 50,000 tickets: do not fine-tune on raw tickets (too noisy). Instead: (1) extract high-quality Q&A pairs from resolved tickets (curator reviews top 1000), (2) use these as few-shot examples in prompts OR (3) fine-tune if prompt-based approach plateaus and you have 500+ curated examples. Architecture: RAG for “what is the answer” (documentation retrieval) + optional fine-tuning for “how to respond” (tone, escalation patterns, format). Measure: track resolution rate and escalation rate. If RAG alone achieves >80% resolution, fine-tuning’s marginal improvement may not justify the maintenance cost.

Q: Your fine-tuned model gives great answers about company products but sometimes “forgets” general knowledge (basic math, common sense reasoning). What happened and how do you fix it?

Catastrophic forgetting: fine-tuning on narrow domain data caused the model to lose some general capabilities. Fixes: (1) Use LoRA or other parameter-efficient fine-tuning (fewer parameters changed = less forgetting). (2) Mix general-purpose data into your fine-tuning dataset (10-20% general instruction-following examples alongside your domain data). (3) Reduce training duration - over-training on domain data increases forgetting. (4) Use a larger base model (more parameters = more capacity to retain general knowledge while learning domain knowledge). (5) For critical general capabilities (math), verify them in your eval suite and stop fine-tuning when they degrade below threshold. Prevention is easier than cure - monitor general benchmarks during fine-tuning and stop early if they decline.