RAG Architecture: Grounding LLMs in Your Data

Your company deploys an internal chatbot powered by GPT-4. Employees ask it questions about company policies, product features, and technical documentation. The answers sound confident and well-structured. They are also wrong 30% of the time. The model hallucinates a parental leave policy that does not exist. It describes a product feature that was deprecated two years ago. It cites an internal document that was never written.

The model is not broken - it is working exactly as designed. It generates plausible text based on its training data, which contains nothing about your company’s specific policies, products, or documentation. You need to inject your knowledge into the model’s generation process without retraining it.

This is what RAG solves. Instead of hoping the model knows the answer, you retrieve the relevant information from your knowledge base and feed it directly into the prompt. The model generates its answer using your documents as source material. It goes from “I will invent something plausible” to “I will answer based on what these documents say.”

What RAG actually is

Retrieval-Augmented Generation (RAG) is an architecture pattern in which an LLM’s response is grounded in externally retrieved context: the system retrieves relevant documents from a knowledge base, inserts them into the prompt, and the model generates its response from that context.

The three core components:

Retrieval - Find relevant documents for the query
Augmentation - Inject retrieved documents into the model’s context
Generation - Model produces a response using both its parametric knowledge and the retrieved context

graph TD
  Q["User Query"] --> E["Embed Query"]
  E --> VS["Vector Search"]
  KB["Knowledge Base<br/>(chunked + embedded)"] --> VS
  VS --> R["Top-K Relevant Chunks"]
  R --> P["Augmented Prompt<br/>= System + Chunks + Query"]
  P --> LLM["LLM Generation"]
  LLM --> A["Grounded Answer<br/>(with citations)"]

  style Q fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style KB fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style R fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style LLM fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style A fill:#FAEEDA,stroke:#854F0B,color:#633806

The RAG pipeline step by step

Step 1: Ingestion (offline)

Before any user query, you prepare your knowledge base:

Collect documents - policies, docs, wikis, code, tickets, whatever your users need to ask about
Chunk documents - split into meaningful segments (paragraphs, sections, or semantic units)
Embed chunks - convert each chunk into a vector using an embedding model
Store in vector database - index embeddings for fast similarity search
Store metadata - source URL, last updated date, section title, access permissions
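
The ingestion steps above can be sketched in a few lines. This is a minimal sketch under loud assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a small count vector), the `Chunk` class and `ingest` helper are illustrative names, and the returned list stands in for a vector database.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy vector size; real embedding models use far more dimensions

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)

def ingest(doc_text: str, metadata: dict) -> list[Chunk]:
    """Split on blank lines (paragraphs), embed each chunk, and attach metadata."""
    paragraphs = [p.strip() for p in doc_text.split("\n\n") if p.strip()]
    return [Chunk(p, toy_embed(p), dict(metadata)) for p in paragraphs]

# The list returned here plays the role of the vector database index.
index = ingest(
    "Employees get paid parental leave.\n\nLeave requests go through the HR portal.",
    {"source": "leave-policy", "updated": "2024-03-15"},
)
print(len(index), index[0].metadata)
```

Swapping `toy_embed` for a real model and the list for a real index changes the calls but not the shape of the pipeline: every chunk carries its text, its vector, and its metadata.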

Step 2: Query processing (online)

When a user asks a question:

Embed the query - convert user’s question to a vector using the same embedding model
Search - find the k most similar chunk vectors (nearest neighbor search)
Rerank (optional) - use a cross-encoder to re-score the top candidates for more precise relevance
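Filter - apply metadata filters (date range, access permissions, source type); where the vector database supports it, apply them during the search so that excluded chunks do not take up top-k slots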
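
A minimal sketch of the search and filter steps, assuming the chunks are already embedded. The three-dimensional vectors, the `allowed_groups` metadata field, and the `search` helper are illustrative assumptions; a real vector database would run the nearest neighbor search itself rather than this brute-force loop.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec: list[float], chunks: list[dict], user_groups: set, k: int = 3) -> list[dict]:
    """Brute-force top-k search over the chunks this user is allowed to see."""
    visible = [c for c in chunks if c["allowed_groups"] & user_groups]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c["vector"]), reverse=True)
    return ranked[:k]

# Made-up chunks with tiny vectors, purely for illustration.
chunks = [
    {"id": "leave-1", "vector": [0.9, 0.1, 0.0], "allowed_groups": {"all"}},
    {"id": "salary-1", "vector": [0.8, 0.2, 0.1], "allowed_groups": {"hr"}},
    {"id": "vpn-1", "vector": [0.0, 0.1, 0.9], "allowed_groups": {"all"}},
]

results = search([1.0, 0.0, 0.0], chunks, user_groups={"all"}, k=2)
print([c["id"] for c in results])
```

Because the permission filter runs before ranking, the restricted `salary-1` chunk never competes for a top-k slot for a user outside the `hr` group.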

Step 3: Augmentation (online)

Construct the prompt:

System: You are a helpful assistant. Answer questions using ONLY the provided context.
If the answer is not in the context, say "I don't have information about that."

Context:
---
[Chunk 1: Company leave policy, updated 2024-03-15]
Employees are entitled to 16 weeks of paid parental leave...
---
[Chunk 2: HR FAQ, updated 2024-01-20]
Parental leave applies to all full-time employees after 6 months...
---

Question: What is our parental leave policy?
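
Assembling a prompt like the one above is plain string formatting. The sketch below is one possible way to do it; the `build_prompt` name and the `title`, `updated`, and `text` keys are assumptions made for illustration.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the augmented prompt: grounding instructions, labeled context, question."""
    system = (
        "You are a helpful assistant. Answer questions using ONLY the provided context.\n"
        'If the answer is not in the context, say "I don\'t have information about that."'
    )
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # Label each chunk with its source and date so the model can cite it.
        header = f"[Chunk {i}: {chunk['title']}, updated {chunk['updated']}]"
        blocks.append(f"{header}\n{chunk['text']}")
    context = "\n---\n".join(blocks)
    return f"System: {system}\n\nContext:\n---\n{context}\n---\n\nQuestion: {question}"

prompt = build_prompt(
    "What is our parental leave policy?",
    [
        {"title": "Company leave policy", "updated": "2024-03-15",
         "text": "Employees are entitled to 16 weeks of paid parental leave..."},
        {"title": "HR FAQ", "updated": "2024-01-20",
         "text": "Parental leave applies to all full-time employees after 6 months..."},
    ],
)
print(prompt)
```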

Step 4: Generation (online)

The model generates a response grounded in the provided context. Because the relevant documents are right there in the prompt, the model can draw on specific details instead of inventing them. Grounding reduces hallucination; it does not eliminate it.

graph LR
  subgraph ingestion["Ingestion Pipeline (Offline)"]
      D["Documents"] --> CH["Chunking"]
      CH --> EM["Embedding"]
      EM --> VDB["Vector DB"]
  end
  subgraph query["Query Pipeline (Online)"]
      UQ["User Query"] --> QE["Query Embedding"]
      QE --> SS["Similarity Search"]
      VDB --> SS
      SS --> RR["Reranking"]
      RR --> AG["Prompt Assembly"]
      AG --> GEN["LLM Generation"]
      GEN --> ANS["Answer + Citations"]
  end

  style D fill:#F1EFE8,stroke:#888780,color:#444441
  style VDB fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style UQ fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style ANS fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Where RAG breaks

Retrieval failure

The most common RAG failure: the right document exists in your knowledge base but the retrieval step does not find it. Causes:

Vocabulary mismatch - user says “PTO” but the document says “paid time off”
Query too vague - “how does billing work?” matches too many documents and the relevant one is ranked low
Chunk boundary issues - the answer spans two chunks and neither alone is sufficient
Embedding model weakness - the model does not capture the semantic relationship between query and document

Retrieved but not used

The model ignores retrieved context and relies on its parametric knowledge instead. This happens when:

The context is too long, so details in the middle of the prompt get less attention than those at the start or end (lost in the middle)
The instruction to use context is too weak
The model’s parametric knowledge contradicts the retrieved context (and the model trusts itself more)

Stale knowledge

Your knowledge base was indexed 3 months ago. Company policies changed last week. The RAG system confidently cites outdated information because it is still in the vector database. You need an update pipeline.

Hallucinated citations

The model generates an answer and “cites” a document that does not exist, or attributes information to the wrong retrieved chunk. This is especially insidious because it looks authoritative.
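
Citations to chunks that were never retrieved are the easy half of this problem to catch. The sketch below assumes the prompt asked the model to cite chunks with numeric markers such as `[1]`; that convention and the `invalid_citations` helper are assumptions, not something the pipeline above prescribes.

```python
import re

def invalid_citations(answer: str, num_chunks: int) -> list[int]:
    """Return cited chunk numbers that do not correspond to any retrieved chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)

# Only two chunks were retrieved, so a citation of [4] cannot be real.
answer = "Parental leave is 16 weeks [1]. It also covers contractors [4]."
print(invalid_citations(answer, num_chunks=2))
```

This check does not detect a claim attributed to the wrong retrieved chunk; that needs a comparison between each cited sentence and the chunk text.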

Cross-document reasoning

A question that requires combining information from multiple documents is harder. “How does our parental leave policy differ between US and UK offices?” requires retrieving and synthesizing from two separate policy documents. Simple top-k retrieval might only find one.

Advanced RAG patterns

Hybrid search

Combine vector similarity search with keyword search (BM25). Vector search catches semantic matches; keyword search catches exact term matches. Merge results with reciprocal rank fusion:

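# Retrieve more candidates than needed from each retriever, then fuse the rankings
vector_results = vector_db.search(query_embedding, top_k=20)   # semantic matches
keyword_results = bm25_search(query_text, top_k=20)            # exact term matches
merged = reciprocal_rank_fusion(vector_results, keyword_results)
final = merged[:5]  # keep the top 5 fused results for the prompt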
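
The fusion step itself is short. The sketch below is one way to write reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs (the snippet above may pass richer result objects). The constant `k = 60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*rankings: list[str], k: int = 60) -> list[str]:
    """Merge ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Made-up rankings from the two retrievers.
vector_results = ["doc_a", "doc_b", "doc_c"]
keyword_results = ["doc_c", "doc_a", "doc_d"]
merged = reciprocal_rank_fusion(vector_results, keyword_results)
print(merged)
```

Documents that appear in both lists rise to the top. Only ranks are used, so the incompatible score scales of vector search and BM25 never need to be normalized.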

Query transformation

Rewrite the user’s query before retrieval to improve recall:

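# HyDE (Hypothetical Document Embeddings): search with an embedded draft answer instead of the raw query
hypothetical_answer = llm.generate(f"Write a short answer to: {query}")
search_embedding = embed(hypothetical_answer)  # embed the hypothetical answer, not the query
results = vector_db.search(search_embedding)   # the draft may be wrong; it only needs to resemble real answers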

Multi-hop retrieval

For questions requiring information from multiple sources, retrieve iteratively:

First retrieval based on the original query
Generate an intermediate answer
Identify what information is still missing
Second retrieval for the missing pieces
Generate final answer with all context
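
The loop above can be sketched with the model and retriever passed in as plain functions. Everything here is an assumption made for illustration: `find_gap` folds the "intermediate answer" and "what is missing" steps into one call that returns a follow-up query or `None`, and the toy helpers use made-up documents so the loop can run without an LLM.

```python
from typing import Callable, Optional

def multi_hop_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    find_gap: Callable[[str, list[str]], Optional[str]],
    generate: Callable[[str, list[str]], str],
    max_hops: int = 3,
) -> str:
    """Retrieve, check what is still missing, retrieve again, then answer."""
    context = list(retrieve(question))
    for _ in range(max_hops - 1):
        follow_up = find_gap(question, context)  # None means nothing is missing
        if follow_up is None:
            break
        context += [c for c in retrieve(follow_up) if c not in context]
    return generate(question, context)

# Toy stand-ins (made-up data) so the loop runs without a model or vector database.
DOCS = {
    "us": "US offices: example parental leave text.",
    "uk": "UK offices: example parental leave text.",
}

def toy_retrieve(query: str) -> list[str]:
    """Top-1 retrieval: return the first document whose key appears in the query."""
    words = query.lower().replace("?", "").split()
    for key, text in DOCS.items():
        if key in words:
            return [text]
    return []

def toy_find_gap(question: str, context: list[str]) -> Optional[str]:
    """Return a follow-up query for the first office the context does not cover."""
    words = question.lower().replace("?", "").split()
    for key in DOCS:
        if key in words and not any(c.startswith(key.upper()) for c in context):
            return f"{key} parental leave"
    return None

answer = multi_hop_answer(
    "How does parental leave differ between US and UK offices?",
    toy_retrieve,
    toy_find_gap,
    generate=lambda q, ctx: " ".join(ctx),
)
print(answer)
```

The `max_hops` cap matters in practice: each hop costs another retrieval and another model call, so the loop needs a hard stop.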

Parent document retrieval

Store small chunks for precise retrieval but return the parent document (or larger section) for context. This solves the chunk boundary problem:

Index: small chunks (200 tokens) for retrieval precision
Return: parent chunk (1000 tokens) for generation context
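
A minimal sketch of the lookup step, assuming each small chunk stores the ID of the parent section it was cut from. The `parent_id` field, the in-memory stores, and the scores are hypothetical placeholders.

```python
# Hypothetical in-memory stores; a real system would use a vector DB and a document store.
parents = {
    "leave-policy#eligibility": "Full eligibility section text (about 1000 tokens)...",
    "leave-policy#duration": "Full duration section text (about 1000 tokens)...",
}

# Ranked hits from searching the small chunks; each records its parent.
child_hits = [
    {"text": "after 6 months of service", "parent_id": "leave-policy#eligibility", "score": 0.91},
    {"text": "full-time employees only", "parent_id": "leave-policy#eligibility", "score": 0.88},
    {"text": "16 weeks of paid leave", "parent_id": "leave-policy#duration", "score": 0.85},
]

def expand_to_parents(hits: list[dict], parent_store: dict[str, str]) -> list[str]:
    """Map ranked child chunks to their parents, keeping rank order and dropping duplicates."""
    seen: set[str] = set()
    expanded = []
    for hit in hits:
        parent_id = hit["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            expanded.append(parent_store[parent_id])
    return expanded

context = expand_to_parents(child_hits, parents)
print(len(context))
```

Deduplication matters here: several small chunks often point at the same parent, and sending that parent twice wastes context budget.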

Self-RAG

The model decides whether it needs retrieval, retrieves if needed, and evaluates whether the retrieved content is relevant before using it. More autonomous but more complex.

Real-world RAG systems

Perplexity - web-scale RAG that retrieves from the internet, summarizes sources, and generates cited answers
GitHub Copilot Chat - RAG over the user’s codebase (indexed locally) to answer questions about specific projects
Notion AI - RAG over workspace content to answer questions about company knowledge
Amazon Q Business - enterprise RAG connecting to 40+ data sources with permission-aware retrieval
ChatGPT with browsing - retrieves from the web when the model detects it needs current information

How to apply RAG in practice

Start simple. Basic RAG (embed, search, generate) gets you most of the value. Add complexity (reranking, hybrid search, query transformation) only when your eval shows specific retrieval failures.

Measure retrieval separately from generation. If the model gives bad answers, is it because retrieval failed (wrong documents) or generation failed (right documents but bad synthesis)? Different problems, different fixes.
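
One simple retrieval-only metric is hit rate at k: the share of test queries for which at least one known-relevant chunk appears in the top k results. The sketch below assumes you have a small labeled set mapping queries to relevant chunk IDs; the IDs and the `hit_rate_at_k` name are made up for illustration.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    hits = sum(
        1 for query, ranked in retrieved.items()
        if relevant[query] & set(ranked[:k])
    )
    return hits / len(retrieved)

# Made-up eval data: ranked chunk IDs per query, and the chunks labeled as relevant.
retrieved = {"q1": ["c3", "c7", "c1"], "q2": ["c2", "c9", "c4"]}
relevant = {"q1": {"c1"}, "q2": {"c5"}}

print(hit_rate_at_k(retrieved, relevant, k=3))
```

If this number is low, no prompt change will fix the answers, because the right text never reaches the model. If it is high and answers are still wrong, the problem is on the generation side.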

Chunk with intent. Do not blindly split at 500 tokens. Chunk at natural boundaries (sections, paragraphs, FAQ entries). Each chunk should be self-contained enough to answer a question without needing adjacent chunks.
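
A minimal sketch of boundary-aware chunking, assuming paragraphs are separated by blank lines and using word count as a rough proxy for tokens. The `chunk_by_paragraph` name and the `max_words` default are illustrative assumptions.

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks up to max_words; a longer paragraph stays whole."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        # Close the current chunk rather than split a paragraph across two chunks.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "a b c\n\nd e\n\nf g h i"
print(chunk_by_paragraph(sample, max_words=5))
```

A paragraph longer than `max_words` becomes its own oversized chunk in this sketch; a fuller version would fall back to sentence boundaries for those.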

Include metadata. Store and return metadata with chunks: source title, URL, last updated date, section heading. This enables filtering and citation.

Instruct the model explicitly. “Answer ONLY based on the provided context. If the information is not in the context, say so.” Without this instruction, models happily supplement with parametric knowledge (which may be wrong for your domain).

Plan for updates. Documents change. Build an incremental ingestion pipeline that detects changes, re-chunks modified documents, re-embeds them, and updates the vector database. Daily or real-time, depending on how often your knowledge base changes.
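
Change detection can be as simple as comparing content hashes. The sketch below assumes you persist one hash per document ID from the previous run; the `plan_updates` name and the sample documents are made up for illustration.

```python
import hashlib

def sha(text: str) -> str:
    """Content checksum used to detect modified documents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare content hashes to decide which documents to re-embed or delete."""
    new_hashes = {doc_id: sha(text) for doc_id, text in current_docs.items()}
    to_upsert = [d for d, h in new_hashes.items() if indexed_hashes.get(d) != h]
    to_delete = [d for d in indexed_hashes if d not in new_hashes]
    return to_upsert, to_delete, new_hashes

# Hashes stored at the last indexing run (made-up documents).
indexed_hashes = {
    "leave-policy": sha("old policy text"),
    "vpn-guide": sha("use the VPN"),
    "old-faq": sha("retired page"),
}
# Documents as they exist now: one changed, one unchanged, one new, one removed.
current_docs = {
    "leave-policy": "new policy text",
    "vpn-guide": "use the VPN",
    "expenses": "expense guide text",
}

to_upsert, to_delete, new_hashes = plan_updates(current_docs, indexed_hashes)
print(to_upsert, to_delete)
```

Only the documents in `to_upsert` need re-chunking and re-embedding, and the chunks of everything in `to_delete` should be removed from the index so stale text cannot be retrieved.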

FAQ

Q: RAG vs fine-tuning - when do I use which?

RAG when: knowledge changes frequently, you need citations/attribution, you need to add new information without retraining, or the knowledge base is large (millions of documents). Fine-tuning when: you need the model to learn a specific style/format, domain vocabulary, or reasoning patterns that persist across all queries. For most applications, RAG is the right starting point. Fine-tuning complements RAG for style adaptation but rarely replaces it for knowledge injection.

Q: How many chunks should I retrieve? What is the right k?

Start with 3-5 chunks. More chunks provide more coverage but dilute attention and increase cost/latency. The optimal k depends on your chunk size, context window budget, and query complexity. For factual lookups (one chunk likely has the answer): k=3. For synthesis questions (answer requires multiple sources): k=5-8. For exploratory questions (breadth over depth): k=8-12 with summarization. Always measure: if increasing k does not improve your eval scores, you are wasting tokens.

Q: My RAG system gives correct answers but users do not trust them. How do I improve trust?

Add citations. Show which documents the answer came from with links to the source. Display relevant quotes from the source material alongside the answer. Add confidence indicators - if the retrieval score is low, explicitly say “I found limited information about this.” Let users verify by clicking through to the original document. Trust comes from transparency and verifiability.
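
A minimal sketch of the presentation layer, assuming each retrieved source carries a `title`, a `url`, and a retrieval `score` between 0 and 1. The 0.5 threshold and the example.com URL are placeholders; a real threshold has to be calibrated against your own retriever.

```python
def answer_with_confidence(answer: str, sources: list[dict], min_score: float = 0.5) -> str:
    """Append source links, plus a caveat when the best retrieval score is low."""
    best = max((s["score"] for s in sources), default=0.0)
    lines = [answer]
    if best < min_score:
        lines.append("Note: I found limited information about this.")
    lines += [f"Source: {s['title']} ({s['url']})" for s in sources]
    return "\n".join(lines)

# Hypothetical source with a weak retrieval score.
sources = [{"title": "HR FAQ", "url": "https://example.com/hr-faq", "score": 0.32}]
print(answer_with_confidence("Parental leave is 16 weeks.", sources))
```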

Interview questions

Q: Design a RAG system for a company with 50,000 internal documents (policies, technical docs, meeting notes). New documents are added daily.

Architecture:
(1) Ingestion: watch document sources (Confluence, SharePoint, Google Drive) for changes via webhooks. Chunk new/modified documents (500-800 token chunks with overlap). Embed using a domain-appropriate model. Store in a managed vector database (Pinecone/Qdrant) with metadata (source, date, author, permissions).
(2) Query: embed the user query, restrict candidates to documents the user is permitted to see, run hybrid search (vector + BM25 over chunk text and metadata fields), and rerank the top 20 down to the top 5.
(3) Generation: structured prompt with context, explicit grounding instructions, and citation requirements.
(4) Freshness: incremental re-indexing pipeline runs hourly. Document change detection uses checksums. Deleted documents are removed from the index.
(5) Quality: log queries with low retrieval confidence for review. Track “I don’t know” responses as signals of knowledge gaps.

Q: Your RAG system retrieves the right documents but the model’s answers are still incorrect. The documents clearly contain the correct information. Diagnose the problem.

Multiple possible causes:
(1) Lost in the middle - correct info is in chunk 3 of 5, model attends to chunks 1 and 5 only. Fix: reorder chunks by relevance, put most relevant first.
(2) Conflicting information - another chunk contains outdated or contradictory info that confuses the model. Fix: deduplicate, add recency weighting, include document dates in context.
(3) Weak grounding instruction - model supplements with parametric knowledge that overrides retrieved context. Fix: stronger instruction: “Answer ONLY from the provided documents. Quote relevant passages.”
(4) Chunk too fragmented - the answer requires combining information scattered across chunks, and the model fails to synthesize. Fix: use larger chunks or parent-document retrieval.
(5) The generation model itself struggles with the domain’s reasoning patterns. Fix: try a more capable model or add chain-of-thought instructions.

Q: Compare naive RAG (embed, search, generate) with advanced RAG (hybrid search, reranking, query transformation). When is the additional complexity justified?

Naive RAG is sufficient when: documents are well-written and self-contained, user queries are clear and specific, the vocabulary overlaps well between queries and documents, and accuracy requirements are moderate (80-85% is acceptable). Advanced RAG is justified when: users ask vague or complex questions, domain vocabulary is specialized (medical, legal), accuracy requirements are high (>90%), the knowledge base is large (>10K documents) with many similar-but-different entries, or you see specific retrieval failures in your eval. The decision should be data-driven: run your eval on naive RAG, identify failure categories, and add complexity that specifically addresses those failures. Do not add reranking because it sounds good - add it because your eval shows retrieval precision is the bottleneck.