Hallucination Detection: Catching AI Fabrications Before Users See Them


Your legal AI assistant cites “Smith v. Johnson, 2019, 5th Circuit” to support its analysis. The lawyer trusts the citation and includes it in a brief. The case does not exist: the model fabricated a plausible-sounding citation from patterns in its training data, and the lawyer now faces sanctions for citing non-existent precedent. This is not a rare edge case. Studies have reported LLMs hallucinating citations 15-30% of the time in legal and academic contexts.

Hallucination detection does not eliminate hallucinations (that requires architectural changes to LLMs themselves). It identifies them after generation so you can filter, flag, or regenerate before the output reaches users.

Types of hallucinations

graph TD
  subgraph types["Hallucination Types"]
      FC["Factual Fabrication
'Paris has 5 million people'
(incorrect specific claim)"]
      CC["Confabulated Citation
'According to Smith 2019...'
(source doesn't exist)"]
      EC["Entity Confusion
'CEO John Smith said...'
(wrong person attributed)"]
      IC["Intrinsic Contradiction
'It costs $10' then 'the $15 fee'
(self-contradictory)"]
      UC["Unsupported Inference
'Therefore users prefer X'
(not supported by context)"]
  end

  style FC fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style CC fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
  style EC fill:#FAEEDA,stroke:#854F0B,color:#633806
  style IC fill:#FAEEDA,stroke:#854F0B,color:#633806
  style UC fill:#EEEDFE,stroke:#534AB7,color:#3C3489

Detection approaches

Context-grounded verification

For RAG systems, check if claims are supported by retrieved context:

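ENTAILMENT_THRESHOLD = 0.7  # minimum entailment score for a claim to count as supported

def check_groundedness(response, context_docs):
    # extract_claims and entailment_model are assumed to be defined elsewhere
    claims = extract_claims(response)  # NLI or LLM-based extraction
    if not claims:
        # Nothing to verify: return early to avoid dividing by zero below
        return {"hallucination_rate": 0.0, "unsupported_claims": []}

    results = []
    for claim in claims:
        # A claim is supported if at least one source document entails it
        supported = any(
            entailment_model.check(doc, claim) > ENTAILMENT_THRESHOLD
            for doc in context_docs
        )
        results.append({"claim": claim, "supported": supported})

    unsupported = [r for r in results if not r["supported"]]
    return {
        "hallucination_rate": len(unsupported) / len(results),
        "unsupported_claims": unsupported,
    }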
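
To try the idea end to end without a trained model, the two helpers can be stubbed out. The sketch below is a stand-in under loud assumptions: `extract_claims` here just splits on sentence punctuation, `overlap_score` replaces the entailment model with a token-overlap ratio, and `groundedness_report` is a hypothetical name. Token overlap is not real NLI (it cannot detect negation or paraphrase), so treat this as scaffolding for wiring up the pipeline, not as a detector.

```python
import re

def extract_claims(response):
    """Toy claim extractor (assumption): treat each sentence as one claim."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def overlap_score(doc, claim):
    """Stand-in for an entailment score: share of claim tokens found in doc."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    doc_tokens = set(re.findall(r"\w+", doc.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & doc_tokens) / len(claim_tokens)

def groundedness_report(response, context_docs, threshold=0.7):
    """Flag claims that no context document supports above the threshold."""
    claims = extract_claims(response)
    unsupported = [
        c for c in claims
        if not any(overlap_score(d, c) > threshold for d in context_docs)
    ]
    rate = len(unsupported) / len(claims) if claims else 0.0
    return {"hallucination_rate": rate, "unsupported_claims": unsupported}

# Made-up documents and response, for illustration only
docs = [
    "The basic plan costs 10 dollars per month.",
    "Refunds are issued within 14 days.",
]
report = groundedness_report(
    "The basic plan costs 10 dollars per month. Support is available 24 hours a day.",
    docs,
)
print(report)
```

In this example the pricing sentence is fully covered by the first document, while the support-hours sentence shares no tokens with either document and is reported as unsupported.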

Self-consistency checking

Generate multiple responses and check for agreement:

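def self_consistency_check(query, n=3):
    # Sample at a non-zero temperature so that uncertain answers can diverge
    responses = [llm.generate(query, temperature=0.7) for _ in range(n)]
    # Assumes extract_claims returns normalized, hashable claims (e.g. strings)
    claims_per_response = [set(extract_claims(r)) for r in responses]

    # Claims that appear in all responses are more likely true
    consistent_claims = set.intersection(*claims_per_response)
    # Claims that appear in some responses but not all are suspicious
    inconsistent_claims = set.union(*claims_per_response) - consistent_claims

    return {"confident": consistent_claims, "suspicious": inconsistent_claims}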
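
In practice, sampled responses rarely phrase the same claim identically, so exact matching marks almost everything as inconsistent. A minimal sketch of approximate matching follows, assuming Jaccard similarity over word tokens is a good enough proxy for "same claim"; `normalize`, `jaccard`, `split_by_agreement`, and the 0.8 cutoff are hypothetical choices, and a production system would more likely compare claims with an NLI or embedding model.

```python
import re

def normalize(claim):
    """Reduce a claim to its set of lowercase word tokens (a crude proxy)."""
    return frozenset(re.findall(r"\w+", claim.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def split_by_agreement(claims_per_response, min_similarity=0.8):
    """Split the first response's claims by whether every other response
    contains a sufficiently similar claim."""
    reference, *others = claims_per_response
    confident, suspicious = [], []
    for claim in reference:
        tokens = normalize(claim)
        agreed = all(
            any(jaccard(tokens, normalize(other)) >= min_similarity for other in resp)
            for resp in others
        )
        (confident if agreed else suspicious).append(claim)
    return {"confident": confident, "suspicious": suspicious}

# Made-up claim lists standing in for three sampled responses
samples = [
    ["The tower is 330 metres tall.", "It opened in 1889."],
    ["The tower is 330 metres tall.", "It opened in 1887."],
    ["The tower is 330 metres tall", "It opened in 1889."],
]
result = split_by_agreement(samples)
print(result)
```

The height claim survives a punctuation difference, while the opening-year claim is flagged because one sample disagrees on the year.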

LLM-as-judge for faithfulness

judge_prompt = """
Given the source documents and the AI response, identify any claims in the response
that are NOT supported by the source documents.

Source documents: {context}
AI response: {response}

List unsupported claims (or say "None" if all claims are supported):
"""

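The prompt above returns free text, so the caller still has to fill in the template and parse the reply. A minimal sketch, assuming the judge answers with either "None" or one claim per line: `JUDGE_TEMPLATE` is an abbreviated copy of the prompt above, and `call_judge`, `parse_unsupported`, `judge_faithfulness`, and `fake_judge` are hypothetical names, not part of any library.

```python
import re

JUDGE_TEMPLATE = (
    "Source documents: {context}\n"
    "AI response: {response}\n\n"
    'List unsupported claims (or say "None" if all claims are supported):\n'
)

def parse_unsupported(judge_output):
    """Turn the judge's free-text reply into a list of claim strings."""
    text = judge_output.strip()
    if not text or text.lower().rstrip(".") == "none":
        return []
    claims = []
    for line in text.splitlines():
        # Strip common list markers such as "-", "*", "1." or "2)"
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            claims.append(cleaned)
    return claims

def judge_faithfulness(context, response, call_judge):
    """call_judge is any callable that maps a prompt string to reply text."""
    prompt = JUDGE_TEMPLATE.format(context=context, response=response)
    return parse_unsupported(call_judge(prompt))

def fake_judge(prompt):
    """Canned reply used only to make this sketch runnable."""
    return "1. The fee is $15.\n- Shipping is free."

flagged = judge_faithfulness("The fee is $10.", "The fee is $15. Shipping is free.", fake_judge)
print(flagged)
```

Passing the model call in as a parameter keeps the parsing logic testable without network access. A stricter variant could ask the judge for JSON and validate it instead of parsing list markers.
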
Logprob-based uncertainty

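Low token log-probabilities (logprobs) tend to correlate with hallucination. Rare but correct tokens, such as unusual names, also score low, so treat flagged tokens as candidates for verification rather than confirmed errors: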

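LOGPROB_THRESHOLD = -2.0  # exp(-2.0) is about 0.135, i.e. under ~14% token probability

def detect_uncertain_spans(response_with_logprobs):
    # response_with_logprobs: iterable of (token, logprob) pairs
    uncertain_spans = []
    for token, logprob in response_with_logprobs:
        if logprob < LOGPROB_THRESHOLD:  # Low confidence threshold
            uncertain_spans.append(token)
    return uncertain_spans  # Flag these for verification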
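
Individual tokens are awkward to review, so it can help to merge adjacent low-confidence tokens into readable spans. The sketch below assumes the same `(token, logprob)` pair format as above; `merge_uncertain_spans` is a hypothetical helper and the logprob values are invented for illustration, not taken from a real model.

```python
import math

def merge_uncertain_spans(tokens_with_logprobs, threshold=-2.0):
    """Group consecutive tokens whose logprob is below threshold into spans."""
    spans, current = [], []
    for token, logprob in tokens_with_logprobs:
        if logprob < threshold:
            current.append(token)
        elif current:
            # A confident token ends the current uncertain run
            spans.append("".join(current).strip())
            current = []
    if current:
        spans.append("".join(current).strip())
    return spans

# Invented values: a citation where the name, year and court are low confidence
tokens = [
    ("Smith", -0.1), (" v.", -0.2), (" Johnson", -2.9), (",", -0.3),
    (" 2019", -3.4), (",", -0.2), (" 5th", -2.6), (" Circuit", -2.2),
]
spans = merge_uncertain_spans(tokens)
print(spans)

# A logprob of -2.0 corresponds to a token probability of exp(-2.0)
print(round(math.exp(-2.0), 3))
```

Here the two adjacent low-confidence tokens for the court are merged into one span, which is easier to hand to a verification step than separate tokens.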

What to do when hallucination is detected

  1. Remove the claim: Strip unsupported statements from the response
  2. Add hedging: “Based on available information…” instead of stating as fact
  3. Flag for user: “Note: I could not verify this claim against available sources”
  4. Regenerate: Try again with stronger grounding instructions
  5. Escalate: Route to human review for high-stakes contexts
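
One way to combine the options above is a small policy function that picks an action from the detection result. This is a sketch under assumptions, not a recommended policy: the `remediate` name, the ordering of the checks, and the 0.5 cutoff are all hypothetical, and only removal, regeneration, and escalation are shown (hedging and flagging would slot in the same way).

```python
def remediate(claims, unsupported, *, high_stakes=False, retries_left=1,
              max_unsupported_fraction=0.5):
    """Return (action, text); text is None when nothing should be shown yet.

    Assumes every item in `unsupported` also appears in `claims`.
    """
    if not unsupported:
        return ("deliver", " ".join(claims))
    if high_stakes:
        # High-stakes contexts go to a human regardless of how much was flagged
        return ("escalate", None)
    fraction = len(unsupported) / len(claims)
    if fraction > max_unsupported_fraction and retries_left > 0:
        # Too much of the response is unsupported: try generating again
        return ("regenerate", None)
    flagged = set(unsupported)
    kept = [c for c in claims if c not in flagged]
    return ("remove", " ".join(kept))

claims = ["A.", "B.", "C."]
print(remediate(claims, ["B."]))
print(remediate(claims, ["A.", "B."]))
print(remediate(claims, ["B."], high_stakes=True))
```

Returning the action name alongside the text lets the caller log which remedy was applied, which is useful when tuning the cutoff later.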

Real-world detection systems

  • Vectara HHEM - open-source hallucination evaluation model
  • Ragas Faithfulness - measures if RAG answers are faithful to retrieved context
  • Groundedness checks (Azure) - built-in hallucination detection for Azure OpenAI
  • Lynx (Patronus AI) - hallucination detection specifically for RAG systems
  • SelfCheckGPT - zero-resource hallucination detection via self-consistency

How to apply in practice

For RAG systems: Always check groundedness. If a claim cannot be traced to a retrieved document, either remove it or flag it with lower confidence.

For factual applications: Implement self-consistency checking for critical outputs. If the same question produces different factual answers across runs, flag the inconsistency.

Set domain-appropriate thresholds: A creative writing tool can tolerate some hallucination (it is called “creativity”). A medical information system cannot tolerate any factual errors. Calibrate detection sensitivity to your domain’s risk tolerance.
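
Domain-specific sensitivity can be captured as configuration rather than scattered constants. The sketch below is illustrative only: the `DetectionPolicy` fields, the domain names, and every number are placeholder assumptions that would need calibrating against labeled data for a real product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DetectionPolicy:
    entailment_threshold: float  # minimum score for a claim to count as supported
    max_unsupported_rate: float  # highest tolerated share of unsupported claims

# Placeholder values, not recommendations
POLICIES = {
    "creative_writing": DetectionPolicy(entailment_threshold=0.5, max_unsupported_rate=1.0),
    "customer_support": DetectionPolicy(entailment_threshold=0.7, max_unsupported_rate=0.1),
    "medical_information": DetectionPolicy(entailment_threshold=0.9, max_unsupported_rate=0.0),
}

def passes(domain, hallucination_rate):
    """True if the measured rate is within the domain's tolerance."""
    return hallucination_rate <= POLICIES[domain].max_unsupported_rate

print(passes("creative_writing", 0.4))
print(passes("medical_information", 0.01))
```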

Do not chase zero hallucination: Some amount is inherent to the architecture. Instead, design systems that are robust to occasional hallucination: citations for verification, confidence indicators, human review for high-stakes outputs.

FAQ

Q: Can we eliminate hallucinations entirely?

Not with current architectures. LLMs generate statistically likely continuations, not verified truths. You can minimize hallucination (RAG, grounding, lower temperature) and detect it post-generation, but elimination requires architectural changes not yet in production (e.g., models with explicit retrieval-then-generate separation, or verified reasoning chains).

Q: Why does the model hallucinate even when correct information is in the context?

Several causes: context is too long and the relevant passage is in the “lost in the middle” zone, the model’s parametric knowledge contradicts the context (and the model trusts itself), or the model’s attention distributes poorly across multiple context passages. Stronger grounding instructions and placing critical context first/last help.
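
Placing critical context first and last can be done mechanically once passages are ranked. A minimal sketch, assuming the input list is already sorted from most to least relevant; `reorder_for_edges` is a hypothetical helper that alternates passages between the front and the back so the weakest ones end up in the middle.

```python
def reorder_for_edges(docs_by_relevance):
    """Put the most relevant passages at the start and end of the context."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Even ranks fill the front, odd ranks fill the back
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best passage comes last
    return front + back[::-1]

ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_edges(ranked))
```

With five passages the two most relevant ones (`d1`, `d2`) land at the edges and the least relevant (`d5`) sits in the middle, where it matters least if the model under-attends to it.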

Interview questions

Q: Design a hallucination detection pipeline for a financial news summarization product. False financial information could cause trading losses.

Use a multi-layer approach:

  1. Groundedness: every numerical claim (prices, percentages, dates) must trace to a specific source article. Extract the numbers from the summary and verify each against the source text.
  2. Cross-source validation: if the summary draws on multiple articles, check those articles for contradictions before presenting it.
  3. Entity verification: verify company names, ticker symbols, and executive names against a known database.
  4. Temporal consistency: financial data is time-sensitive, so verify that dates match and that the summary does not present historical figures as current.
  5. Confidence scoring: assign a confidence level per claim. Claims verified against multiple sources get high confidence, single-source claims get medium, and unverifiable claims are stripped or flagged.

Threshold: for financial data, any unverifiable numerical claim is removed rather than presented with uncertainty. The system should never state a stock price, percentage change, or financial metric it cannot directly attribute to a source.