Hallucination Detection: Catching AI Fabrications Before Users See Them
Your legal AI assistant cites “Smith v. Johnson, 2019, 5th Circuit” to support its analysis. The lawyer trusts the citation and includes it in a brief. The case does not exist: the model fabricated a plausible-sounding citation from patterns in its training data, and the lawyer now faces sanctions for citing non-existent precedent. This is not a rare edge case. Studies report that LLMs hallucinate citations 15-30% of the time in legal and academic contexts, with the exact rate depending on the model and the task.
Hallucination detection does not eliminate hallucinations (that requires architectural changes to LLMs themselves). It identifies them after generation so you can filter, flag, or regenerate before the output reaches users.
Types of hallucinations
graph TD
    subgraph types["Hallucination Types"]
        FC["Factual Fabrication<br/>'Paris has 5 million people'<br/>(incorrect specific claim)"]
        CC["Confabulated Citation<br/>'According to Smith 2019...'<br/>(source doesn't exist)"]
        EC["Entity Confusion<br/>'CEO John Smith said...'<br/>(wrong person attributed)"]
        IC["Intrinsic Contradiction<br/>'It costs $10' then 'the $15 fee'<br/>(self-contradictory)"]
        UC["Unsupported Inference<br/>'Therefore users prefer X'<br/>(not supported by context)"]
    end
    style FC fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
    style CC fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
    style EC fill:#FAEEDA,stroke:#854F0B,color:#633806
    style IC fill:#FAEEDA,stroke:#854F0B,color:#633806
    style UC fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Detection approaches
Context-grounded verification
For RAG systems, check if claims are supported by retrieved context:
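def check_groundedness(response, context_docs, extract_claims, entailment_model, threshold=0.7):
    # `extract_claims(text)` returns a list of claim strings (NLI or LLM-based extraction).
    # `entailment_model.check(doc, claim)` returns a support score between 0 and 1.
    # Both are supplied by the caller.
    claims = extract_claims(response)
    results = []
    for claim in claims:
        # A claim counts as supported if at least one source document entails it
        supported = any(
            entailment_model.check(doc, claim) > threshold
            for doc in context_docs
        )
        results.append({"claim": claim, "supported": supported})
    unsupported = [r for r in results if not r["supported"]]
    # Avoid division by zero when no claims were extracted
    rate = len(unsupported) / len(results) if results else 0.0
    return {"hallucination_rate": rate, "unsupported_claims": unsupported}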
Self-consistency checking
Generate multiple responses and check for agreement:
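def self_consistency_check(query, llm, extract_claims, n=3):
    # `llm.generate(query, temperature=...)` and `extract_claims(text)` are supplied by the caller.
    # Exact-match set comparison is a simplification: paraphrased claims will not match.
    responses = [llm.generate(query, temperature=0.7) for _ in range(n)]
    claim_sets = [set(extract_claims(r)) for r in responses]
    # Claims that appear in all responses are more likely true
    consistent_claims = set.intersection(*claim_sets)
    # Claims missing from at least one response are treated as suspicious
    inconsistent_claims = set.union(*claim_sets) - consistent_claims
    return {"confident": consistent_claims, "suspicious": inconsistent_claims}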
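Comparing sampled responses by exact claim match can be brittle, because the same fact is often phrased differently across runs. A softer alternative is to score how much the samples agree overall. The sketch below is a minimal illustration that uses token-level Jaccard overlap as a crude stand-in for semantic comparison; the function names and the sample sentences are hypothetical, and a production system would more likely use an NLI model or embeddings.

```python
import re
from itertools import combinations


def tokenize(text: str) -> set[str]:
    """Lowercase word and number tokens (a crude proxy for meaning)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of tokens the two texts have in common."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def agreement_score(responses: list[str]) -> float:
    """Mean pairwise Jaccard similarity across sampled responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single sample cannot disagree with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Illustrative samples: two runs agree, one reports a different fee
samples = [
    "The fee is 10 dollars",
    "The fee is 10 dollars",
    "The fee is 15 dollars",
]
score = agreement_score(samples)
```

A low score suggests the model is not stable on the question and the answer deserves verification. The cutoff for "low" is a tuning decision that depends on the domain.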
LLM-as-judge for faithfulness
judge_prompt = """
Given the source documents and the AI response, identify any claims in the response
that are NOT supported by the source documents.
Source documents: {context}
AI response: {response}
List unsupported claims (or say "None" if all claims are supported):
"""
Logprob-based uncertainty
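Low token log-probabilities (logprobs) correlate with hallucination, although the signal is imperfect because a model can be confidently wrong:
def detect_uncertain_spans(response_with_logprobs, threshold=-2.0):
    # `response_with_logprobs` is an iterable of (token, logprob) pairs
    uncertain_spans = []
    for token, logprob in response_with_logprobs:
        # A logprob below -2.0 means the token's probability was under about 13.5% (e^-2)
        if logprob < threshold:
            uncertain_spans.append(token)
    return uncertain_spans  # Flag these for verification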
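Individual tokens are hard to review on their own, so it can help to merge adjacent low-confidence tokens into readable spans. The following is a minimal sketch under the same assumption as above (input is a sequence of `(token, logprob)` pairs); `group_uncertain_spans` is a hypothetical helper and the logprob values in the example are invented for illustration.

```python
import math


def group_uncertain_spans(tokens_with_logprobs, threshold=-2.0):
    """Merge consecutive tokens whose logprob is below `threshold` into spans."""
    spans, current = [], []
    for token, logprob in tokens_with_logprobs:
        if logprob < threshold:
            current.append((token, logprob))
        elif current:
            spans.append(current)
            current = []
    if current:  # close a span that runs to the end of the response
        spans.append(current)
    return [
        {
            "text": "".join(tok for tok, _ in span).strip(),
            # Probability of the least likely token in the span
            "min_prob": math.exp(min(lp for _, lp in span)),
        }
        for span in spans
    ]


# Invented values: the case name and the year are the low-confidence parts
example = [
    ("The", -0.1), (" case", -0.3), (" Smith", -2.5), (" v.", -2.2),
    (" Johnson", -3.1), (" was", -0.2), (" decided", -0.4),
    (" in", -0.1), (" 2019", -2.8),
]
spans = group_uncertain_spans(example)
```

Here the grouped spans are the case name and the year, which are the parts of a citation a reviewer would want to check first.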
What to do when hallucination is detected
- Remove the claim: Strip unsupported statements from the response
- Add hedging: “Based on available information…” instead of stating as fact
- Flag for user: “Note: I could not verify this claim against available sources”
- Regenerate: Try again with stronger grounding instructions
- Escalate: Route to human review for high-stakes contexts
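The options above can be combined into a simple decision policy. The sketch below is one possible arrangement, not a recommended standard: the `Action` enum, the `choose_action` function, and the 0.5 cutoff are all assumptions chosen for illustration, and hedging is left out for brevity (it could replace removal when a claim is essential to the answer).

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"              # nothing unsupported, return as-is
    REMOVE = "remove"          # strip the unsupported claims
    FLAG = "flag"              # show the response with a verification note
    REGENERATE = "regenerate"  # retry with stronger grounding instructions
    ESCALATE = "escalate"      # route to human review


def choose_action(unsupported, total, high_stakes, attempts_left, regenerate_above=0.5):
    """Pick a response to detected hallucination (illustrative policy)."""
    if unsupported < 0 or total < unsupported:
        raise ValueError("unsupported must be between 0 and total")
    if unsupported == 0:
        return Action.PASS
    if high_stakes:
        return Action.ESCALATE
    rate = unsupported / total
    if rate > regenerate_above:
        # Mostly unsupported: retry if possible, otherwise warn the user
        return Action.REGENERATE if attempts_left > 0 else Action.FLAG
    # A few unsupported claims: remove them and keep the rest
    return Action.REMOVE
```

For example, `choose_action(1, 4, high_stakes=False, attempts_left=1)` returns `Action.REMOVE`, while the same counts with `high_stakes=True` return `Action.ESCALATE`.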
Real-world detection systems
- Vectara HHEM - open-source hallucination evaluation model
- Ragas Faithfulness - measures if RAG answers are faithful to retrieved context
- Groundedness checks (Azure) - built-in hallucination detection for Azure OpenAI
- Lynx (PatronusAI) - hallucination detection specifically for RAG systems
- SelfCheckGPT - zero-resource hallucination detection via self-consistency
How to apply in practice
For RAG systems: Always check groundedness. If a claim cannot be traced to a retrieved document, either remove it or flag it with lower confidence.
For factual applications: Implement self-consistency checking for critical outputs. If the same question produces different factual answers across runs, flag the inconsistency.
Set domain-appropriate thresholds: A creative writing tool can tolerate some hallucination (it is called “creativity”). A medical information system has almost no tolerance for factual errors. Calibrate detection sensitivity to your domain’s risk tolerance.
Do not chase zero hallucination: Some amount is inherent to the architecture. Instead, design systems that are robust to occasional hallucination: citations for verification, confidence indicators, human review for high-stakes outputs.
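Domain-appropriate thresholds can be kept in a small configuration object so that the detection code itself stays domain-agnostic. This is a minimal sketch: `DetectionPolicy`, `POLICIES`, `should_block`, and every number in the table are placeholder assumptions to be calibrated against your own evaluation data, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionPolicy:
    entailment_threshold: float   # minimum score for a claim to count as supported
    max_unsupported_rate: float   # share of unsupported claims tolerated before blocking


# Placeholder values for illustration only
POLICIES = {
    "creative_writing": DetectionPolicy(entailment_threshold=0.5, max_unsupported_rate=1.0),
    "general_qa": DetectionPolicy(entailment_threshold=0.7, max_unsupported_rate=0.1),
    "medical": DetectionPolicy(entailment_threshold=0.9, max_unsupported_rate=0.0),
}


def should_block(domain: str, unsupported_rate: float) -> bool:
    """Return True when the response exceeds the domain's tolerance."""
    policy = POLICIES[domain]
    return unsupported_rate > policy.max_unsupported_rate
```

With these placeholder values, `should_block("medical", 0.01)` is `True` while `should_block("creative_writing", 0.8)` is `False`. A blocked response would then go to one of the fallback actions (regenerate, flag, or human review) rather than being shown directly.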
FAQ
Q: Can we eliminate hallucinations entirely?
Not with current architectures. LLMs generate statistically likely continuations, not verified truths. You can minimize hallucination (RAG, grounding, lower temperature) and detect it post-generation, but elimination requires architectural changes not yet in production (e.g., models with explicit retrieval-then-generate separation, or verified reasoning chains).
Q: Why does the model hallucinate even when correct information is in the context?
Several causes: the context is too long and the relevant passage falls in the “lost in the middle” zone; the model’s parametric knowledge contradicts the context (and the model trusts itself); or the model’s attention is spread poorly across multiple context passages. Stronger grounding instructions help, as does placing critical context at the start or end of the prompt.
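Placing critical context at the edges of the prompt can be automated when passages arrive ranked by relevance. The sketch below is one simple way to do it, assuming the input list is sorted from most to least relevant; `reorder_for_edges` is a hypothetical helper name.

```python
def reorder_for_edges(docs_by_relevance):
    """Put the most relevant passages first and last, the least relevant in the middle.

    Assumes `docs_by_relevance` is sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4... fill the back
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ordered = reorder_for_edges(["d1", "d2", "d3", "d4", "d5"])
# ordered == ["d1", "d3", "d5", "d4", "d2"]
```

The top-ranked passage ends up first, the second-ranked passage ends up last, and the weakest passage sits in the middle where the model is most likely to overlook it.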
Interview questions
Q: Design a hallucination detection pipeline for a financial news summarization product. False financial information could cause trading losses.
Use a multi-layer approach:
(1) Groundedness: every numerical claim (prices, percentages, dates) must trace to a specific source article. Extract the numbers from the summary and verify each against the source text.
(2) Cross-source validation: if the summary draws on multiple articles, check them for contradictions before presenting.
(3) Entity verification: verify company names, ticker symbols, and executive names against a known database.
(4) Temporal consistency: financial data is time-sensitive, so verify that dates match and that the summary does not present historical figures as current.
(5) Confidence scoring: assign confidence per claim. Claims verified against multiple sources get high confidence, single-source claims get medium confidence, and unverifiable claims are stripped or flagged.
Threshold: for financial data, any unverifiable numerical claim is removed rather than presented with uncertainty. The system should never state a stock price, percentage change, or financial metric it cannot directly attribute to a source.
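The numeric groundedness layer can be sketched with a regular expression that pulls figures out of the summary and checks each against the source articles. This is a deliberately strict, minimal illustration: the helper names are hypothetical, the company and figures are invented, and the matching is purely textual (so "3.2%" in a summary will not match "3.2 percent" in a source, which errs on the side of flagging).

```python
import re

# Matches figures such as $182.50, 1,200, 3.2%
NUMBER_PATTERN = re.compile(r"\$?\d[\d,]*(?:\.\d+)?%?")


def extract_numbers(text: str) -> set[str]:
    """Return normalized numeric strings (currency sign and commas removed)."""
    return {m.replace(",", "").replace("$", "") for m in NUMBER_PATTERN.findall(text)}


def unverified_numbers(summary: str, sources: list[str]) -> list[str]:
    """Numbers that appear in the summary but in none of the sources."""
    source_numbers = set()
    for source in sources:
        source_numbers |= extract_numbers(source)
    return sorted(extract_numbers(summary) - source_numbers)


# Invented example: the summary misreports the revenue figure
source = (
    "Acme Corp shares closed at $182.50 on Tuesday, up 3.2% "
    "after reporting revenue of $1,200 million."
)
summary = "Acme shares rose 3.2% to $182.50 after revenue hit $1,400 million."
missing = unverified_numbers(summary, [source])
```

Any figure returned by `unverified_numbers` would be removed from the summary under the threshold rule above. A fuller implementation would also tie each number to the entity and date it describes, since a correct number attached to the wrong company is still a hallucination.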