Evaluation Frameworks (Evals): Testing AI Systems Systematically


You update your system prompt. It fixes the bug one customer reported. You deploy. Three other behaviors break, and you do not notice because you manually tested only the one scenario. Two days later, support tickets spike. You revert. This cycle repeats monthly until you build an eval suite: 200 test cases that run automatically before every deployment. Now you catch regressions in 60 seconds instead of 2 days.

Eval frameworks are the testing infrastructure for non-deterministic systems. They solve the fundamental challenge: how do you test something that gives different answers each time and where “correct” is often subjective?

What an eval framework provides

graph TD
  subgraph framework["Eval Framework Components"]
      DS["Dataset<br/>Test cases with expected outputs"]
      RUN["Runner<br/>Executes model against test cases"]
      SCORE["Scorers<br/>Evaluates output quality"]
      REPORT["Reporting<br/>Regression detection, dashboards"]
      CI["CI Integration<br/>Block deploys on failures"]
  end

  DS --> RUN --> SCORE --> REPORT --> CI

  style DS fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style RUN fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style SCORE fill:#FAEEDA,stroke:#854F0B,color:#633806
  style CI fill:#FCEBEB,stroke:#A32D2D,color:#791F1F

Scorer types

Exact match

def exact_match(output, expected):
    return output.strip().lower() == expected.strip().lower()

Best for: classification, extraction of specific values.

Semantic similarity

def semantic_score(output, expected):
    return cosine_similarity(embed(output), embed(expected))

Best for: paraphrases and summaries where the wording varies but the meaning is the same.
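
The snippet above leaves `embed` and `cosine_similarity` undefined. Here is a minimal, self-contained sketch of the same idea. The `embed` function is a toy bag-of-words stand-in, an assumption for illustration only. In practice you would swap in vectors from an embedding model, which also capture synonyms that raw word counts miss.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: lowercase word counts.
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared words, divided by the product of the vector norms.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # empty text has no direction; treat it as no similarity
    return dot / (norm_a * norm_b)

def semantic_score(output, expected):
    return cosine_similarity(embed(output), embed(expected))
```

Scores near 1.0 indicate near-identical wording under this toy embedding, and 0.0 means no shared words. With a real embedding model, pick the pass threshold by checking scores on a handful of known-good and known-bad pairs.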

LLM-as-judge

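def llm_judge(output, expected, criteria):
    prompt = f"Rate this output 1-5 on {criteria}. Reply with only the number.\nExpected: {expected}\nActual: {output}"
    reply = judge_model.generate(prompt)  # judge_model: placeholder for your LLM client
    match = re.search(r"\b[1-5]\b", reply)  # tolerate replies such as "Score: 4"; needs `import re`
    if match is None:
        raise ValueError(f"no 1-5 score in judge reply: {reply!r}")
    return float(match.group())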

Best for: open-ended generation, style, completeness.
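
Judge models are themselves non-deterministic, so a single rating can be noisy. One common mitigation, which goes beyond what the scorer above does, is to sample the judge several times and take the median. The sketch below assumes `judge` is any callable with the same signature as `llm_judge`. `flaky_judge` is a made-up stub so the example runs without a model.

```python
from statistics import median

def stable_judge_score(judge, output, expected, criteria, samples=3):
    """Call the judge `samples` times; return (median score, spread)."""
    scores = [judge(output, expected, criteria) for _ in range(samples)]
    return median(scores), max(scores) - min(scores)

# Stub judge for demonstration: replays fixed ratings instead of calling a model.
_ratings = iter([4.0, 5.0, 4.0])

def flaky_judge(output, expected, criteria):
    return next(_ratings)

score, spread = stable_judge_score(flaky_judge, "actual text", "expected text", "completeness")
print(score, spread)  # 4.0 1.0
```

A large spread is a useful signal in its own right: it usually means the criterion is too vague for the judge to apply consistently.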

Custom rubrics

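import re

def rubric_score(output, rubric):
    # Each check yields True/False; the score is the fraction of checks passed.
    scores = {}
    scores["contains_citation"] = bool(re.search(r'\[source:', output))
    scores["under_200_words"] = len(output.split()) < 200
    # contains_unsupported_claims: placeholder for your own grounding check
    scores["no_hallucination"] = not contains_unsupported_claims(output)
    return sum(scores.values()) / len(scores)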

Best for: specific requirements that are programmatically checkable.

Building an eval pipeline

# Define eval dataset
eval_cases = [
    {"input": "What's our refund policy?", 
     "expected": "30-day refund for unused subscriptions",
     "category": "policy_lookup"},
    {"input": "Cancel my subscription", 
     "expected_action": "initiate_cancellation",
     "category": "action_request"},
    # ... 200+ cases
]

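# Run evals
import time
from statistics import mean

async def run_eval_suite(model_config, previous_run=None):
    results = []
    for case in eval_cases:
        # Time the generation call itself so latency reflects this case
        start = time.perf_counter()
        output = await generate(case["input"], config=model_config)
        latency = time.perf_counter() - start
        # Action cases store their target under "expected_action" instead of "expected"
        expected = case.get("expected", case.get("expected_action"))
        scores = {
            # llm_judge returns 1-5; rescale to 0-1 so it can be compared to a percentage threshold
            "correctness": (llm_judge(output, expected, "factual accuracy") - 1) / 4,
            "format": format_check(output),
            "latency": latency,
        }
        results.append({"case": case, "output": output, "scores": scores})

    # Compute aggregates
    return {
        "overall_accuracy": mean(r["scores"]["correctness"] for r in results),
        "by_category": group_scores_by_category(results),
        "regressions": detect_regressions(results, previous_run),
    }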
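
The pipeline calls `group_scores_by_category` and `detect_regressions` without defining them. One possible shape for both is sketched below. It assumes `previous_run` is the dictionary returned by an earlier call to `run_eval_suite` (so it has a `by_category` entry), and that `tolerance` is expressed in the same units as the correctness scores. The default of 0.05 assumes scores on a 0-1 scale.

```python
from collections import defaultdict
from statistics import mean

def group_scores_by_category(results):
    """Mean correctness score per category."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["case"]["category"]].append(r["scores"]["correctness"])
    return {category: mean(scores) for category, scores in buckets.items()}

def detect_regressions(results, previous_run, tolerance=0.05):
    """List categories whose mean score fell by more than `tolerance`."""
    if not previous_run:
        return []  # first run: nothing to compare against
    current = group_scores_by_category(results)
    baseline = previous_run["by_category"]
    return [
        {"category": category, "before": baseline[category], "after": score}
        for category, score in current.items()
        if category in baseline and baseline[category] - score > tolerance
    ]
```

Comparing per category rather than overall matters because a large gain in one category can hide a collapse in another when only the global mean is checked.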

Eval-driven development workflow

  1. User reports issue → add failing test case to eval suite
  2. Fix the prompt/pipeline → run eval suite
  3. Verify fix passes AND no regressions on other cases
  4. Deploy only if eval passes threshold
  5. Monitor production metrics vs eval predictions

This is test-driven development (TDD) for AI systems. The eval suite grows with every bug report, building cumulative regression protection.
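
Step 1 of the workflow can be as small as appending one record to a file. The sketch below assumes the suite is stored as a JSON Lines file with one case per line. The function names and the `source` field (for example, a ticket reference) are hypothetical choices for illustration, not part of any particular tool.

```python
import json
from pathlib import Path

def load_cases(path):
    """Read all eval cases from a JSON Lines file (empty list if it does not exist)."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def add_regression_case(path, user_input, expected, category, source):
    """Append a case derived from a bug report; skip it if the input is already covered."""
    if any(case["input"] == user_input for case in load_cases(path)):
        return False
    case = {"input": user_input, "expected": expected,
            "category": category, "source": source}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return True
```

Keeping the cases in a plain file under version control means each new test case is reviewed alongside the prompt change that fixes it.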

Real-world eval tools

  • OpenAI Evals - open-source framework for custom eval suites
  • Braintrust - production eval platform with scoring, comparison, and CI integration
  • Promptfoo - CLI tool for testing prompts across models and configurations
  • DeepEval - Python framework with built-in metrics (hallucination, toxicity, relevance)
  • Ragas - specifically for RAG evaluation (faithfulness, context relevance, answer relevance)

How to apply in practice

Start with 50 test cases covering your top failure modes. Expand to 200+ over time by adding every production bug as a test case.

Run evals on every prompt change. Treat prompt modifications like code changes - they need automated testing before deploy.

Set regression thresholds. “Deploy if overall accuracy > 85% AND no category drops more than 5 percentage points from baseline.” Concrete, enforceable gates.
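
A gate like this reduces to a small function. In the sketch below, `current` and `baseline` are assumed to be dictionaries with `overall_accuracy` and `by_category` entries on a 0-1 scale, and the category drop is read as an absolute difference (percentage points). Both are assumptions you should adjust to your own reporting format.

```python
def check_deploy_gate(current, baseline, min_accuracy=0.85, max_category_drop=0.05):
    """Return (ok, failures) for a run compared against a baseline run."""
    failures = []
    if current["overall_accuracy"] <= min_accuracy:
        failures.append(
            f"overall accuracy {current['overall_accuracy']:.2f} "
            f"is not above {min_accuracy:.2f}"
        )
    for category, before in baseline["by_category"].items():
        after = current["by_category"].get(category)
        if after is None:
            # A category that vanished from the run is treated as a failure, not a pass.
            failures.append(f"{category}: no results in current run")
        elif before - after > max_category_drop:
            failures.append(f"{category}: dropped from {before:.2f} to {after:.2f}")
    return not failures, failures
```

In CI, a wrapper script would call this and exit with a non-zero status when `ok` is false, printing `failures` so the reason for the blocked deploy is visible in the build log.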

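Separate eval environments. Use settings that are as deterministic as possible (temperature 0, a fixed seed where the API supports one) so eval results are reproducible, even if production uses a higher temperature. Some hosted models still vary slightly at temperature 0, so treat small score differences between runs as noise rather than regressions.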
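
One way to keep eval and production settings from drifting apart is to derive the eval configuration from the production one and override only the sampling parameters. `GenerationConfig`, its fields, and the model name are hypothetical names for this sketch. Map them onto whatever your model client actually accepts.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class GenerationConfig:
    model: str
    temperature: float = 0.7
    seed: Optional[int] = None

def eval_config(production, seed=1234):
    """Copy the production config, overriding only the sampling settings."""
    return replace(production, temperature=0.0, seed=seed)

production = GenerationConfig(model="example-model", temperature=0.8)
evaluation = eval_config(production)
```

Because the dataclass is frozen and `replace` returns a new object, the production settings cannot be modified by accident while running evals.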

FAQ

Q: How do I eval when there is no single “correct” answer?

Use rubric-based scoring: define 3-5 criteria that a good answer must satisfy (factually grounded, appropriate tone, addresses the question, cites sources). Score each criterion independently. An answer can score well without matching a specific expected output word-for-word.

Q: My evals pass but users are still unhappy. What is wrong?

The most likely cause is that your eval cases do not represent real user traffic. Fix: sample 100 real production queries (anonymized), add them to your eval suite, and verify that scores correlate with user satisfaction signals. If they diverge, your scoring criteria need updating.
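
A simple way to check whether scores track satisfaction is a Pearson correlation between per-query eval scores and a per-query satisfaction signal. The sketch below assumes you can pair the two lists by query. Which signal you use (thumbs up/down, ticket reopened, and so on) is up to you.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 suggests the scorers reward what users reward. A value near 0 means the eval is measuring something users do not care about, which is the cue to revisit the scoring criteria.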

Interview questions

Q: Design the evaluation strategy for a medical Q&A system that answers patient questions about medications. What metrics matter and how do you score them?

Critical metrics:

  1. Factual accuracy - scored against verified medical references (must be 98%+).
  2. Appropriate disclaimers - every response must include “consult your doctor” when discussing dosage or interactions (binary check).
  3. Harm avoidance - must never recommend stopping medication without medical guidance (adversarial test cases).
  4. Completeness - addresses all parts of the question.
  5. Readability - appropriate for patient audience (no unexplained jargon).

Scoring: combine automated checks (disclaimer presence, readability score) with LLM-judge for medical accuracy (using a medical-specialized judge model or validated against reference answers).

Red-line threshold: any factual medical error blocks deployment regardless of other scores.
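
The binary disclaimer check and the red-line rule are the easiest parts of this answer to automate. The sketch below is illustrative only: the trigger and disclaimer patterns are simplistic assumptions, and `factually_correct` is assumed to come from a separate accuracy scorer. A real system would need patterns and accuracy checks reviewed by clinicians.

```python
import re

# Assumed patterns; deliberately narrow and meant to be extended.
TRIGGER = re.compile(r"\b(?:dos(?:e|es|age|ing)|interact(?:s|ion|ions)?)\b", re.IGNORECASE)
DISCLAIMER = re.compile(r"consult (?:your|a) (?:doctor|physician|pharmacist)", re.IGNORECASE)

def disclaimer_ok(response):
    """Pass unless the response discusses dosage or interactions without a disclaimer."""
    if TRIGGER.search(response) is None:
        return True
    return DISCLAIMER.search(response) is not None

def medical_gate(case_results):
    """Red-line gate: any factual error or missing disclaimer blocks deployment.

    case_results: list of dicts with "response" (str) and "factually_correct" (bool).
    """
    blocking = [
        r for r in case_results
        if not r["factually_correct"] or not disclaimer_ok(r["response"])
    ]
    return not blocking, blocking
```

Unlike the averaged metrics, this gate has no threshold to tune: a single blocking case fails the run, which matches the red-line rule described above.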