Evaluating LLM-based Applications

Important Links

rlancemartin (updated Jan 22, 2024)

Key Takeaways

Common types of LLM mistakes

  • Hallucination (confidently stating something incorrect)
  • Bad output formatting
  • Wrong tone
  • Provoked to go "off the rails"
  • Overly cautious ("RLHF responses")
  • Repetitions

How Evaluation helps

  • Validation that your model avoids common failure modes
  • Common language to make fast go / no-go decisions
  • Roadmap for improvements to model performance

What makes a good evaluation

  • Correlated with outcomes
  • A single metric (or a small number of metrics)
  • Fast & automatic to compute

LLM Evaluation

  • We do not have access to the training data of LLMs.
  • The production distribution always differs from the training distribution.
  • Models are trained on the internet, so there is always drift, though it doesn't matter as much.
  • Outputs are qualitative, so success is hard to measure.
  • Behaviors are diverse, so aggregate metrics don't work.

What's wrong with benchmarks

  • Benchmarks don't measure performance on your use case
  • They don't take into account your prompting, your in-context learning, your fine-tuning, etc.
  • They also suffer from all of the measurement issues discussed above


  • Use an LLM to create your own evaluation dataset.
  • Build it incrementally and increase coverage over time
  • Define reliable evaluation metrics
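
As a rough illustration of building an evaluation dataset incrementally with an LLM, the sketch below appends newly generated questions to an existing dataset and skips ones already covered. The `add_cases` helper, the prompt wording, and the `fake_llm` stand-in are all hypothetical names invented for this example; in practice `llm` would wrap whatever model client you use, and its output would need more robust parsing.

```python
import json
from typing import Callable, Dict, List


def add_cases(dataset: List[Dict], topic: str,
              llm: Callable[[str], str], n: int = 3) -> List[Dict]:
    """Ask an LLM for n test questions on a topic and append the unseen ones."""
    # Hypothetical prompt; assumes the model replies with a JSON list of strings.
    prompt = (f"Write {n} distinct test questions about: {topic}. "
              "Return a JSON list of strings.")
    questions = json.loads(llm(prompt))
    seen = {case["question"] for case in dataset}
    for question in questions:
        if question not in seen:  # grow coverage without duplicating cases
            dataset.append({"question": question, "topic": topic})
            seen.add(question)
    return dataset


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return json.dumps(["What is the refund window?",
                       "How do I reset my password?"])


dataset = add_cases([], "customer support", fake_llm)
dataset = add_cases(dataset, "customer support", fake_llm)  # duplicates skipped
print(len(dataset))
```

Calling `add_cases` again with a new topic, or after observing a new failure in production, is one way to grow coverage step by step.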

Evaluation metrics for LLMs

  • Regular eval metrics
    • Accuracy, etc.
  • Reference matching metrics
    • Semantic similarity
    • Ask another LLM, "Are these two answers factually consistent?"
  • "Which is better" metrics
    • Ask an LLM which of the two answers is better, according to any criteria you want
  • "Is the feedback incorporated" metrics
    • Ask an LLM whether the new answer incorporates the feedback on the old answer
  • Static metrics
    • Verify the output has the right structure (e.g., is JSON formatted)
    • Ask a model to grade the answer (on a scale of 1 to 5)
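
A few of the metrics listed above can be sketched in a handful of lines. This is a minimal illustration, not a definitive implementation: the function names, prompt wording, and the `judge` callable (any function that takes a prompt string and returns the grading model's text reply) are assumptions made for this example.

```python
import json
import re
from typing import Callable, Optional


def is_valid_json(output: str) -> bool:
    """Static metric: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def pairwise_preference(question: str, answer_a: str, answer_b: str,
                        judge: Callable[[str], str],
                        criteria: str = "helpfulness") -> str:
    """'Which is better' metric: ask a judge model to pick A or B."""
    prompt = (f"Question: {question}\nAnswer A: {answer_a}\n"
              f"Answer B: {answer_b}\n"
              f"Which answer is better in terms of {criteria}? "
              "Reply with A or B only.")
    reply = judge(prompt).strip().upper()
    return reply if reply in {"A", "B"} else "unparseable"


def grade_answer(question: str, answer: str,
                 judge: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model for a 1-5 grade; return None if no grade is found."""
    prompt = (f"Question: {question}\nAnswer: {answer}\n"
              "Grade the answer on a scale of 1 to 5. Reply with the number.")
    match = re.search(r"\b[1-5]\b", judge(prompt))
    return int(match.group()) if match else None
```

With a stubbed judge such as `lambda prompt: "Score: 4"`, `grade_answer` returns `4`. Treating unparseable judge replies as a separate outcome, rather than guessing, keeps the metric itself from silently hiding failures.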

⚠️ Disclaimer: All screenshots, materials, and other media used in this article are copyrighted by the original platform or authors.