Important Links
- auto-evaluator (by rlancemartin, on GitHub)
Key Takeaways
Common types of LLM mistakes
- Hallucination (confidently stating something incorrect)
- Bad output formatting
- Wrong tone
- Provoked to go "off the rails"
- Overly cautious ("RLHF responses")
- Repetitions
How Evaluation helps
- Validation that your model avoids common failure modes
- Common language to make fast go / no-go decisions
- Roadmap for improvements to model performance
What makes a good evaluation
- Correlated with outcomes
- A single metric (or a small number of metrics)
- Fast & automatic to compute
LLM Evaluation
- We do not have access to the training data of LLMs.
- The production distribution always differs from the training distribution.
- Because LLMs are trained on (roughly) the whole internet, some drift is always present, but it matters less than in traditional ML.
- Success is often qualitative and hard to measure.
- The diversity of behaviors means a single aggregate metric doesn't work.
What's wrong with benchmarks
- Benchmarks don't measure performance on your use case
- They don't take into account your prompting, your in-context learning, your fine-tuning, etc.
- They also have all of the measurement issues we're talking about
Solution
- Use an LLM to create your own evaluation dataset (a sketch follows this list).
- Build it incrementally, increasing coverage as you find gaps.
- Make reliable evaluation metrics
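A minimal sketch of the dataset-bootstrapping idea, assuming the OpenAI Python client (openai>=1.0); the model name, prompt, and `docs` corpus are illustrative, and generated pairs should be reviewed by hand before they enter the eval set:

```python
import json
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY in the environment

client = OpenAI()

def generate_qa_pairs(passage: str, n: int = 3) -> list[dict]:
    """Ask an LLM to draft n question/answer pairs grounded in `passage`."""
    prompt = (
        f"Write {n} question/answer pairs that can be answered solely from the "
        "text below. Respond as a JSON array of objects with keys "
        '"question" and "answer".\n\nText:\n' + passage
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # Sketch only: production code should handle non-JSON replies gracefully.
    return json.loads(resp.choices[0].message.content)

# Grow the dataset incrementally: add passages as coverage gaps appear.
docs = ["<your source passage here>"]  # hypothetical corpus
dataset = [pair for doc in docs for pair in generate_qa_pairs(doc)]
```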
Evaluation metrics for LLMs
- Regular eval metrics
- Accuracy, etc.
- Reference matching metrics
- Semantic similarity between the answer and a reference (sketched after this list)
- Ask another LLM, "are these two answers factually consistent?", etc.
- "Which is better" metrics
- Ask an LLM which of two answers is better, according to any criteria you want (see the judge sketch after this list)
- "Is the feedback incorporated" metrics
- Ask an LLM whether the new answer incorporates the feedback given on the old answer (the same judge sketch covers this)
- Static metrics
- Verify the output has the right structure (e.g., is valid JSON; a check is sketched after this list)
- Ask a model to grade the answer (on a scale of 1 to 5)
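For reference matching via semantic similarity, a minimal sketch using the sentence-transformers package; the model choice and the 0.8 threshold are illustrative, not standards:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

def semantic_match(answer: str, reference: str, threshold: float = 0.8) -> bool:
    """Score an answer against a reference by cosine similarity of their embeddings."""
    emb = model.encode([answer, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(semantic_match("Paris is the capital of France.",
                     "The capital of France is Paris."))  # True: same meaning, different words
```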
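The "which is better" and "is the feedback incorporated" metrics share one pattern: ask a second LLM a constrained A/B or yes/no question about the outputs. A sketch under the same OpenAI-client assumption; the prompts are illustrative:

```python
from openai import OpenAI

client = OpenAI()

def judge(prompt: str) -> str:
    """Send a grading prompt to a judge model and return its verdict string."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def which_is_better(question: str, a: str, b: str) -> str:
    """Pairwise comparison: returns 'A' or 'B' according to the judge."""
    return judge(
        f"Question: {question}\nAnswer A: {a}\nAnswer B: {b}\n"
        "Which answer is more helpful and factually accurate? Reply with exactly 'A' or 'B'."
    )

def feedback_incorporated(old: str, feedback: str, new: str) -> str:
    """Returns 'yes' or 'no': did the revised answer address the feedback?"""
    return judge(
        f"Original answer: {old}\nFeedback: {feedback}\nRevised answer: {new}\n"
        "Does the revised answer incorporate the feedback? Reply with exactly 'yes' or 'no'."
    )
```

Pairwise judges are known to exhibit position bias, so it is worth running each comparison twice with the answers swapped and keeping only consistent verdicts.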
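Static structural checks need no judge model at all; a sketch of the JSON-format check in plain Python:

```python
import json

def is_valid_json(output: str, required_keys=()) -> bool:
    """Static check: does the output parse as a JSON object with the expected keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(required_keys) <= parsed.keys()

print(is_valid_json('{"answer": "42", "sources": []}', {"answer", "sources"}))  # True
print(is_valid_json("Sure! Here is the JSON: {"))                               # False
```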