Evaluation for Large Language Models and Generative AI - A Deep Dive

Important Links

LLM-Evaluation — rajshah4 (updated Jan 22, 2024)

Key Takeaways

Are leaderboards useful?
Automated evaluation pipelines are more useful in practice.
Methods for evaluation
  1. Exact matching approach
    1. Prompts are highly sensitive to characters like brackets or spaces
  2. Similarity approach
    1. Types:
      1. BLEU score
      2. Exact match
      3. Edit distance
      4. ROUGE scores
      5. Word Error Rate
      6. METEOR
      7. Cosine similarity
    2. Pros:
      1. Fast
      2. Easy
    3. Cons:
      1. Biased towards shorter text, which tends to score better under these metrics
      2. Does not consider meaning or sentence structure
      3. Results depend on the choice of tokenization
  3. Functional Correctness Evaluation
  4. Benchmarks
    1. GLUE
    2. BigBench
    3. List of commonly used benchmarks for LLMs — LessWrong
    4. Pros:
      1. A wide range of tasks can be evaluated
      2. Cheap and easy
    5. Cons:
      1. Good for MCQs, not for free-form text
      2. Test data can leak into the training data (contamination)
  5. Human Evaluation
    1. Pros:
      1. Humans can evaluate a WIDE variety of outputs
      2. Humans are the gold standard for some benchmarks
    2. Cons:
      1. Humans are expensive
      2. Humans show large variation (mood and emotions)
      3. Can be biased and factually unreliable
      4. Can be manipulated by different prompts
  6. Human Comparison/Arena
  7. Model-based Approaches
    1. G-Eval
    2. SelfCheckGPT
    3. JudgeLM → train a model to evaluate another model's outputs
    4. Ragas
  8. Red Teaming
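As a rough illustration of the similarity approach, two of the metrics listed above (edit distance and cosine similarity) can be computed in a few lines of plain Python. This sketch uses naive whitespace tokenization, which also shows why these scores depend on the tokenizer:

```python
from collections import Counter
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts (whitespace tokens)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Both functions ignore meaning entirely, which is exactly the con noted above: a paraphrase with different words scores poorly even when it is a correct answer.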
 
Consistent output can be ensured with:
  1. OpenAI function calling
  2. Microsoft guardrailing
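The common idea behind both tools is to constrain the model's reply to a machine-checkable schema. A minimal sketch of that pattern, using a simplified, hypothetical `check_schema` helper rather than either library's real API:

```python
import json


def check_schema(raw: str, schema: dict) -> dict:
    """Parse a model's JSON reply and verify required fields and types.

    `schema` maps field name -> expected Python type. This is a hypothetical,
    simplified stand-in for what function calling / guardrail libraries do.
    """
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"field {field!r} is not {expected.__name__}")
    return data


# Example: a sentiment-extraction schema the model reply must satisfy.
reply = '{"label": "positive", "confidence": 0.92}'
parsed = check_schema(reply, {"label": str, "confidence": float})
```

Replies that fail the check can be retried or rejected, which is what makes the output consistent enough to evaluate automatically.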
 
Evaluation Components
Retrieval:
  • Low Precision: Not all chunks in the retrieved set are relevant.
  • Low Recall: Not all relevant chunks are retrieved.
    • Were they in the proper order?
    • Were they outdated?
    • What was the latency?
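Given the IDs of the retrieved chunks and a labeled set of relevant chunks, precision and recall can be computed directly. A minimal sketch (the function name is illustrative, not from any library):

```python
def retrieval_metrics(retrieved: list, relevant: set) -> tuple:
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Retrieved four chunks, but only two of the three relevant ones.
p, r = retrieval_metrics(["a", "b", "c", "d"], {"a", "b", "e"})
```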
Augmentation:
  • How can we ensure the answers are factually correct?
  • How can we measure whether the answers are understandable?
  • Toxicity/bias issues
  • How can we measure latency?
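One crude way to probe factual grounding is to check how much of the answer is lexically supported by the retrieved context. This is only a naive heuristic (tools like Ragas use model-based checks instead), and the helper name is hypothetical:

```python
def support_score(answer: str, context: str) -> float:
    """Naive grounding heuristic: fraction of answer tokens that also
    appear in the retrieved context (ignores word order and meaning)."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A low score flags answers that introduce material not present in the retrieved chunks, which is a cheap first filter before a more expensive model-based factuality check.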
 

⚠️Disclaimer: All the screenshots, materials, and other media documents used in this article are copyrighted to the original platform or authors.