Evaluation for Large Language Models and Generative AI - A Deep Dive

Important Links

LLM-Evaluation — rajshah4 (updated Jan 22, 2024)

Key Takeaways

Are leaderboards useful?
Automated evaluation pipelines are more useful in practice.
Methods for evaluation
  1. Exact matching approach
    1. Prompts are highly sensitive to characters like brackets or spaces
  2. Similarity approach
    1. Types:
      1. BLEU score
      2. Exact match
      3. Edit distance
      4. ROUGE scores
      5. Word Error Rate
      6. METEOR
      7. Cosine similarity
    2. Pros:
      1. Fast
      2. Easy
    3. Cons:
      1. Biased towards shorter text, which tends to score better under these metrics
      2. Does not consider meaning or sentence structure
      3. Results depend on the choice of tokenization
  3. Functional Correctness Evaluation
  4. Benchmarks
    1. GLUE
    2. BigBench
    3. List of commonly used benchmarks for LLMs — LessWrong
    4. Pros:
      1. A wide range of tasks can be evaluated
      2. Cheap and easy
    5. Cons:
      1. Good for MCQs, not for free-form text
      2. Test data can leak into the training data (contamination)
  5. Human Evaluation
    1. Pros:
      1. Humans can evaluate a WIDE variety of outputs
      2. Humans are the gold standard for some benchmarks
    2. Cons:
      1. Humans are expensive
      2. Humans show large variation (mood and emotions)
      3. Can be biased and factually unreliable
      4. Can be manipulated by different prompts
  6. Human Comparison/Arena
  7. Model-based Approaches
    1. G-Eval
    2. SelfCheckGPT
    3. JudgeLM → train a model to evaluate another model's outputs
    4. Ragas
  8. Red Teaming
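As a rough illustration of the similarity approach, two of the metrics listed above (edit distance and cosine similarity) can be computed in a few lines of plain Python. This sketch uses naive whitespace tokenization, which also shows why these scores depend on the tokenizer:

```python
from collections import Counter
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts (whitespace tokens)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Both functions ignore meaning entirely, which is exactly the con noted above: a paraphrase with different words scores poorly even when it is a correct answer.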
 
Consistent output can be ensured with:
  1. OpenAI function calling
  2. Microsoft guardrailing
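The common idea behind both tools is to constrain the model's reply to a machine-checkable schema. A minimal sketch of that pattern, using a simplified, hypothetical `check_schema` helper rather than either library's real API:

```python
import json


def check_schema(raw: str, schema: dict) -> dict:
    """Parse a model's JSON reply and verify required fields and types.

    `schema` maps field name -> expected Python type. This is a hypothetical,
    simplified stand-in for what function calling / guardrail libraries do.
    """
    data = json.loads(raw)  # raises ValueError on malformed JSON
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"field {field!r} is not {expected.__name__}")
    return data


# Example: a sentiment-extraction schema the model reply must satisfy.
reply = '{"label": "positive", "confidence": 0.92}'
parsed = check_schema(reply, {"label": str, "confidence": float})
```

Replies that fail the check can be retried or rejected, which is what makes the output consistent enough to evaluate automatically.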
 
Evaluation Components
Retrieval:
  • Low Precision: Not all chunks in the retrieved set are relevant.
  • Low Recall: Not all relevant chunks are retrieved.
    • Were they in the proper order?
    • Were they outdated?
    • What was the latency?
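Given the IDs of the retrieved chunks and a labeled set of relevant chunks, precision and recall can be computed directly. A minimal sketch (the function name is illustrative, not from any library):

```python
def retrieval_metrics(retrieved: list, relevant: set) -> tuple:
    """Precision: fraction of retrieved chunks that are relevant.
    Recall: fraction of relevant chunks that were retrieved."""
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Retrieved four chunks, but only two of the three relevant ones.
p, r = retrieval_metrics(["a", "b", "c", "d"], {"a", "b", "e"})
```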
Augmentation:
  • How can we ensure the answers are factually correct?
  • How can we measure whether the answers are understandable?
  • Toxicity/bias issues
  • How can we measure latency?
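One crude way to probe factual grounding is to check how much of the answer is lexically supported by the retrieved context. This is only a naive heuristic (tools like Ragas use model-based checks instead), and the helper name is hypothetical:

```python
def support_score(answer: str, context: str) -> float:
    """Naive grounding heuristic: fraction of answer tokens that also
    appear in the retrieved context (ignores word order and meaning)."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A low score flags answers that introduce material not present in the retrieved chunks, which is a cheap first filter before a more expensive model-based factuality check.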
 

⚠️Disclaimer: All the screenshots, materials, and other media documents used in this article are copyrighted to the original platform or authors.