LLMs for Evaluating LLMs

arthur-ai · Updated Jan 24, 2024

Key Takeaways

Pre-LLM Evaluation

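  • Word presence → BLEU (n-gram overlap between the output and a reference text)
  • Pre-trained NLP model → BERTScore (similarity of contextual embeddings from a pre-trained BERT model)
  • Perturbation test → What changes in the output when we swap words in the input?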
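To make the "word presence" idea concrete, here is a minimal sketch of clipped n-gram precision, the core ingredient of BLEU. It is not the full BLEU metric: it omits the brevity penalty and the geometric mean over several n-gram orders. The function names and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also appear in the reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference ("clipping"), so repeating a matching word does not
    inflate the score. Tokenization is naive lowercase whitespace
    splitting (an assumption, not part of BLEU itself).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


if __name__ == "__main__":
    print(clipped_precision("the cat sat", "the cat sat on the mat"))
    print(clipped_precision("the the the", "the cat"))
```

A score like this only rewards surface overlap: a correct paraphrase that uses different words scores low, which is one reason embedding-based metrics such as BERTScore were introduced.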

LLM Evaluation

  • LLM outputs can be sensitive to a single-word change in the prompt
  • LLMs often make errors when multiplying larger numbers
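Single-word sensitivity can be probed with the same perturbation idea listed under the pre-LLM methods. The sketch below is one possible way to do it, not a standard API: `single_word_perturbations` and `changed_outputs` are hypothetical helpers, and `model` stands for any callable that maps a prompt string to an output string (in practice a wrapper around an LLM call).

```python
from typing import Callable, Dict, List, Tuple


def single_word_perturbations(prompt: str, substitutions: Dict[str, str]) -> List[str]:
    """Return variants of the prompt with exactly one word swapped in each."""
    words = prompt.split()
    variants = []
    for i, word in enumerate(words):
        if word in substitutions:
            swapped = words[:i] + [substitutions[word]] + words[i + 1:]
            variants.append(" ".join(swapped))
    return variants


def changed_outputs(
    model: Callable[[str], str],
    prompt: str,
    substitutions: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Return (variant, output) pairs whose output differs from the baseline."""
    baseline = model(prompt)
    changed = []
    for variant in single_word_perturbations(prompt, substitutions):
        output = model(variant)
        if output != baseline:
            changed.append((variant, output))
    return changed


if __name__ == "__main__":
    # Toy stand-in for an LLM (assumption): keys on one exact word.
    toy_model = lambda p: "positive" if "good" in p else "negative"
    print(changed_outputs(
        toy_model,
        "the movie was good",
        {"good": "great", "movie": "film"},
    ))
```

With a real model, outputs are rarely identical strings, so the exact-match comparison would likely need to be replaced by a similarity threshold or a label comparison.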


How do we evaluate subjective tasks?
  • Collect as many examples of good and bad output for the task as possible.
  • Identify the attributes that the good examples share and the attributes that the bad examples share.
  • Provide these attributes, together with the examples, to an LLM as a few-shot prompt.
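The steps above could be assembled into a prompt roughly as follows. This is a sketch under stated assumptions: `build_few_shot_prompt`, the "Output:/Verdict:" layout, and the example data are all hypothetical, and the resulting string would still need to be sent to whichever LLM acts as the evaluator.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    task: str,
    attributes: List[str],
    examples: List[Tuple[str, str]],
    candidate: str,
) -> str:
    """Assemble a few-shot evaluation prompt.

    examples holds (output, verdict) pairs, e.g. ("...", "good").
    The prompt ends with an open "Verdict:" line for the LLM to complete.
    """
    lines = [f"Task: {task}", "", "Judge the output using these attributes:"]
    lines += [f"- {attribute}" for attribute in attributes]
    lines.append("")
    for output, verdict in examples:
        lines += [f"Output: {output}", f"Verdict: {verdict}", ""]
    lines += [f"Output: {candidate}", "Verdict:"]
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        task="Summarize a support ticket in one sentence.",
        attributes=["Mentions the customer's actual problem", "Is one sentence"],
        examples=[
            ("Customer cannot reset their password.", "good"),
            ("There was a ticket. It was long. Many things happened.", "bad"),
        ],
        candidate="Customer reports duplicate charges on their invoice.",
    )
    print(prompt)
```

Keeping the attributes as an explicit list, separate from the labeled examples, makes it easier to revise the criteria as more good and bad outputs are collected.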

