Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, which is the same level of agreement as between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show that our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. We will publicly release MT-bench questions, 3K expert votes, and 30K conversations with human preferences from Chatbot Arena.
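The core recipe behind LLM-as-a-judge is a pairwise comparison scored by a strong judge model, with the two answers presented in both orders to control for position bias, and the resulting verdicts compared against human votes to measure agreement. The sketch below illustrates that loop under stated assumptions: the prompt wording, the `call_llm` stub, and the verdict labels are illustrative placeholders, not the exact MT-bench judge templates.

```python
# Minimal sketch of pairwise LLM-as-a-judge with position swapping.
# The template text and call_llm() are assumptions for illustration;
# swap in a real judge model (e.g. GPT-4) via your API client of choice.

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the responses of two AI "
    "assistants to the user question below. Output '[[A]]' if Assistant A is "
    "better, '[[B]]' if Assistant B is better, or '[[C]]' for a tie.\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant A]\n{answer_a}\n\n"
    "[Assistant B]\n{answer_b}\n"
)


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a strong judge model; always returns a tie."""
    return "[[C]]"


def parse_verdict(text: str) -> str:
    """Extract 'A', 'B', or 'C' from the judge's output; default to a tie."""
    for label in ("[[A]]", "[[B]]", "[[C]]"):
        if label in text:
            return label[2]
    return "C"


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Judge a pair twice with the answers swapped to reduce position bias.

    A win only counts if it is consistent across both orderings;
    inconsistent verdicts are recorded as a tie.
    """
    v1 = parse_verdict(call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b)))
    v2 = parse_verdict(call_llm(JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_b, answer_b=answer_a)))
    swapped = {"A": "B", "B": "A", "C": "C"}[v2]
    return v1 if v1 == swapped else "C"


def agreement(llm_votes: list[str], human_votes: list[str]) -> float:
    """Fraction of comparisons on which the LLM judge and humans agree."""
    matches = sum(l == h for l, h in zip(llm_votes, human_votes))
    return matches / max(len(llm_votes), 1)
```

Requiring the verdict to survive the position swap addresses position bias only; verbosity and self-enhancement biases need separate checks, such as comparing against reference answers or using a different judge model than the one being evaluated.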
