Large Language Models are not Fair Evaluators

We uncover a systematic bias in the evaluation paradigm of adopting large language models~(LLMs), e.g., GPT-4, as a referee to score the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna could beat ChatGPT on 66 out of 80 tested queries. To address this issue, we propose two simple yet effective calibration strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple detailed pieces of evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine the final score. Extensive experiments demonstrate that our approach successfully mitigates evaluation bias, resulting in closer alignment with human judgments. To facilitate future research on more robust large language model comparison, we integrate the techniques in this paper into an easy-to-use toolkit, \emph{FairEval}, along with the human annotations.\footnote{\url{https://github.com/i-Eval/FairEval}}
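As a rough illustration of the second strategy, the sketch below shows how Balanced Position Calibration could be implemented: each pair of responses is judged under both presentation orders, and each response's scores are averaged across the two orders so that position-induced bias cancels out. The \texttt{judge} callable and its return convention are hypothetical placeholders for this sketch, not FairEval's actual API.

\begin{verbatim}
def balanced_position_calibration(question, response_1, response_2, judge):
    """Balanced Position Calibration sketch.

    `judge(question, first_answer, second_answer)` is assumed to return a
    tuple (score_for_first_answer, score_for_second_answer) for one
    presentation order; this interface is hypothetical.
    """
    # Score once with response_1 shown first ...
    s1_first, s2_second = judge(question, response_1, response_2)
    # ... and once with the order of the two responses swapped.
    s2_first, s1_second = judge(question, response_2, response_1)

    # Average each response's scores over the two orders, so that any
    # advantage tied to a particular position is shared by both responses.
    score_1 = (s1_first + s1_second) / 2
    score_2 = (s2_first + s2_second) / 2

    if score_1 > score_2:
        return "response_1 wins"
    if score_2 > score_1:
        return "response_2 wins"
    return "tie"
\end{verbatim}

Multiple Evidence Calibration, by contrast, is a prompting change: the evaluator is asked to write out several pieces of supporting evidence before it emits any scores, and the sketch above can simply wrap such a prompt inside \texttt{judge}.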
