Zero-shot NLG Evaluation through Pairwise Comparisons with LLMs

Evaluating Natural Language Generation (NLG) outputs is crucial but laborious and expensive. While various automatic NLG assessment methods have been proposed, they are often task-specific and have to be engineered with a particular domain and attribute in mind. In this work, we propose a robust zero-shot approach to NLG evaluation using pairwise comparative judgment with open-source Large Language Models (LLMs). The motivation for this approach is that, even for humans, it is easier to determine which of two options is better than it is to score each option independently and objectively. Building on this insight, we leverage the emergent abilities of LLMs and probe FlanT5 to determine which of two candidate responses is better, rather than assigning absolute scores. Our results demonstrate that comparative assessment is more effective than absolute scoring, enabling smaller open-source LLMs to achieve performance comparable to larger public-access APIs. We evaluate on both summary evaluation and dialogue response generation, and show that open-source LLMs can yield good correlations with human scores across a range of attributes.
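
The sketch below illustrates the comparative setup described in the abstract: a minimal zero-shot pairwise judgment with FlanT5 via HuggingFace Transformers. The prompt wording, the "coherence" attribute, and the model size are illustrative assumptions rather than the paper's exact configuration; the idea is simply to ask the model which of two candidates is better and read off the relative probability it assigns to the answer tokens "A" and "B".

```python
# Minimal sketch of zero-shot pairwise comparative assessment with FlanT5.
# NOTE: the prompt template, attribute, and model checkpoint are assumptions
# for illustration, not the exact setup used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")
model.eval()

def compare(context: str, candidate_a: str, candidate_b: str) -> str:
    """Return 'A' or 'B' depending on which candidate the model judges better."""
    prompt = (
        f"Passage: {context}\n\n"
        f"Summary A: {candidate_a}\n\n"
        f"Summary B: {candidate_b}\n\n"
        "Which summary is more coherent, Summary A or Summary B? Answer A or B."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    # Score the one-token continuations "A" and "B" at the first decoder step
    # and pick whichever the model assigns higher probability to.
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    id_a = tokenizer("A", add_special_tokens=False).input_ids[0]
    id_b = tokenizer("B", add_special_tokens=False).input_ids[0]
    return "A" if logits[id_a] > logits[id_b] else "B"
```

In practice one would typically score both orderings of each candidate pair and combine the two decisions, since LLM comparative judgments can be sensitive to the position in which the options are presented.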
