G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using large language models (LLMs) as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still show lower agreement with human judgments than medium-sized neural evaluators. In this work, we present G-Eval, a framework for using large language models with chain-of-thought (CoT) reasoning and a form-filling paradigm to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin. We also present a preliminary analysis of the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators being biased towards LLM-generated texts. The code is available at https://github.com/nlpyang/geval.
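
To make the framework concrete, below is a minimal sketch of a reference-free, CoT-plus-form-filling evaluator for summary coherence. The prompt wording, the coherence criterion, the `evaluate_coherence` helper, and the use of the `openai` Python client (v1.x) with a `gpt-4` model are illustrative assumptions for this sketch, not the authors' released prompts or code; in the full framework the evaluation steps can themselves be generated by the LLM from the task criteria. See the linked repository for the actual implementation.

```python
"""Minimal sketch of an LLM-based, reference-free evaluator in the spirit of G-Eval:
a chain-of-thought style prompt followed by a form-filling scoring step.
The prompt text and helper names below are illustrative, not the paper's exact ones."""
import re
from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

client = OpenAI()

# Form-filling style prompt: the model is guided through evaluation steps
# (chain of thought) and then asked to fill in a single numeric rating field.
PROMPT_TEMPLATE = """You will be given one summary written for a news article.

Your task is to rate the summary on one metric.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences in the summary.

Evaluation Steps:
1. Read the news article carefully and identify its main topic and key points.
2. Read the summary and compare it to the article, checking that it is well
   structured and logically organized.
3. Assign a coherence score on a scale of 1 to 5.

Source Article:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence:"""


def evaluate_coherence(document: str, summary: str, model: str = "gpt-4") -> float:
    """Return a 1-5 coherence rating parsed from the model's form-filling output."""
    prompt = PROMPT_TEMPLATE.format(document=document, summary=summary)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = response.choices[0].message.content
    match = re.search(r"\d+(\.\d+)?", text)  # grab the first number in the reply
    if match is None:
        raise ValueError(f"Could not parse a score from: {text!r}")
    return float(match.group())
```

As a usage note, calling `evaluate_coherence(article_text, summary_text)` yields a single rating for one criterion; the same pattern can be repeated with different criteria (e.g. consistency, fluency, relevance) to obtain a multi-dimensional, reference-free assessment.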
