GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, has so far been limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard that brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms, asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency), and compares these judgments against various automatic metrics. We introduce several English datasets to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as in their automatic and manual evaluation.
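To make the workflow concrete, below is a minimal, hypothetical Python sketch of the kind of data flow the abstract describes: a submission's generated texts receive per-axis human ratings, which are aggregated and reported alongside a simple automatic score. All class and function names here (Submission, HumanRatings, token_overlap, leaderboard_entry) are illustrative assumptions rather than part of GENIE's actual API, and the toy unigram-overlap metric merely stands in for metrics such as BLEU or ROUGE.

# Hypothetical sketch of a human-in-the-loop leaderboard entry; names are
# illustrative only and do not come from the GENIE codebase.
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class Submission:
    model_name: str
    task: str                 # e.g. "summarization" or "machine translation"
    predictions: List[str]    # one generated text per test instance
    references: List[str]     # gold references used by the automatic metric


@dataclass
class HumanRatings:
    # Per-instance scores in [0, 1] for each axis a crowdworker rates,
    # e.g. "correctness", "conciseness", "fluency".
    scores: Dict[str, List[float]] = field(default_factory=dict)

    def aggregate(self) -> Dict[str, float]:
        # Average the per-instance ratings for each axis.
        return {axis: mean(values) for axis, values in self.scores.items()}


def token_overlap(prediction: str, reference: str) -> float:
    # A toy automatic metric (unigram overlap), standing in for BLEU, ROUGE, etc.
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(pred_tokens & ref_tokens) / max(len(ref_tokens), 1)


def leaderboard_entry(submission: Submission, ratings: HumanRatings) -> Dict[str, float]:
    # Combine aggregated human judgments with an automatic score for one submission.
    entry = ratings.aggregate()
    entry["automatic"] = mean(
        token_overlap(p, r)
        for p, r in zip(submission.predictions, submission.references)
    )
    return entry


if __name__ == "__main__":
    submission = Submission(
        model_name="baseline-summarizer",
        task="summarization",
        predictions=["the cat sat on the mat"],
        references=["a cat was sitting on the mat"],
    )
    ratings = HumanRatings(
        scores={"correctness": [0.9], "conciseness": [1.0], "fluency": [0.8]}
    )
    print(leaderboard_entry(submission, ratings))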
