Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand

Natural language processing researchers have identified limitations in evaluation methodology for generation tasks, raising new questions about the validity of both automatic metrics and crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances in models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and in the metrics used to evaluate them. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked by their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine-generated text relative to human-written text, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans).
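
The abstract describes two computations a Billboard performs: ranking candidate metrics by their correlation with human judgments, and fitting a sparse linear ensemble of metric scores against those judgments. Below is a minimal sketch of both steps on synthetic data. The variable names, the choice of Kendall's tau as the correlation measure, and the use of lasso regression for sparse selection are illustrative assumptions here, not the authors' released implementation.

import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Hypothetical data: 5 candidate metrics scored on 200 generator outputs,
# plus a human judgment for each of the same outputs.
n_outputs, n_metrics = 200, 5
metric_scores = rng.normal(size=(n_outputs, n_metrics))
human_scores = metric_scores @ rng.normal(size=n_metrics) + 0.1 * rng.normal(size=n_outputs)

# Step 1: rank each metric by its correlation with human judgments
# (Kendall's tau assumed here).
taus = [kendalltau(metric_scores[:, j], human_scores)[0] for j in range(n_metrics)]
ranking = np.argsort(taus)[::-1]
print("metric ranking (best first):", ranking.tolist())

# Step 2: build the ensemble metric. Lasso regression onto human
# judgments selects a sparse subset of metrics and assigns them
# linear weights in one global fit across all generators' outputs.
ensemble = LassoCV(cv=5).fit(metric_scores, human_scores)
selected = np.flatnonzero(ensemble.coef_)
print("selected metrics:", selected.tolist(), "weights:", ensemble.coef_[selected])

# A new generator is then scored by averaging the ensemble's predicted
# human judgment over that generator's outputs (hypothetical scores below).
new_outputs = rng.normal(size=(50, n_metrics))
print("ensemble score:", ensemble.predict(new_outputs).mean())

The design choice worth noting is sparsity: because the ensemble keeps only a few metrics, submitters need to run only those metrics at evaluation time, and the combination remains interpretable as a weighted sum.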
