The statistical advantage of automatic NLG metrics at the system level

Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans at estimating system-level quality. Statistically, humans are unbiased, high-variance estimators, while metrics are biased, low-variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human-predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite-test-set setting. Our analysis compares the adjusted error of metrics to that of humans and of a derived, perfect segment-level annotator, both of which are unbiased estimators whose error depends on the number of judgments collected. In MT, we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between the compared systems is small.

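A minimal sketch of the pairwise-comparison setup described above, assuming per-segment metric scores and human judgments for two systems are already in hand: it bootstraps the test set and counts how often the metric's system-level verdict disagrees with the human verdict. The function name, inputs, and simple mean aggregation are illustrative assumptions, not the paper's implementation; the paper's analysis additionally corrects this raw error with a bias-variance-noise decomposition (for squared loss, E[(ŷ − y)²] = bias² + variance + noise; the form used for pairwise 0-1 error differs) so that it reflects a noise-free, infinite-test-set setting.

```python
import numpy as np

def bootstrap_pairwise_error(metric_a, metric_b, human_a, human_b,
                             n_boot=1000, seed=0):
    """Fraction of bootstrap test sets on which the metric's verdict
    ("system A beats system B") disagrees with the human verdict.

    All four inputs are per-segment scores over the same test segments
    (hypothetical arrays for illustration).
    """
    metric_a, metric_b = np.asarray(metric_a, float), np.asarray(metric_b, float)
    human_a, human_b = np.asarray(human_a, float), np.asarray(human_b, float)
    rng = np.random.default_rng(seed)
    n = len(metric_a)
    disagree = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample segments with replacement
        metric_prefers_a = metric_a[idx].mean() > metric_b[idx].mean()
        human_prefers_a = human_a[idx].mean() > human_b[idx].mean()
        disagree += int(metric_prefers_a != human_prefers_a)
    return disagree / n_boot
```

Under this sketch, the error shrinks as more human judgments are averaged per segment (lower human noise) and grows when the two systems' mean quality is close, which is where the abstract argues low-variance metrics gain their edge.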