MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers

As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We introduce MAUVE, a comparison measure for open-ended text generation, which directly compares the learnt distribution from a text generation model to the distribution of human-written text using divergence frontiers. MAUVE scales up to modern text generation models by computing information divergences in a quantized embedding space. Through an extensive empirical study on three open-ended generation tasks, we find that MAUVE identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
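To make the construction concrete, the following is a minimal sketch (in Python) of the quantize-and-compare idea described above: embed samples of model text and human text, jointly quantize the embeddings with k-means, and take the area under the divergence frontier traced out by mixtures of the two resulting histograms. The function name mauve_sketch, the scaling constant c, the cluster count, and the assumption of pre-computed embeddings are illustrative choices for exposition, not the paper's exact settings.

    # A minimal sketch of a MAUVE-style score over pre-computed text embeddings.
    # Assumptions (not from the paper): n_clusters=100, scaling constant c=5.0,
    # and a simple trapezoidal area under the frontier.
    import numpy as np
    from sklearn.cluster import KMeans

    def mauve_sketch(p_embeds, q_embeds, n_clusters=100, c=5.0, n_lambdas=50):
        # 1. Joint quantization: cluster the pooled embeddings with k-means.
        all_embeds = np.vstack([p_embeds, q_embeds])
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(all_embeds)
        p_labels, q_labels = labels[:len(p_embeds)], labels[len(p_embeds):]

        # 2. Histograms over the discrete clusters (small floor avoids log 0).
        eps = 1e-10
        p = np.bincount(p_labels, minlength=n_clusters) + eps
        q = np.bincount(q_labels, minlength=n_clusters) + eps
        p, q = p / p.sum(), q / q.sum()

        def kl(a, b):
            # KL divergence between two discrete distributions.
            return float(np.sum(a * np.log(a / b)))

        # 3. Divergence frontier: for each mixture weight, measure how far
        #    P and Q each are from the mixture, mapped through exp(-c * KL).
        xs, ys = [], []
        for lam in np.linspace(eps, 1 - eps, n_lambdas):
            r = lam * p + (1 - lam) * q
            xs.append(np.exp(-c * kl(q, r)))
            ys.append(np.exp(-c * kl(p, r)))

        # 4. MAUVE-style summary: area under the frontier curve.
        order = np.argsort(xs)
        xs, ys = np.array(xs)[order], np.array(ys)[order]
        return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))

A score near 1 indicates the two histograms (and hence, under this quantization, the two text distributions) are close; a score near 0 indicates a large gap.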
