Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

We conduct a large-scale, systematic study of existing evaluation methods for natural language generation in the context of generating online product reviews. We compare human evaluators against a variety of automated evaluation procedures, including discriminative evaluators that measure how well machine-generated text can be distinguished from human-written text, and word-overlap metrics that measure how closely the generated text matches human-written references. We determine to what extent these different evaluators agree on the ranking of a dozen state-of-the-art generators for online product reviews. We find that human evaluators do not correlate well with discriminative evaluators, raising the broader question of whether adversarial accuracy is the right objective for natural language generation. Distinguishing machine-generated text proves challenging even for human evaluators, and human judgments correlate better with word-overlap metrics. We also find lexical diversity to be an informative signal, indicative of how the different evaluators assess a generator's output. A post-experiment survey of participants offers insights into how to evaluate and improve the quality of natural language generation systems.
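Two of the quantities the abstract refers to, lexical diversity and rank agreement between evaluators, are easy to make concrete. The following minimal Python sketch is illustrative only, not the paper's actual pipeline: the distinct-n diversity measure and the Spearman rank correlation are standard, but the generator scores shown are hypothetical placeholders.

    # Illustrative sketch: distinct-n lexical diversity and evaluator rank agreement.
    from collections import Counter
    from scipy.stats import spearmanr

    def distinct_n(texts, n=2):
        """Lexical diversity: unique n-grams divided by total n-grams."""
        ngrams = Counter()
        for text in texts:
            tokens = text.split()
            ngrams.update(zip(*(tokens[i:] for i in range(n))))
        total = sum(ngrams.values())
        return len(ngrams) / total if total else 0.0

    # Hypothetical per-generator scores from two evaluators over the same
    # five generators (e.g., human preference rate vs. a word-overlap metric).
    human_scores   = [0.61, 0.47, 0.72, 0.35, 0.58]
    overlap_scores = [0.55, 0.57, 0.68, 0.30, 0.52]

    # Rank agreement between the two evaluators on how they order the generators.
    rho, p = spearmanr(human_scores, overlap_scores)
    print(f"distinct-2 = {distinct_n(['the food was great', 'the food was ok']):.3f}")
    print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")

A high Spearman rho would mean the two evaluators largely agree on which generators are best, which is the kind of agreement the study measures across human judges, discriminative evaluators, and word-overlap metrics.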
