Language Model Evaluation Beyond Perplexity

We propose an alternate approach to quantifying how well language models learn natural language: we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a framework, paired with significance tests, for evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but they align much more closely with empirical trends than with proposed theoretical distributions (when present). Further, the fit to different distributions is highly dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the type–token relationship of natural language than text produced using standard ancestral sampling, and text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.
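To make the kind of comparison described above concrete, the sketch below checks one statistical tendency, the sentence-length distribution, of model-generated text against human-written text using a two-sample Kolmogorov–Smirnov test. This is a minimal illustration, not the paper's released framework: the toy corpora, the helper function, and the use of scipy are assumptions made for the example.

# Minimal sketch (assumed setup, not the authors' code): test whether the
# sentence-length distribution of generated text matches that of human text.
from scipy.stats import ks_2samp

def sentence_lengths(sentences):
    # Length of each sentence, measured in whitespace-separated tokens.
    return [len(s.split()) for s in sentences]

# Hypothetical placeholder corpora; in practice these would be the training
# text and a large sample decoded from the language model.
human_corpus = ["the cat sat on the mat .", "language models are trained on human text ."]
model_corpus = ["the cat sat .", "models generate text from a learned distribution ."]

# Two-sample KS test: a small p-value suggests the generated text does not
# match the empirical length distribution of the human-written text.
stat, p_value = ks_2samp(sentence_lengths(human_corpus), sentence_lengths(model_corpus))
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")

The same pattern extends to the other tendencies mentioned in the abstract (stopword and symbol frequencies, the type–token relationship) by swapping in the corresponding per-document statistic.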
