Are Some Words Worth More than Others?

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy captures only one aspect of a language model's behavior and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of prediction attempts will involve frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull text from a downstream consumer of the model. To address this, we propose two new intrinsic evaluation measures, framed within a simple word prediction task, that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
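
The abstract does not specify the two proposed measures, but the Zipfian confound it describes is easy to make concrete: perplexity is the exponentiated mean per-token negative log-likelihood, so it, like aggregate token accuracy, weights every type by its corpus frequency. The sketch below is a minimal illustration under assumed conditions, not the paper's method: a toy Zipfian corpus and a hypothetical predictor that is strong on frequent types and weak on rare ones still posts a respectable overall accuracy, which frequency-stratified reporting immediately exposes. All names (`predict`, the bucket threshold of 100) are illustrative assumptions.

```python
import random
from collections import Counter

random.seed(0)

# Toy Zipfian corpus: type of rank r occurs with probability proportional to 1/(r+1).
V = 1000
weights = [1.0 / (rank + 1) for rank in range(V)]
corpus = random.choices(range(V), weights=weights, k=100_000)

# Hypothetical model: near-perfect on the 50 most frequent types, near-chance elsewhere.
def predict(gold: int) -> int:
    p_correct = 0.95 if gold < 50 else 0.10
    return gold if random.random() < p_correct else (gold + 1) % V

counts = Counter(corpus)
attempts = Counter()  # prediction attempts per type
correct = Counter()   # correct predictions per type
for gold in corpus:
    attempts[gold] += 1
    if predict(gold) == gold:
        correct[gold] += 1

# Aggregate accuracy is dominated by the head of the distribution...
overall = sum(correct.values()) / sum(attempts.values())

# ...so also report accuracy stratified by a frequency bucket.
def bucket_accuracy(types):
    a = sum(attempts[t] for t in types)
    c = sum(correct[t] for t in types)
    return c / a if a else float("nan")

head = [t for t in counts if counts[t] >= 100]
tail = [t for t in counts if counts[t] < 100]
print(f"overall accuracy:   {overall:.3f}")          # ~0.61: looks passable
print(f"head (freq >= 100): {bucket_accuracy(head):.3f}")  # ~0.80
print(f"tail (freq  < 100): {bucket_accuracy(tail):.3f}")  # ~0.10: the hidden failure
```

The same stratification applies unchanged to a real language model: replace the toy `predict` with the model's top-1 next-token prediction and bucket types by their training-corpus frequency.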
