Are Some Words Worth More than Others?

Current evaluation metrics for language modeling and generation rely heavily on the accuracy of predicted (or generated) words as compared to a reference ground truth. While important, token-level accuracy captures only one aspect of a language model's behavior and ignores linguistic properties of words that may allow some mis-predicted tokens to be useful in practice. Furthermore, statistics directly tied to prediction accuracy (including perplexity) may be confounded by the Zipfian nature of written language, as the majority of prediction attempts will involve frequently-occurring types. A model's performance may vary greatly between high- and low-frequency words, which in practice could lead to failure modes such as repetitive and dull text from a downstream consumer of the model. To address this, we propose two new intrinsic evaluation measures, framed within a simple word prediction task, that are designed to give a more holistic picture of a language model's performance. We evaluate several commonly-used large English language models using our proposed metrics, and demonstrate that our approach reveals functional differences in performance between the models that are obscured by more traditional metrics.
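
The abstract does not specify the two proposed measures, but the Zipfian confound it describes is easy to make concrete: perplexity is the exponentiated mean per-token negative log-likelihood, so it, like aggregate token accuracy, weights every type by its corpus frequency. The sketch below is a minimal illustration under assumed conditions, not the paper's method: a toy Zipfian corpus and a hypothetical predictor that is strong on frequent types and weak on rare ones still posts a respectable overall accuracy, which frequency-stratified reporting immediately exposes. All names (`predict`, the bucket threshold of 100) are illustrative assumptions.

```python
import random
from collections import Counter

random.seed(0)

# Toy Zipfian corpus: type of rank r occurs with probability proportional to 1/(r+1).
V = 1000
weights = [1.0 / (rank + 1) for rank in range(V)]
corpus = random.choices(range(V), weights=weights, k=100_000)

# Hypothetical model: near-perfect on the 50 most frequent types, near-chance elsewhere.
def predict(gold: int) -> int:
    p_correct = 0.95 if gold < 50 else 0.10
    return gold if random.random() < p_correct else (gold + 1) % V

counts = Counter(corpus)
attempts = Counter()  # prediction attempts per type
correct = Counter()   # correct predictions per type
for gold in corpus:
    attempts[gold] += 1
    if predict(gold) == gold:
        correct[gold] += 1

# Aggregate accuracy is dominated by the head of the distribution...
overall = sum(correct.values()) / sum(attempts.values())

# ...so also report accuracy stratified by a frequency bucket.
def bucket_accuracy(types):
    a = sum(attempts[t] for t in types)
    c = sum(correct[t] for t in types)
    return c / a if a else float("nan")

head = [t for t in counts if counts[t] >= 100]
tail = [t for t in counts if counts[t] < 100]
print(f"overall accuracy:   {overall:.3f}")          # ~0.61: looks passable
print(f"head (freq >= 100): {bucket_accuracy(head):.3f}")  # ~0.80
print(f"tail (freq  < 100): {bucket_accuracy(tail):.3f}")  # ~0.10: the hidden failure
```

The same stratification applies unchanged to a real language model: replace the toy `predict` with the model's top-1 next-token prediction and bucket types by their training-corpus frequency.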
