Masked Language Model Scoring
Julian Salazar | Davis Liang | Toan Q. Nguyen | Katrin Kirchhoff