Masked Language Model Scoring

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rerank translations in multiple languages. We release our library for language model scoring.
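
For concreteness, the PLL of a sentence W = (w_1, ..., w_N) is the sum over positions t of log P_MLM(w_t | W with w_t masked), and the pseudo-perplexity is PPPL = exp(-PLL / N). The snippet below is a minimal sketch of this one-token-at-a-time masking procedure using the HuggingFace Transformers API; the RoBERTa checkpoint name and the helper functions are illustrative and are not the released scoring library.

    import math
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")
    model.eval()

    def pseudo_log_likelihood(sentence):
        # Tokenize once; special tokens (<s>, </s>) occupy the first and last slots.
        input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        total = 0.0
        with torch.no_grad():
            for t in range(1, input_ids.size(0) - 1):
                masked = input_ids.clone()
                masked[t] = tokenizer.mask_token_id        # mask one token at a time
                logits = model(masked.unsqueeze(0)).logits[0, t]
                log_probs = torch.log_softmax(logits, dim=-1)
                total += log_probs[input_ids[t]].item()    # log P(w_t | rest of sentence)
        return total

    def pseudo_perplexity(sentence):
        # PPPL = exp(-PLL / N), with N the number of scored (non-special) tokens.
        n = tokenizer(sentence, return_tensors="pt")["input_ids"].size(1) - 2
        return math.exp(-pseudo_log_likelihood(sentence) / n)

    # Toy reranking example: pick the hypothesis with the highest PLL.
    hypotheses = ["the cat sat on the mat .", "the cat sat in the mat ."]
    best = max(hypotheses, key=pseudo_log_likelihood)

In practice the masked copies of a sentence are usually batched into a single forward pass, and when rescoring N-best ASR or NMT hypotheses the PLL is typically interpolated log-linearly with the sequence model's own score.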
