Nonparametric Masked Language Modeling
[1] Yan Wang, et al. Copy is All You Need, 2023, ICLR.
[2] Colin Raffel, et al. Large Language Models Struggle to Learn Long-Tail Knowledge, 2022, ICML.
[3] Alexander M. Rush, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, 2022, ArXiv.
[4] Mohammad Sadegh Rasooli, et al. Bidirectional Language Models Are Also Few-shot Learners, 2022, ICLR.
[5] Edouard Grave, et al. PEER: A Collaborative Language Model, 2022, ICLR.
[6] Jane A. Yu, et al. Few-shot Learning with Retrieval Augmented Language Models, 2022, J. Mach. Learn. Res.
[7] Danqi Chen, et al. Training Language Models with Memory Augmentation, 2022, EMNLP.
[8] Naman Goyal, et al. On the Role of Bidirectionality in Language Model Pre-Training, 2022, EMNLP.
[9] Xi Victoria Lin, et al. OPT: Open Pre-trained Transformer Language Models, 2022, ArXiv.
[10] Minjoon Seo, et al. TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models, 2022, EMNLP.
[11] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[12] D. Klein, et al. Describing Differences between Text Distributions with Natural Language, 2022, ICML.
[13] Rajarshi Bhowmik, et al. Learning Rich Representation of Keyphrases from Text, 2021, NAACL.
[14] Diego de Las Casas, et al. Improving language models by retrieving from trillions of tokens, 2021, ICML.
[15] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, ArXiv.
[16] Zaiqiao Meng, et al. TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning, 2021, NAACL.
[17] Frank F. Xu, et al. Capturing Structural Locality in Non-parametric Language Models, 2021, ICLR.
[18] Alyssa Lees, et al. ReasonBERT: Pre-trained to Reason with Distant Supervision, 2021, EMNLP.
[19] Taylor Berg-Kirkpatrick, et al. Efficient Nearest Neighbor Language Models, 2021, EMNLP.
[20] Jimmy J. Lin, et al. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations, 2021, SIGIR.
[21] William W. Cohen, et al. Time-Aware Language Models as Temporal Knowledge Bases, 2021, TACL.
[22] Ikuya Yamada, et al. Efficient Passage Retrieval with Hashing for Open-domain Question Answering, 2021, ACL.
[23] Luke Zettlemoyer, et al. Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right, 2021, EMNLP.
[24] Zexuan Zhong, et al. Factual Probing Is [MASK]: Learning vs. Learning to Recall, 2021, NAACL.
[25] D. Klein, et al. Calibrate Before Use: Improving Few-Shot Performance of Language Models, 2021, ICML.
[26] Dani Yogatama, et al. Adaptive Semiparametric Language Models, 2021, TACL.
[27] Omer Levy, et al. Few-Shot Question Answering by Pretraining Span Selection, 2021, ACL.
[28] Danqi Chen, et al. Making Pre-trained Language Models Better Few-shot Learners, 2021, ACL.
[29] Nicola De Cao, et al. A Memory Efficient Baseline for Open Domain Question Answering, 2020, ArXiv.
[30] Danqi Chen, et al. Learning Dense Representations of Phrases at Scale, 2020, ACL.
[31] Colin Raffel, et al. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer, 2020, NAACL.
[32] Kenton Lee, et al. XOR QA: Cross-lingual Open-Retrieval Question Answering, 2020, NAACL.
[33] J. Shane Culpepper, et al. CC-News-En: A Large English News Corpus, 2020, CIKM.
[34] Noah A. Smith, et al. Grounded Compositional Outputs for Adaptive Language Modeling, 2020, EMNLP.
[35] Fabio Petroni, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020, NeurIPS.
[36] Hinrich Schütze, et al. BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA, 2020, Findings of EMNLP.
[37] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.
[38] Eunsol Choi, et al. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020, TACL.
[39] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[40] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[41] Luke Zettlemoyer, et al. Zero-shot Entity Linking with Dense Entity Retrieval, 2019, ArXiv.
[42] Ulli Waltinger, et al. BERT is Not a Knowledge Base (Yet): Factual Knowledge vs. Name-Based Reasoning in Unsupervised QA, 2019, ArXiv.
[43] Myle Ott, et al. Unsupervised Cross-lingual Representation Learning at Scale, 2019, ACL.
[44] Omer Levy, et al. Generalization through Memorization: Nearest Neighbor Language Models, 2019, ICLR.
[45] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[46] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[47] Sebastian Riedel, et al. Language Models as Knowledge Bases?, 2019, EMNLP.
[48] Sanjiv Kumar, et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization, 2019, ICML.
[49] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.
[50] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[51] Omer Levy, et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans, 2019, TACL.
[52] Ali Farhadi, et al. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index, 2019, ACL.
[53] Ming-Wei Chang, et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.
[54] Jiajun Zhang, et al. The Impact of Named Entity Translation for Neural Machine Translation, 2018, Communications in Computer and Information Science.
[55] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, ArXiv.
[56] Christophe Gravier, et al. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples, 2018, LREC.
[57] Ali Farhadi, et al. Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension, 2018, EMNLP.
[58] Ruslan Salakhutdinov, et al. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model, 2017, ICLR.
[59] Wei Hu, et al. Cross-Lingual Entity Alignment via Joint Attribute-Preserving Embedding, 2017, ISWC.
[60] Jason Weston, et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.
[61] Jeff Johnson, et al. Billion-Scale Similarity Search with GPUs, 2017, IEEE Transactions on Big Data.
[62] Hany Hassan Awadalla, et al. Improving Named Entity Translation by Exploiting Comparable and Parallel Corpora, 2016.
[63] Moustapha Cissé, et al. Efficient softmax approximation for GPUs, 2016, ICML.
[64] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[65] Wenlin Chen, et al. Strategies for Training Large Vocabulary Neural Language Models, 2015, ACL.
[66] Xiang Zhang, et al. Character-level Convolutional Networks for Text Classification, 2015, NIPS.
[67] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[68] Jian Sun, et al. Optimized Product Quantization, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[69] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.
[70] Jure Leskovec, et al. Hidden factors and hidden topics: understanding rating dimensions with review text, 2013, RecSys.
[71] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, 2013, EMNLP.
[72] David J. Fleet, et al. Fast search in Hamming space with multi-index hashing, 2012, CVPR.
[73] Hugo Zaragoza, et al. The Probabilistic Relevance Framework: BM25 and Beyond, 2009, Found. Trends Inf. Retr.
[74] Bing Liu, et al. Mining and summarizing customer reviews, 2004, KDD.
[75] Bo Pang, et al. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004, ACL.
[76] Bogdan Babych, et al. Improving Machine Translation Quality with Automatic Named Entity Recognition, 2003, Proceedings of the 7th International EAMT Workshop (EAMT '03).
[77] Robert C. Moore. Learning Translations of Named-Entity Phrases from Parallel Corpora, 2003, EACL.
[78] William T. Freeman, et al. Example-Based Super-Resolution, 2002, IEEE Computer Graphics and Applications.
[79] Philip H. Ramsey. Nonparametric Statistical Methods, 1974, Technometrics.
[80] Jan-Christoph Kalo. KAMEL: Knowledge Analysis with Multitoken Entities in Language Models, 2022.
[81] Luke Zettlemoyer, et al. Nearest Neighbor Zero-Shot Inference, 2022, EMNLP.
[82] Arun Tejasvi Chaganty, et al. Attributed Text Generation via Post-hoc Research and Revision, 2022, ArXiv.
[83] Luke Zettlemoyer, et al. Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models, 2022, ArXiv.
[84] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[85] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[86] Cordelia Schmid, et al. Product Quantization for Nearest Neighbor Search, 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[87] Yoshua Bengio, et al. Hierarchical Probabilistic Neural Network Language Model, 2005, AISTATS.
[88] S. Shott, et al. Nonparametric Statistics, 2018, The Encyclopedia of Archaeological Sciences.