Nonparametric Masked Language Modeling

Existing language models (LMs) predict tokens with a softmax over a finite vocabulary, which makes it difficult to predict rare tokens or phrases. We introduce NPM, the first nonparametric masked language model, which replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. NPM fills in the [MASK] solely by retrieving a token from a text corpus. We show that NPM can be trained efficiently with a contrastive objective and an in-batch approximation to full-corpus retrieval. Zero-shot evaluation on 16 tasks, including classification, fact probing, and question answering, shows that NPM outperforms significantly larger parametric models, with or without a retrieve-and-generate approach. It is particularly effective at handling rare patterns (word senses or facts) and at predicting rare or nearly unseen words (e.g., non-Latin script). We release the model and code at github.com/facebookresearch/NPM.
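To make the two core ideas concrete, below is a minimal sketch, not the released NPM implementation: (1) filling a [MASK] by scoring every corpus token against the masked position instead of applying a softmax over a fixed vocabulary, and (2) an in-batch contrastive loss as a stand-in for full-corpus retrieval during training. All function and variable names (nonparametric_mask_fill, corpus_keys, in_batch_contrastive_loss, etc.) are hypothetical and the encoder producing the embeddings is assumed to exist.

```python
import numpy as np

def nonparametric_mask_fill(query_vec, corpus_keys, corpus_tokens, temperature=1.0):
    """Illustrative nonparametric prediction for one [MASK] position.

    query_vec:     (d,)   contextual embedding of the [MASK] position
    corpus_keys:   (N, d) precomputed embeddings of every token in the reference corpus
    corpus_tokens: (N,)   surface token at each corpus position
    Returns the retrieved token and a distribution over corpus positions.
    """
    # Similarity of the masked position to every corpus token (inner product).
    scores = corpus_keys @ query_vec / temperature          # (N,)
    # Nonparametric "output layer": a softmax over corpus positions,
    # replacing the usual softmax over a finite vocabulary.
    scores -= scores.max()                                   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return corpus_tokens[best], probs

def in_batch_contrastive_loss(query_vecs, key_vecs, positive_idx):
    """Illustrative in-batch approximation to full-corpus retrieval.

    query_vecs:   (B, d) embeddings of masked positions in the batch
    key_vecs:     (B, d) embeddings of candidate (unmasked) token positions
    positive_idx: (B,)   index of the positive key for each query
    Cross-entropy over in-batch keys: each masked position should score its
    positive key higher than the other keys in the batch.
    """
    scores = query_vecs @ key_vecs.T                         # (B, B)
    scores -= scores.max(axis=1, keepdims=True)              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(positive_idx)), positive_idx])
```

In this sketch the corpus embeddings would be built once with a frozen encoder and indexed for approximate nearest-neighbor search at scale; the toy dense matrix here is only to keep the example self-contained.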
