Mind the Gap: Assessing Temporal Generalization in Neural Language Models

Our world is open-ended, non-stationary, and constantly evolving; thus what we talk about and how we talk about it change over time. This inherent dynamic nature of language contrasts with the current static language modelling paradigm, which trains and evaluates models on utterances from overlapping time periods. Despite impressive recent progress, we demonstrate that Transformer-XL language models perform worse in the realistic setup of predicting future utterances from beyond their training period, and that this degradation grows as the temporal gap between training and evaluation data widens. We find that, while increasing model size alone—a key driver behind recent progress—does not solve this problem, models that continually update their knowledge with new information can indeed mitigate this performance degradation over time. Hence, given the compilation of ever-larger language modelling datasets, combined with the growing list of language-model-based NLP applications that require up-to-date factual knowledge about the world, we argue that now is the right time to rethink the static way in which we currently train and evaluate our language models, and to develop adaptive language models that can remain up-to-date with respect to our ever-changing and non-stationary world. We will publicly release our dynamic, streaming language modelling benchmarks for WMT and ARXIV to facilitate language model evaluation that takes temporal dynamics into account.
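To make the evaluation setup concrete, the sketch below illustrates a time-stratified split and a per-period perplexity evaluation of the kind the abstract describes: train on utterances up to a cutoff year, then measure perplexity separately on each later year so that degradation can be tracked as the gap from the training period grows. This is a minimal, hypothetical illustration, not the released benchmark code; the names (`Document`, `split_by_year`, `evaluate_by_period`, `token_logprob`) are assumptions introduced here.

```python
# Minimal sketch of a temporal train/test split and per-period perplexity.
# All names are illustrative and not part of the released WMT/ARXIV benchmarks.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Dict, List
import math


@dataclass
class Document:
    year: int          # publication/crawl year of the utterance
    tokens: List[str]  # tokenised text


def split_by_year(docs: List[Document], last_train_year: int):
    """Train on documents up to `last_train_year`; evaluate only on later years."""
    train = [d for d in docs if d.year <= last_train_year]
    test = [d for d in docs if d.year > last_train_year]
    return train, test


def evaluate_by_period(
    test_docs: List[Document],
    token_logprob: Callable[[List[str], int], float],
) -> Dict[int, float]:
    """Per-year perplexity on the future test data.

    `token_logprob(tokens, i)` is assumed to return the model's natural-log
    probability of tokens[i] given its preceding context within the document.
    """
    nll = defaultdict(float)     # accumulated negative log-likelihood per year
    counts = defaultdict(int)    # token counts per year
    for doc in test_docs:
        for i in range(len(doc.tokens)):
            nll[doc.year] -= token_logprob(doc.tokens, i)
            counts[doc.year] += 1
    return {year: math.exp(nll[year] / counts[year]) for year in sorted(nll)}
```

Under this setup, a model unaffected by temporal drift would show roughly flat perplexity across the evaluation years, whereas the finding reported above corresponds to perplexity rising as the evaluation year moves further past the training cutoff.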
