The Effect of Masking Strategies on Knowledge Retention by Language Models

Language models retain a significant amount of world knowledge from their pre-training stage. This allows knowledgeable models to be applied to knowledge-intensive tasks prevalent in information retrieval, such as ranking or question answering. Understanding how and which factual information is acquired by these models is necessary for building them responsibly. However, little work has examined how the choice of pre-training task affects the amount of knowledge that language models capture and forget during pre-training. Building a better understanding of knowledge acquisition is the goal of this paper. We therefore use a selection of pre-training tasks to infuse knowledge into our model and then test its knowledge retention by measuring its ability to answer factual questions. Our experiments show that masking entities and principled masking of correlated spans based on pointwise mutual information lead to more factual knowledge being retained than masking random tokens. Our findings further show that, like the ability to perform a task, the (factual) knowledge acquired while training on that task is forgotten when the model is subsequently trained on another task (catastrophic forgetting), and they indicate how this forgetting can be prevented. To foster reproducibility, the code and the data used in this paper are openly available.
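
To make the compared masking strategies concrete, the sketch below illustrates how mask positions could be chosen under random token masking, whole-entity masking, and a PMI-based heuristic for correlated spans. It is a minimal toy illustration only: the token list, entity spans, corpus counts, PMI threshold, and helper names (random_token_masking, entity_masking, pmi_span_masking) are assumptions made for this example, not the paper's implementation; PMI-Masking in the literature computes collocation statistics over the entire pre-training corpus and also covers longer n-grams.

    import random
    from math import log

    # Toy example sentence and hypothetical entity annotations ("france", "paris").
    TOKENS = ["the", "capital", "of", "france", "is", "paris", "."]
    ENTITY_SPANS = [(3, 4), (5, 6)]

    MASK = "[MASK]"
    MASK_RATE = 0.15


    def random_token_masking(tokens, rate=MASK_RATE, seed=0):
        """BERT-style masking: each token is masked independently at a fixed rate."""
        rng = random.Random(seed)
        return [MASK if rng.random() < rate else tok for tok in tokens]


    def entity_masking(tokens, entity_spans):
        """Mask whole entity mentions so the model must recall the entity itself."""
        masked = list(tokens)
        for start, end in entity_spans:
            for i in range(start, end):
                masked[i] = MASK
        return masked


    def pmi(pair_count, left_count, right_count, total):
        """Pointwise mutual information of a bigram: log p(x, y) / (p(x) p(y))."""
        return log((pair_count / total) / ((left_count / total) * (right_count / total)))


    def pmi_span_masking(tokens, bigram_counts, unigram_counts, total, threshold=2.0):
        """Mask adjacent token pairs whose PMI exceeds a threshold, so that one token
        of a highly correlated span cannot be trivially predicted from the other."""
        masked = list(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in bigram_counts:
                score = pmi(bigram_counts[pair], unigram_counts[tokens[i]],
                            unigram_counts[tokens[i + 1]], total)
                if score > threshold:
                    masked[i] = masked[i + 1] = MASK
        return masked


    if __name__ == "__main__":
        # Toy corpus statistics; in practice these come from the full pre-training corpus.
        unigrams = {"the": 100, "capital": 10, "of": 90, "france": 8, "is": 80, "paris": 7, ".": 120}
        bigrams = {("capital", "of"): 9, ("of", "france"): 7}
        # Note: on a short toy sentence, 15% random masking may leave it unchanged.
        print(random_token_masking(TOKENS))
        print(entity_masking(TOKENS, ENTITY_SPANS))
        print(pmi_span_masking(TOKENS, bigrams, unigrams, total=1000))

In the paper's setup, such strategies would decide which positions are replaced by [MASK] during masked-language-model pre-training; the sketch only covers the selection of mask positions, not the training loop itself.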
