Preserving Commonsense Knowledge from Pre-trained Language Models via Causal Inference

Fine-tuning has proven to be a simple and effective technique for transferring the learned knowledge of Pre-trained Language Models (PLMs) to downstream tasks. However, vanilla fine-tuning easily overfits the target data and degrades generalization. Most existing studies attribute this to catastrophic forgetting and retain the pre-trained knowledge indiscriminately, without identifying which knowledge is actually transferable. Motivated by this, we frame fine-tuning as a causal graph and discover that the crux of catastrophic forgetting lies in the missing causal effect of the pre-trained data. Based on this causal view, we propose a unified objective for fine-tuning that recovers the lost causality. Intriguingly, the unified objective can be seen as the sum of the vanilla fine-tuning objective, which learns new knowledge from the target data, and a causal objective, which preserves old knowledge from the PLM. Our method is therefore flexible: it mitigates negative transfer while preserving knowledge. Since endowing models with commonsense is a long-standing challenge, we instantiate our method on commonsense QA with a proposed heuristic estimation of the causal effect to verify its effectiveness. In experiments, our method outperforms state-of-the-art fine-tuning methods on all six commonsense QA datasets and can be used as a plug-in module to boost the performance of existing QA models.
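
To make the shape of this unified objective concrete, the snippet below is a minimal PyTorch-style sketch of a fine-tuning loss that adds a knowledge-preservation term to the standard task loss. The specific form of the preservation term here (a KL divergence to the frozen pre-trained model's predictions) and the names `unified_loss`, `frozen_logits`, and `lam` are illustrative assumptions, not the paper's actual causal estimator, which is derived from the causal graph and a heuristic estimation.

```python
import torch.nn.functional as F

def unified_loss(logits, labels, frozen_logits, lam=1.0):
    """Illustrative unified objective = task loss + knowledge-preservation loss.

    logits        : predictions of the model being fine-tuned, shape [batch, classes]
    labels        : gold labels from the target (downstream) data, shape [batch]
    frozen_logits : predictions of the frozen pre-trained model on the same batch
    lam           : weight of the preservation term (hypothetical hyperparameter)
    """
    # Vanilla fine-tuning objective: learn new knowledge from the target data.
    task_loss = F.cross_entropy(logits, labels)

    # Placeholder preservation term: keep the fine-tuned model's output
    # distribution close to that of the frozen pre-trained model. The paper's
    # causal objective plays this role but is estimated differently.
    preserve_loss = F.kl_div(
        F.log_softmax(logits, dim=-1),
        F.softmax(frozen_logits, dim=-1),
        reduction="batchmean",
    )

    return task_loss + lam * preserve_loss
```

Setting `lam` to zero recovers vanilla fine-tuning, which is why an objective of this additive form can be used as a plug-in on top of existing QA models.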
