Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically, as text, in addition to the neural model. We also distill only one aspect, the commonsense of a general language model teacher, allowing the student to be a different type, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model's commonsense capabilities despite its 100x smaller size. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
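
The pipeline the abstract describes, prompting a large general language model for candidate knowledge and keeping only the candidates a critic model accepts, can be illustrated with a short script. The sketch below is a minimal illustration, not the paper's implementation: GPT-2 stands in for the GPT-3 teacher, an off-the-shelf NLI model stands in for the separately trained critic, and the few-shot prompt, the relation (xWant), and the acceptance threshold are illustrative assumptions.

```python
# Minimal sketch of the loop described above: prompt a general LM for
# candidate commonsense inferences, then filter them with a critic.
# Assumptions (not from the paper): GPT-2 replaces the GPT-3 teacher,
# roberta-large-mnli replaces the trained critic, and the prompt,
# relation ("xWant"), and threshold are illustrative.
from transformers import pipeline

# Teacher: a general language model given a few-shot prompt of
# ATOMIC-style (event, relation, inference) examples.
generator = pipeline("text-generation", model="gpt2")

FEW_SHOT_PROMPT = (
    "Event: X pays Y a compliment. As a result, X wants: to chat with Y.\n"
    "Event: X buys an old house. As a result, X wants: to renovate it.\n"
    "Event: X wins the lottery. As a result, X wants:"
)

# Sample several candidate inferences for the new event.
candidates = generator(
    FEW_SHOT_PROMPT,
    max_new_tokens=12,
    num_return_sequences=5,
    do_sample=True,
    top_p=0.9,
    pad_token_id=generator.tokenizer.eos_token_id,
)

# Critic: scores whether a candidate plausibly follows from the event.
# An NLI entailment model is used here purely as a placeholder for the
# paper's supervised critic trained on human acceptability judgments.
critic = pipeline("text-classification", model="roberta-large-mnli")

ACCEPT_THRESHOLD = 0.5  # illustrative cut-off, not a value from the paper
accepted = []
for cand in candidates:
    inference = cand["generated_text"][len(FEW_SHOT_PROMPT):].split("\n")[0].strip()
    verdict = critic(f"X wins the lottery. </s></s> X wants {inference}")[0]
    if verdict["label"] == "ENTAILMENT" and verdict["score"] > ACCEPT_THRESHOLD:
        accepted.append(("X wins the lottery", "xWant", inference))

# `accepted` holds the filtered symbolic knowledge; at scale, such triples
# would be used to fine-tune a smaller student commonsense model.
print(accepted)
```

In the paper, the generations come from GPT-3 at scale, the critic is trained on human acceptability labels, and the accepted triples form the distilled knowledge graph used to fine-tune the student commonsense model.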

[1] Yejin Choi et al. COMET: Commonsense Transformers for Automatic Knowledge Graph Construction, 2019, ACL.

[2] Ronan Le Bras et al. Generative Data Augmentation for Commonsense Reasoning, 2020, EMNLP.

[3] Dhruv Batra et al. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, 2018, CVPR.

[4] Samuel R. Bowman et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.

[5] Xiao Ding et al. Guided Generation of Cause and Effect, 2020, IJCAI.

[6] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, arXiv.

[7] Yannis Papanikolaou et al. DARE: Data Augmented Relation Extraction with GPT-2, 2020, arXiv.

[8] Yejin Choi et al. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task, 2017, CoNLL.

[9] Jack Hessel et al. Does My Multimodal Model Learn Cross-modal Interactions? It’s Harder to Tell than You Might Think!, 2020, EMNLP.

[10] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[11] Yejin Choi et al. CommonsenseQA 2.0: Exposing the Limits of AI through Gamification, 2021, NeurIPS Datasets and Benchmarks.

[12] Yejin Choi et al. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning, 2019, AAAI.

[13] Alexander M. Rush et al. Commonsense Knowledge Mining from Pretrained Models, 2019, EMNLP.

[14] Omer Levy et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, arXiv.

[15] Xinlei Chen et al. Never-Ending Learning, 2012, ECAI.

[16] Dong Si et al. A Wizard-of-Oz Interface and Persona-based Methodology for Collecting Health Counseling Dialog, 2020, CHI Extended Abstracts.

[17] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners, 2019.

[18] Praveen Paritosh et al. Freebase: a collaboratively created graph database for structuring human knowledge, 2008, SIGMOD.

[19] Thomas Wolf et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, arXiv.

[20] Yejin Choi et al. Conversational Multi-Hop Reasoning with Neural Commonsense Knowledge and Symbolic Logic Rules, 2021, EMNLP.

[21] Omer Levy et al. Annotation Artifacts in Natural Language Inference Data, 2018, NAACL.

[22] Thomas Wolf et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, arXiv.

[23] Smaranda Muresan et al. MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding, 2021, NAACL.

[24] J. Fleiss. Measuring nominal scale agreement among many raters, 1971.

[25] Yejin Choi et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[26] Jens Lehmann et al. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia, 2015, Semantic Web.

[27] Timo Schick et al. Generating Datasets with Pretrained Language Models, 2021, EMNLP.

[28] Xin Liu et al. ASER: A Large-scale Eventuality Knowledge Graph, 2019, WWW.

[29] Kenneth Heafield et al. N-gram Counts and Language Models from the Common Crawl, 2014, LREC.

[30] Yejin Choi et al. COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs, 2020, AAAI.

[31] Dawn Song et al. Measuring Massive Multitask Language Understanding, 2020, ICLR.

[32] Jason Weston et al. How to Motivate Your Dragon: Teaching Goal-Driven Agents to Speak and Act in Fantasy Worlds, 2020, NAACL.

[33] Eunah Cho et al. Data Augmentation using Pre-trained Transformer Models, 2020, LIFELONGNLP.

[34] Masatoshi Tsuchiya et al. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment, 2018, LREC.

[35] Dan Roth et al. TransOMCS: From Linguistic Graphs to Commonsense Knowledge, 2020, IJCAI.

[36] Yoav Goldberg et al. Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets, 2019, EMNLP.

[37] Dan Roth et al. Temporal Common Sense Acquisition with Minimal Supervision, 2020, ACL.

[38] J. R. Landis et al. The measurement of observer agreement for categorical data, 1977, Biometrics.

[39] Ateret Anaby-Tavor et al. Do Not Have Enough Data? Deep Learning to the Rescue!, 2020, AAAI.

[40] Ernest Davis et al. Causal generative models are just a start, 2017, Behavioral and Brain Sciences.

[41] Lei Zheng et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[42] Roy Schwartz et al. Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?, 2021, Transactions of the Association for Computational Linguistics.

[43] Wenhan Xiong et al. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model, 2019, ICLR.

[44] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[45] Mark O. Riedl et al. Automated Storytelling via Causal, Commonsense Plot Ordering, 2021, AAAI.

[46] Peter Clark et al. GenericsKB: A Knowledge Base of Generic Statements, 2020, arXiv.

[47] Jason Weston et al. Neural Text Generation with Unlikelihood Training, 2019, ICLR.

[48] Oren Etzioni et al. Open Information Extraction: The Second Generation, 2011, IJCAI.

[49] Ronan Le Bras et al. Adversarial Filters of Dataset Biases, 2020, ICML.

[50] Catherine Havasi et al. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge, 2016, AAAI.