On Knowledge Distillation from Complex Networks for Response Prediction

Recent advances in Question Answering have led to the development of very complex models which compute rich representations for queries and documents by capturing all pairwise interactions between query and document words. This makes these models expensive in both space and time, and in practice one has to restrict the length of the documents that can be fed to them. Such models have also recently been employed for the task of predicting dialog responses from available background documents (e.g., the Holl-E dataset). However, the documents here are longer, rendering these complex models infeasible except in select restricted settings. To overcome this, we use standard simple models which do not capture all pairwise interactions but instead learn to emulate certain characteristics of a complex teacher network. Specifically, we first investigate the conicity of the representations learned by a complex model and observe that it is significantly lower than that of simpler models. Based on this insight, we modify the simple architecture to mimic this characteristic. We go further by using knowledge distillation, where the simple model acts as a student and learns to match the output of the complex teacher network. We experiment with the Holl-E dialog dataset and show that by mimicking characteristics of, and matching outputs from, a teacher, even a simple network can give improved performance.
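For concreteness, the conicity of a set of vectors is commonly defined (following Sharma et al., 2018) as the mean cosine similarity between each vector and the mean vector of the set; low conicity indicates that the vectors are spread widely around the origin rather than concentrated in a narrow cone. The following is a minimal NumPy sketch of this measure; the function name and interface are illustrative choices, not code from the paper.

```python
import numpy as np

def conicity(vectors: np.ndarray) -> float:
    """Mean cosine similarity between each row vector and the mean
    vector of the set (the 'alignment to mean' measure).
    Low conicity: vectors point in widely different directions;
    high conicity: vectors lie inside a narrow cone."""
    mean_vec = vectors.mean(axis=0)
    mean_norm = np.linalg.norm(mean_vec)
    row_norms = np.linalg.norm(vectors, axis=1)
    # Cosine of each vector with the mean vector, then average.
    atm = (vectors @ mean_vec) / (row_norms * mean_norm)
    return float(atm.mean())
```

Under this definition, the observation above is that the representations produced by the complex model score lower on this statistic than those of simpler models, which is the characteristic the simple architecture is then modified to mimic.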
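The output-matching step is knowledge distillation in the sense of Hinton et al. (2015): the student is trained against a temperature-softened version of the teacher's output distribution in addition to the hard labels. Below is a minimal PyTorch sketch of this standard loss; the temperature and mixing weight are illustrative hyperparameters, and the exact objective used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # KL divergence between the temperature-softened teacher and
    # student distributions; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures (Hinton et al.).
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the ground-truth hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```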
