Heterogeneous Knowledge Distillation Using Information Flow Modeling

Knowledge Distillation (KD) methods can transfer the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring knowledge only between the last layers of the networks, while later approaches performed multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore that neural networks undergo different learning phases during training, each of which often requires a different type of supervision. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that models the information flow through the various layers of the teacher model and then trains a student model to mimic this information flow. The proposed method overcomes the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process, as well as by designing and training an auxiliary teacher model that acts as a proxy capable of “explaining” the way the teacher works to the student. The effectiveness of the proposed method is demonstrated using four image datasets and several different evaluation setups.
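
To make the general idea concrete, the following is a minimal PyTorch sketch of multi-layer distillation in the spirit described above: the student is trained to mimic how samples relate to each other at selected intermediate layers of a (possibly auxiliary) teacher, with layer weights that can change across training phases. All names here (similarity_distribution, flow_matching_loss, phase_weights, the layer pairing) are illustrative assumptions, not the authors' exact formulation.

```python
# Hedged sketch: layer-wise distillation by matching pairwise similarity
# distributions between student and teacher features.  Not the paper's exact
# loss; only meant to illustrate the multi-layer "mimic the flow" idea.
import torch
import torch.nn.functional as F


def similarity_distribution(feats: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Turn a batch of layer features into a row-normalized pairwise
    cosine-similarity matrix, used here as a crude proxy for how samples
    relate to each other at that layer."""
    feats = feats.flatten(start_dim=1)
    feats = F.normalize(feats, dim=1, eps=eps)
    sim = feats @ feats.t()                       # pairwise cosine similarities
    sim = (sim + 1.0) / 2.0                       # map from [-1, 1] to [0, 1]
    return sim / (sim.sum(dim=1, keepdim=True) + eps)


def flow_matching_loss(student_feats, teacher_feats, weights, eps: float = 1e-8):
    """Weighted sum of KL divergences between student and teacher similarity
    distributions over the selected layer pairs."""
    loss = torch.zeros((), device=student_feats[0].device)
    for w, s, t in zip(weights, student_feats, teacher_feats):
        p_s = similarity_distribution(s).clamp_min(eps)
        p_t = similarity_distribution(t)
        loss = loss + w * F.kl_div(p_s.log(), p_t, reduction="batchmean")
    return loss


# Usage sketch (assumed interfaces): `teacher(x)` and `student(x)` each return
# a list of intermediate feature maps, and `phase_weights(epoch)` is a
# hypothetical schedule that emphasizes different layers in different training
# phases, reflecting the phase-dependent supervision idea.
#
# for epoch in range(num_epochs):
#     for x, y in loader:
#         with torch.no_grad():
#             t_feats = teacher(x)
#         s_feats = student(x)
#         loss = flow_matching_loss(s_feats, t_feats, phase_weights(epoch))
#         loss.backward(); optimizer.step(); optimizer.zero_grad()
```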
