Variational Student: Learning Compact and Sparser Networks In Knowledge Distillation Framework

The holy grail of deep neural network research is porting memory- and computation-intensive network models onto embedded platforms with minimal compromise in model accuracy. To this end, we propose Variational Student, which combines the compressibility of the knowledge distillation framework with the sparsity-inducing abilities of variational inference (VI) techniques. Essentially, we build an accurate and sparse student network whose sparsity is induced by variational parameters obtained by optimizing a VI-based loss function, while leveraging the knowledge learnt by an accurate but complex pre-trained teacher network. To further enhance sparsity, we also employ a Block Sparse Regularizer on a concatenated tensor of the teacher and student network weights. We benchmark our approach on MLP and CNN variants and demonstrate a reduction in memory footprint of up to ∼213× without the need to retrain the teacher network.

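A minimal sketch of the training objective implied by the abstract is shown below, assuming a PyTorch setup: a Hinton-style distillation term, the approximate KL term of sparse variational dropout standing in for the VI-based sparsity penalty on the student's variational parameters, and a group-lasso penalty on a concatenated teacher-student weight tensor standing in for the Block Sparse Regularizer. All function names, hyperparameters (temperature T, weights alpha, beta, gamma), and the row-wise grouping after concatenation are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of a Variational Student-style objective (assumed, not
# the authors' code): knowledge distillation + sparse variational dropout KL
# + a group-lasso ("block sparse") penalty on concatenated teacher/student weights.

import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: softened KL to the teacher plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def sparse_vd_kl(log_alpha):
    """Approximate KL term of sparse variational dropout (Molchanov et al., 2017).
    log_alpha = log(sigma^2 / w^2) are the student's variational parameters;
    driving log_alpha up effectively prunes the corresponding weight."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    neg_kl = (k1 * torch.sigmoid(k2 + k3 * log_alpha)
              - 0.5 * F.softplus(-log_alpha) - k1)
    return -neg_kl.sum()


def block_sparse_penalty(teacher_w, student_w):
    """Group-lasso penalty over rows of a tensor built by concatenating a
    teacher and a student weight matrix that share their output dimension
    (e.g. the classifier layers); this grouping is an assumption."""
    stacked = torch.cat([teacher_w, student_w], dim=1)  # (out, d_teacher + d_student)
    return stacked.norm(p=2, dim=1).sum()


def variational_student_loss(student_logits, teacher_logits, labels,
                             log_alphas, teacher_w, student_w,
                             beta=1e-4, gamma=1e-4):
    """Total objective: distillation + VI sparsity + block-sparse regularizer."""
    kd = distillation_loss(student_logits, teacher_logits, labels)
    vi = sum(sparse_vd_kl(a) for a in log_alphas)  # one log_alpha tensor per student layer
    bs = block_sparse_penalty(teacher_w, student_w)
    return kd + beta * vi + gamma * bs
```

In such a setup only the student parameters (including the log_alpha tensors) would be updated while the pre-trained teacher stays fixed; weights whose log_alpha exceeds a threshold can be zeroed out at test time, which is how sparse variational dropout typically realizes its memory savings.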