Few Shot Network Compression via Cross Distillation

Model compression has been widely adopted to obtain lightweight deep neural networks. Most prevalent methods, however, require fine-tuning with sufficient training data to ensure accuracy, which can be infeasible due to privacy and security concerns. As a compromise between privacy and performance, in this paper we investigate few-shot network compression: given only a few samples per class, how can we effectively compress the network with negligible performance drop? The core challenge of few-shot network compression lies in the high estimation errors relative to the original network during inference, since the compressed network easily over-fits the few training instances. These estimation errors can propagate and accumulate layer by layer and eventually deteriorate the network output. To address this problem, we propose cross distillation, a novel layer-wise knowledge distillation approach. By interweaving the hidden layers of the teacher and student networks, the layer-wise accumulated estimation errors can be effectively reduced. The proposed method offers a general framework compatible with prevalent network compression techniques such as pruning. Extensive experiments on benchmark datasets demonstrate that cross distillation significantly improves the student network's accuracy when only a few training instances are available.
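
To make the idea of "interweaving" hidden layers concrete, below is a minimal PyTorch-style sketch of a layer-wise cross distillation loss. It assumes one plausible interweaving scheme (at each layer, the student is also fed the teacher's previous hidden state and vice versa) and a simple MSE-based per-layer objective with a hypothetical balancing weight `alpha`; the paper's exact connection scheme and loss weighting may differ, so treat this as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_distillation_loss(teacher_layers, student_layers, x, alpha=0.5):
    """Sketch of layer-wise cross distillation (assumed formulation).

    teacher_layers, student_layers: lists of nn.Module of equal length.
    x: a batch of (few-shot) training inputs.
    alpha: hypothetical weight balancing the two cross terms.
    """
    h_t = x  # teacher's running hidden state
    h_s = x  # student's running hidden state
    loss = 0.0
    for f_t, f_s in zip(teacher_layers, student_layers):
        with torch.no_grad():
            out_t = f_t(h_t)        # teacher on its own path
            out_t_cross = f_t(h_s)  # teacher fed the student's hidden state
        out_s = f_s(h_s)            # student on its own path
        out_s_cross = f_s(h_t)      # student fed the teacher's hidden state

        # Cross terms: the student driven by the teacher's (clean) hidden state
        # should match the teacher's output, and the student on its own
        # (drifted) path should match what the teacher produces from that
        # drifted state. Both terms discourage error accumulation.
        loss = loss + alpha * F.mse_loss(out_s_cross, out_t) \
                    + (1 - alpha) * F.mse_loss(out_s, out_t_cross)

        h_t, h_s = out_t, out_s
    return loss

# Hypothetical usage with tiny MLP blocks standing in for teacher/student layers.
teacher = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(3)]
student = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(3)]
x = torch.randn(8, 16)  # a small "few-shot" batch
loss = cross_distillation_loss(teacher, student, x)
loss.backward()
```

Feeding the teacher's hidden state into the student acts as a per-layer correction, while matching the student's own path against the teacher keeps the loss compatible with standard pruning pipelines that fine-tune layer by layer.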
