Natural Statistics Of Network Activations And Implications For Knowledge Distillation

In a manner analogous to the study of natural image statistics, we study the natural statistics of deep neural network activations at various layers. As we show, these statistics, much like image statistics, follow a power law. We also show, both analytically and empirically, that the exponent of this power law grows linearly with depth. As a direct implication of these findings, we present a method for performing Knowledge Distillation (KD). While classical KD methods consider the logits of the teacher network, more recent methods obtain a leap in performance by considering the activation maps; however, they rely on metrics that are suited to comparing images. We propose to employ two additional loss terms based on the spectral properties of the intermediate activation maps. The proposed method obtains state-of-the-art results on multiple image recognition KD benchmarks.
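
The abstract does not spell out the two spectral loss terms, so the following is only a minimal PyTorch sketch of the kind of quantities involved: a radially averaged power spectrum of an activation map, a least-squares estimate of its power-law exponent in log-log space, and an illustrative spectral-matching loss between teacher and student feature maps. The function names (`radial_power_spectrum`, `power_law_exponent`, `spectral_distillation_loss`) and the specific loss form are assumptions for illustration, not the paper's definitions.

```python
# Illustrative sketch (not the paper's exact formulation).
import torch
import torch.nn.functional as F


def radial_power_spectrum(feat: torch.Tensor, n_bins: int = 32) -> torch.Tensor:
    """Radially averaged spatial power spectrum of a (B, C, H, W) activation tensor,
    averaged over batch and channels. Returns a (n_bins,) tensor."""
    power = torch.fft.fft2(feat).abs() ** 2          # squared FFT magnitude per map
    power = torch.fft.fftshift(power, dim=(-2, -1))  # move the DC component to the center

    B, C, H, W = feat.shape
    yy, xx = torch.meshgrid(
        torch.arange(H, device=feat.device) - H // 2,
        torch.arange(W, device=feat.device) - W // 2,
        indexing="ij",
    )
    radius = torch.sqrt(xx.float() ** 2 + yy.float() ** 2)
    bins = torch.clamp((radius / radius.max() * (n_bins - 1)).long(), max=n_bins - 1)

    flat_power = power.mean(dim=(0, 1)).flatten()    # average over batch and channels
    spectrum = torch.zeros(n_bins, device=feat.device).scatter_add_(0, bins.flatten(), flat_power)
    counts = torch.zeros(n_bins, device=feat.device).scatter_add_(
        0, bins.flatten(), torch.ones_like(flat_power)
    )
    return spectrum / counts.clamp(min=1)


def power_law_exponent(spectrum: torch.Tensor) -> torch.Tensor:
    """Least-squares slope of log-power vs. log-frequency, i.e. alpha in P(f) ~ 1 / f^alpha."""
    eps = 1e-8
    freqs = torch.arange(1, spectrum.numel(), device=spectrum.device).float()  # skip the DC bin
    log_f = torch.log(freqs)
    log_p = torch.log(spectrum[1:] + eps)
    slope = ((log_f - log_f.mean()) * (log_p - log_p.mean())).sum() / \
            ((log_f - log_f.mean()) ** 2).sum()
    return -slope


def spectral_distillation_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    """Illustrative spectral KD term: match the log radial spectra of the two activation maps."""
    s_spec = radial_power_spectrum(student_feat)
    t_spec = radial_power_spectrum(teacher_feat)
    return F.mse_loss(torch.log(s_spec + 1e-8), torch.log(t_spec + 1e-8))
```

In a full KD objective, a term of this kind would typically be added, with a weighting coefficient, to the usual cross-entropy and logit-distillation losses; the exponent estimator can likewise be used to verify empirically that the fitted power-law exponent grows linearly with layer depth.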
