Spectral Roll-off Points Variations: Exploring Useful Information in Feature Maps by Its Variations

Useful information (UI) is an elusive concept in neural networks. A quantitative measurement of UI is absent, although variations of UI can be recognized from prior knowledge. The communication bandwidth of feature maps decreases after downscaling operations, yet after training UI still flows smoothly because it is concentrated below the reduced Nyquist frequency. Inspired by this low-Nyquist-frequency nature of UI, we propose the use of spectral roll-off points (SROPs) to estimate UI through its variations. The computation of an SROP is extended from a 1-D signal to a 2-D image according to the rotation invariance required in image classification tasks. SROP statistics across feature maps are implemented as layer-wise estimates of useful information. We design sanity checks that explore SROP variations when UI variations are produced by variations in model input, model architecture and training stage. SROP variations synchronize with UI variations across a variety of randomized and sufficiently trained model structures. SROP variation is therefore an accurate and convenient sign of UI variation, which promotes the explainability of data representations with respect to frequency-domain knowledge.

With prior knowledge, the variations of useful information can be made clear, even though the amount of useful information remains opaque. Downsampling blocks are common in DNNs, and they filter out high frequencies directly. The sampling theorem states that these operations, which always halve the spatial resolution (sampling frequency), lead to a decline of the Nyquist frequency in consecutive feature maps (Cover & Thomas, 1990; Shannon & Weaver, 1949). Well-trained models possess more useful information than randomized models, so their downsampling blocks eliminate less useful information. We can therefore conclude that useful information in feature maps takes a low-Nyquist-frequency form. We denote this property as the low-Nyquist-frequency prior (LFP).

Motivated by the LFP, we propose the use of spectral roll-off points (SROPs), which quantify feature energy in low-frequency bands. An SROP can serve as an estimate of useful information because its variations can be regarded as a sign of useful-information variations. Since an SROP is computed from the 1-D spectrum of a 1-D signal, several computational obstacles must be tackled to obtain estimates for 3-D feature maps. Transforming the 2-D spectrum into a 1-D spectrum via a radial average extends the SROP computation to a 2-D feature map. This modification is consistent with the property, desired in image classification tasks, that well-trained DNNs possess rotation invariance (Dieleman et al., 2016; Goodfellow et al.; Lenc & Vedaldi, 2015). SROP statistics among kernels are adopted to estimate layer-wise useful information, which substantially reduces the computational load. Experiments provide empirical evidence confirming that SROP variations can trace the variations of useful information.
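To make the procedure concrete, the following is a minimal sketch, not the authors' reference implementation, of how an SROP could be computed for a convolutional feature map: take the 2-D FFT of each channel, radially average the power spectrum into a 1-D spectrum, and report the lowest radial frequency below which a fixed fraction of the spectral energy lies; the mean and standard deviation across channels then serve as the layer-wise statistics. The 0.85 energy threshold is borrowed from the classic speech/music roll-off definition and is an assumption here; the normalization by the highest radial frequency reflects that a stride-2 downsampling halves the sampling rate and hence the Nyquist frequency of the feature map.

import numpy as np

def radial_average_spectrum(feature_map: np.ndarray) -> np.ndarray:
    """Radially average the 2-D power spectrum of one H x W feature map."""
    h, w = feature_map.shape
    power = np.abs(np.fft.fftshift(np.fft.fft2(feature_map))) ** 2
    y, x = np.indices((h, w))
    r = np.sqrt((y - h // 2) ** 2 + (x - w // 2) ** 2).astype(int)
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    # Mean power at each integer radial frequency (guard against empty bins).
    return sums / np.maximum(counts, 1)

def srop(feature_map: np.ndarray, threshold: float = 0.85) -> float:
    """Spectral roll-off point: the smallest radial frequency (normalized to
    [0, 1] of the maximum, i.e. of the Nyquist-limited band) below which
    `threshold` of the total spectral energy is concentrated."""
    spectrum = radial_average_spectrum(feature_map)
    cumulative = np.cumsum(spectrum) / np.sum(spectrum)
    rolloff_bin = int(np.searchsorted(cumulative, threshold))
    return rolloff_bin / (len(spectrum) - 1)

def layerwise_srop(activations: np.ndarray) -> tuple:
    """Layer-wise estimate: mean and std of SROPs over the C channels of a
    C x H x W activation tensor (the 'SROP statistics among kernels')."""
    values = np.array([srop(channel) for channel in activations])
    return float(values.mean()), float(values.std())

# Example: a randomly initialized layer spreads energy across frequencies, so its
# mean SROP is typically higher than that of a well-trained, low-pass-like layer.
mean_srop, std_srop = layerwise_srop(np.random.default_rng(0).standard_normal((64, 56, 56)))

Sweeping such a statistic over consecutive layers yields the layer-wise SROP curves used later to visualize how useful information flows through a network, and each estimate needs only a single input sample.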
Three factors that relate to useful-information variations, namely model input, model structure and sufficient training, are summarized from previous work (Goodfellow et al.; Koh & Liang, 2017; Larochelle et al., 2007; Yak et al., 2019). Control experiments are designed under the sanity-check framework (Adebayo et al.). Everyday-object and digit images are synthesized in different proportions, which produces variations in the patterns of useful information or in the noise intensity, and the SROPs of feature maps change synchronously (a minimal blending sketch follows the contribution list). SROP statistics explain the effectiveness of downscaling, batch normalization (BN), anti-aliased blocks and intermediate layers. Layer-wise SROP curves visualize the flow of useful information in multiple modern model architectures. Comparisons between randomized and pre-trained models demonstrate that low-Nyquist-frequency data representations are the result of sufficient training. The variations of SROPs and of useful information are shown to be consistent. The feasibility of SROP statistics, as well as potential applications and limitations of SROPs, are discussed in Sec. 6.

Our contributions are summarized as follows:

• We measure useful information by its variations and propose the use of SROP variations according to the LFP. The computation of SROPs is extended from a 1-D signal to DNN feature maps. Consequently, we can explain layer-wise useful information with frequency-domain knowledge.

• Systematic analysis proves that SROP variation is a sign of useful-information variation at the layer-wise level. The causes of useful-information variation cover model input, model structure and sufficient training. Experiments include 20 modern DNNs and 3 large-scale datasets.

• Theoretical analysis and empirical evidence show that SROPs separate redundant information from useful information. The computation of an SROP requires only a single sample, which makes it a practical layer-wise measurement that promotes the explainability of DNNs.
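As a concrete illustration of the input-synthesis step mentioned above, the sketch below blends an everyday-object image with a digit image in a controllable proportion and optionally adds Gaussian noise; the linear blending rule, the function name and the noise model are assumptions for illustration rather than the authors' exact protocol.

import numpy as np

def mix_inputs(object_img: np.ndarray, digit_img: np.ndarray,
               alpha: float, noise_std: float = 0.0, seed: int = 0) -> np.ndarray:
    """Blend two equally sized float images; alpha is the object proportion,
    noise_std controls the intensity of the added Gaussian noise."""
    assert object_img.shape == digit_img.shape and 0.0 <= alpha <= 1.0
    mixed = alpha * object_img + (1.0 - alpha) * digit_img
    if noise_std > 0.0:
        mixed = mixed + np.random.default_rng(seed).normal(0.0, noise_std, mixed.shape)
    return mixed

Varying alpha changes the proportion of task-relevant patterns in the input, while varying noise_std changes the noise intensity; these are the kinds of input variation whose effect on SROPs is examined in the experiments.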

[1] Been Kim, et al. Sanity Checks for Saliency Maps, 2018, NeurIPS.

[2] Stefano Ermon, et al. A Theory of Usable Information Under Computational Constraints, 2020, ICLR.

[3] Hanna Mazzawi, et al. Towards Task and Architecture-Independent Generalization Gap Predictors, 2019, ArXiv.

[4] Koray Kavukcuoglu, et al. Exploiting Cyclic Symmetry in Convolutional Neural Networks, 2016, ICML.

[5] M. Degroot. Uncertainty, Information, and Sequential Experiments, 1962.

[6] Zheng Ma, et al. Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks, 2019, Communications in Computational Physics.

[7] Quoc V. Le, et al. Measuring Invariances in Deep Networks, 2009, NIPS.

[8] Andrea Vedaldi, et al. Understanding Image Representations by Measuring Their Equivariance and Equivalence, 2014, International Journal of Computer Vision.

[9] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[10] Yoshua Bengio, et al. A Closer Look at Memorization in Deep Networks, 2017, ICML.

[11] Percy Liang, et al. Understanding Black-box Predictions via Influence Functions, 2017, ICML.

[12] Eric P. Xing, et al. High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Naftali Tishby, et al. Opening the Black Box of Deep Neural Networks via Information, 2017, ArXiv.

[14] Yoshua Bengio, et al. An empirical evaluation of deep architectures on problems with many factors of variation, 2007, ICML.

[15] Quanshi Zhang, et al. Explaining Knowledge Distillation by Quantifying the Knowledge, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Zheng Zhang, et al. Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation, 2020, ECCV.

[17] Michael Carbin, et al. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks, 2018, ICLR.

[18] Heng Tao Shen, et al. Principal Component Analysis, 2009, Encyclopedia of Biometrics.

[19] Richard Zhang, et al. Making Convolutional Networks Shift-Invariant Again, 2019, ICML.

[20] Hossein Mobahi, et al. Fantastic Generalization Measures and Where to Find Them, 2019, ICLR.

[21] Zhi-Qin John Xu, et al. Training behavior of deep neural network in frequency domain, 2018, ICONIP.

[22] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[23] Antonio Politi, et al. Hausdorff Dimension and Uniformity Factor of Strange Attractors, 1984.

[24] Alessandro Laio, et al. Estimating the intrinsic dimension of datasets by a minimal neighborhood information, 2017, Scientific Reports.

[25] S. Maus. The geomagnetic power spectrum, 2008.

[26] Gintare Karolina Dziugaite, et al. Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates, 2019, NeurIPS.

[27] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[28] Kai Xu, et al. Learning in the Frequency Domain, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[30] Yoshua Bengio, et al. On the Spectral Bias of Neural Networks, 2018, ICML.

[31] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.

[32] Pascal Vincent, et al. Representation Learning: A Review and New Perspectives, 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[34] Quanshi Zhang, et al. Knowledge Consistency between Neural Networks and Beyond, 2019, ICLR.

[35] Michele Parrinello, et al. Using sketch-map coordinates to analyze and bias molecular dynamics simulations, 2012, Proceedings of the National Academy of Sciences.

[36] E. K. Lenzi, et al. Statistical mechanics based on Renyi entropy, 2000.

[37] Alessandro Laio, et al. Intrinsic dimension of data representations in deep neural networks, 2019, NeurIPS.

[38] Malcolm Slaney, et al. Construction and evaluation of a robust multifeature speech/music discriminator, 1997, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).