The distance between the weights of the neural network is meaningful

When applying neural networks, we need to select a suitable model based on the complexity of the problem and the scale of the dataset. Analyzing a network's capacity requires quantifying the information it has learned. This paper proves that the distance between the network's weights at different training stages can be used to directly estimate the information the network accumulates during training. Experimental results verify the utility of this method, and an application to label corruption is shown at the end.
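
To make the idea concrete, the sketch below tracks how far the weights have moved from their initial values as training proceeds. It is a minimal illustration, not the paper's exact procedure: it assumes the distance is the Euclidean (L2) norm between flattened weight vectors, and the toy model, random data, and optimizer are placeholders.

```python
import copy
import torch
import torch.nn as nn

def flatten_weights(model: nn.Module) -> torch.Tensor:
    """Concatenate all trainable parameters into a single vector."""
    return torch.cat([p.detach().flatten() for p in model.parameters()])

def weight_distance(model_a: nn.Module, model_b: nn.Module) -> float:
    """L2 distance between two snapshots of the same architecture."""
    return torch.norm(flatten_weights(model_a) - flatten_weights(model_b)).item()

# Toy setup: track the distance from the initial weights after each step.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
initial = copy.deepcopy(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

distances = []
for step in range(10):
    # Stand-in batch; a real experiment would iterate over the training set.
    x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()
    distances.append(weight_distance(model, initial))
```

Under the paper's claim, a curve of these distances over training would serve as a direct estimate of how much information the network has accumulated, for example to compare runs on clean versus label-corrupted data.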
