Penetrating the influence of regularization on neural networks based on information bottleneck theory

Abstract Regularization is an effective technique for alleviating overfitting in neural networks and improving the generalization ability of the model. However, the working mechanisms of regularization methods and their impact on model performance have not been fully explored. In this paper, we study and analyze them using information bottleneck theory together with a theory of the human sensory system. We propose a metric, named the AEntry value, to characterize the encoding length of hidden layers. We then conduct extensive experiments on the MNIST and FashionMNIST datasets with several commonly used regularization algorithms and calculate the corresponding AEntry values. From the analysis of these results we draw three conclusions. (1) The introduction of regularization influences how the neural network encodes the features relevant to the prediction task. Early stopping avoids introducing task-irrelevant information into the model by halting training at an appropriate iteration. Laplace, Gaussian, and Sparse Response regularizations compress the task-relevant representation and improve the performance of the neural network by introducing prior information into the model. In contrast, Dropout, Batch Normalization, and Layer Normalization increase the encoding length of the features by adopting redundant representations to improve performance. (2) The encoding of a neural network does not satisfy the data processing inequality of information theory, which is mainly caused by redundant coding of the extracted features. (3) Overfitting is caused by introducing information that is irrelevant to the target. These results offer insight into building more efficient regularization algorithms to improve the performance of neural network models.
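
The abstract does not give the precise definition of the AEntry value, so the following is only a minimal illustrative sketch: it assumes the metric behaves like an average encoding length obtained by discretizing each hidden unit's activations and measuring the Shannon entropy per unit. The function name `aentry_value`, the binning scheme, and the bin count are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def aentry_value(activations, n_bins=30):
    """Hypothetical encoding-length proxy for a hidden layer.

    activations: array of shape (n_samples, n_units) holding the
    hidden-layer outputs collected over a dataset. Each unit's values
    are discretized into n_bins bins; the Shannon entropy (in bits) is
    computed per unit and averaged over units.
    """
    entropies = []
    for unit in activations.T:
        counts, _ = np.histogram(unit, bins=n_bins)
        probs = counts / counts.sum()
        probs = probs[probs > 0]          # drop empty bins to avoid log(0)
        entropies.append(-np.sum(probs * np.log2(probs)))
    return float(np.mean(entropies))

# Usage sketch: compare the encoding length of the same layer trained
# with and without a regularizer (synthetic activations shown here).
rng = np.random.default_rng(0)
h_plain = rng.normal(size=(1000, 128))
h_dropout = h_plain * rng.binomial(1, 0.5, size=(1000, 128))
print(aentry_value(h_plain), aentry_value(h_dropout))
```

Under this assumed definition, a larger value would indicate a longer (more redundant) code for the layer's features and a smaller value a more compressed representation, which is the kind of comparison the abstract draws between, e.g., Sparse Response regularization and Dropout.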
