Regularizing Neural Networks via Retaining Confident Connections

Regularization of neural networks can alleviate overfitting during training. Current regularization methods, such as Dropout and DropConnect, randomly drop neural nodes or connections according to a uniform prior. Such a data-independent strategy does not take the quality of individual units or connections into account. In this paper, we develop a data-dependent approach to regularizing neural networks within the framework of Information Geometry. We propose a measure of connection quality, termed confidence. Specifically, the confidence of a connection is derived from its contribution to the Fisher information distance. The network is adjusted by retaining the confident connections and discarding the less confident ones. The adjusted network, named ConfNet, carries the majority of the variation in the sample data. The relationships among confidence estimation, Maximum Likelihood Estimation and classical model selection criteria (such as the Akaike information criterion) are investigated and discussed theoretically. Furthermore, a Stochastic ConfNet is designed by adding a self-adaptive probabilistic sampling strategy. The proposed data-dependent regularization methods achieve promising experimental results on three datasets: MNIST, CIFAR-10 and CIFAR-100.
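To make the pruning idea concrete, the sketch below scores per-connection confidence for a toy logistic model and keeps only the most confident connections. It uses an empirical diagonal Fisher approximation (mean squared per-sample log-likelihood gradient) as a proxy for each connection's contribution to the Fisher information distance; this proxy, the 50% retention ratio, and the min(1, d * confidence) sampling rule in the stochastic variant are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: binary classification with a single-layer (logistic) model,
# so each weight corresponds to one input connection.
n, d = 512, 20
X = rng.normal(size=(n, d))
w_true = np.concatenate([rng.normal(size=5), np.zeros(d - 5)])
y = (X @ w_true > 0).astype(float)

w = rng.normal(scale=0.1, size=d)  # current weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-sample gradient of the log-likelihood w.r.t. each weight:
# for logistic regression, g_i = (y_i - p_i) * x_i.
p = sigmoid(X @ w)
per_sample_grads = (y - p)[:, None] * X          # shape (n, d)

# Empirical diagonal Fisher: average squared per-sample gradient.
fisher_diag = np.mean(per_sample_grads ** 2, axis=0)

# Confidence proxy: each connection's share of total Fisher information.
confidence = fisher_diag / fisher_diag.sum()

# ConfNet-style hard pruning: retain the top 50% most confident
# connections (the ratio is an assumption for illustration).
keep = confidence >= np.quantile(confidence, 0.5)
w_confnet = np.where(keep, w, 0.0)
print(f"hard pruning retained {keep.sum()} of {d} connections")

# Stochastic-ConfNet-style sampling: draw a binary mask whose retention
# probabilities adapt to the confidence scores (an assumed stand-in for
# the paper's self-adaptive probabilistic sampling strategy).
retain_prob = np.minimum(1.0, d * confidence)
mask = rng.random(d) < retain_prob
w_stochastic = np.where(mask, w, 0.0)
print(f"stochastic sampling retained {mask.sum()} of {d} connections")
```

In a multi-layer network the same scoring would be applied per weight matrix, with the resulting mask held fixed after the adjustment; a single linear layer is used here only to keep the example self-contained.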
