An empirical analysis of dropout in piecewise linear networks

The recently introduced dropout training criterion for neural networks has received much attention, owing to its simplicity and remarkable effectiveness as a regularizer, as well as its interpretation as a training procedure for an exponentially large ensemble of networks that share parameters. In this work we empirically investigate several questions related to the efficacy of dropout, specifically as it concerns networks employing the popular rectified linear activation function. We assess the quality of the test-time weight-scaling inference procedure by evaluating the geometric average exactly in small models, and we compare the performance of the geometric mean to that of the arithmetic mean more commonly employed by ensemble techniques. We explore the effect of tied weights on the ensemble interpretation by training ensembles of masked networks without tied weights. Finally, we investigate an alternative criterion based on a biased estimator of the maximum-likelihood ensemble gradient. A small illustrative sketch of the averaging comparison follows.
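
To illustrate the kind of comparison described above, the following sketch (not the authors' code; the network sizes, the softmax output layer, and the retention probability p = 0.5 are illustrative assumptions) enumerates every dropout mask on the hidden layer of a tiny one-hidden-layer ReLU network and compares the exact geometric mean of the sub-network predictions, the exact arithmetic mean, and the standard weight-scaling approximation.

    # Minimal sketch, assuming a one-hidden-layer ReLU softmax network and p = 0.5,
    # so that all 2**n_hid dropout masks are equiprobable and unweighted means are exact.
    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out = 5, 8, 3   # small enough to enumerate all 2**n_hid masks
    p = 0.5                        # probability of retaining each hidden unit

    W1 = rng.normal(scale=0.5, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
    W2 = rng.normal(scale=0.5, size=(n_hid, n_out)); b2 = np.zeros(n_out)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def forward(x, mask):
        # ReLU hidden layer with a binary dropout mask applied to the hidden units
        h = np.maximum(0.0, x @ W1 + b1) * mask
        return softmax(h @ W2 + b2)

    x = rng.normal(size=n_in)

    # Exact ensemble predictions over all 2**n_hid masked sub-networks
    probs = np.array([forward(x, np.array(m, dtype=float))
                      for m in itertools.product([0, 1], repeat=n_hid)])

    arith = probs.mean(axis=0)                 # arithmetic mean of the ensemble
    geo = np.exp(np.log(probs).mean(axis=0))   # geometric mean ...
    geo /= geo.sum()                           # ... renormalized to sum to one

    # Weight-scaling inference: one forward pass with hidden activations scaled by p
    h = np.maximum(0.0, x @ W1 + b1)
    scaled = softmax((h * p) @ W2 + b2)

    print("arithmetic mean :", np.round(arith, 4))
    print("geometric mean  :", np.round(geo, 4))
    print("weight scaling  :", np.round(scaled, 4))

In this toy setting the weight-scaled prediction can be compared directly against both exact ensemble averages, which is feasible only because the number of hidden units, and hence of masks, is kept very small.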
