Dropout Training as Adaptive Regularization

Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
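As a rough sketch of the claimed first-order equivalence (the notation below, including the dropout probability $\delta$ and the logistic-regression specialization, is our own and not part of the abstract): with predictions $p_i = \sigma(x_i^\top \beta)$, the quadratic approximation of the dropout noising penalty takes the form

$$ R^q(\beta) \;\approx\; \frac{\delta}{2(1-\delta)} \sum_{j} \beta_j^2 \sum_{i} p_i (1 - p_i)\, x_{ij}^2, $$

where the inner sum is the $j$-th diagonal entry $\hat{\mathcal{I}}_{jj}$ of the empirical Fisher information $\sum_i p_i(1-p_i)\, x_i x_i^\top$. Reparametrizing with $\gamma_j = \hat{\mathcal{I}}_{jj}^{1/2} \beta_j$ turns the penalty into an ordinary L2 penalty $\|\gamma\|_2^2$ up to a constant, which is the sense in which dropout behaves like L2 regularization after scaling the features by the inverse diagonal Fisher information.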
