Dropout Training as Adaptive Regularization

Dropout and other feature noising schemes control overfitting by artificially corrupting the training data. For generalized linear models, dropout performs a form of adaptive regularization. Using this viewpoint, we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix. We also establish a connection to AdaGrad, an online learning algorithm, and find that a close relative of AdaGrad operates by repeatedly solving linear dropout-regularized problems. By casting dropout as regularization, we develop a natural semi-supervised algorithm that uses unlabeled data to create a better adaptive regularizer. We apply this idea to document classification tasks, and show that it consistently boosts the performance of dropout training, improving on state-of-the-art results on the IMDB reviews dataset.
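As a rough sketch of the claimed first-order equivalence (the notation below, including the dropout probability $\delta$ and the logistic-regression specialization, is our own and not part of the abstract): with predictions $p_i = \sigma(x_i^\top \beta)$, the quadratic approximation of the dropout noising penalty takes the form

$$ R^q(\beta) \;\approx\; \frac{\delta}{2(1-\delta)} \sum_{j} \beta_j^2 \sum_{i} p_i (1 - p_i)\, x_{ij}^2, $$

where the inner sum is the $j$-th diagonal entry $\hat{\mathcal{I}}_{jj}$ of the empirical Fisher information $\sum_i p_i(1-p_i)\, x_i x_i^\top$. Reparametrizing with $\gamma_j = \hat{\mathcal{I}}_{jj}^{1/2} \beta_j$ turns the penalty into an ordinary L2 penalty $\|\gamma\|_2^2$ up to a constant, which is the sense in which dropout behaves like L2 regularization after scaling the features by the inverse diagonal Fisher information.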
