Fast “dropout” training for logistic regression

Recently, improved classification performance has been achieved by encouraging independent contributions from input features, or equivalently, by preventing feature co-adaptation. In particular, the method proposed in [1], informally called "dropout", does this by randomly dropping out (zeroing) hidden units and input features when training neural networks. However, repeatedly sampling a random subset of input features slows training considerably. We examine the objective function implied by dropout training in the context of logistic regression. Then, instead of optimizing this objective by Monte Carlo sampling as in [1], we show how to optimize it more directly using a Gaussian approximation justified by the central limit theorem and by empirical evidence, resulting in a 2-30 times speedup and greater stability where it is applicable. We outline potential ways of extending the Gaussian approximation to neural networks and draw connections to other methods in the literature. Finally, we empirically compare the performance of this method to previously published results and to baselines. Code to replicate the results in this paper will be made available.
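As a rough illustration of the idea described above (a minimal sketch, not the paper's implementation), the following NumPy snippet approximates the expected logistic loss under input dropout: with drop probability p, the dropped-out score Y = Σ_i w_i z_i x_i (z_i Bernoulli with keep probability 1-p) is treated, via the central limit theorem, as Gaussian with mean (1-p) wᵀx and variance p(1-p) Σ_i w_i² x_i², and the loss is averaged over a few Gaussian samples instead of over dropout masks. The function name, sample count, and label convention are assumptions for the sake of the example.

```python
import numpy as np

def gaussian_dropout_logistic_loss(w, X, y, p=0.5, num_gauss_samples=10, rng=None):
    """Approximate the expected logistic loss under input dropout.

    Rather than averaging the loss over sampled Bernoulli dropout masks,
    approximate the dropped-out score w.(z*x) by a Gaussian with matching
    mean and variance (central limit theorem), then average the loss over
    a small number of Gaussian samples.

    w : (d,) weight vector
    X : (n, d) feature matrix
    y : (n,) labels in {0, 1}
    p : probability of dropping (zeroing) each input feature
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = 1.0 - p
    mu = keep * (X @ w)                      # E[w.(z*x)] per example
    var = p * keep * (X ** 2) @ (w ** 2)     # Var[w.(z*x)] per example
    # Sample scores from the approximating Gaussian.
    eps = rng.standard_normal((num_gauss_samples, X.shape[0]))
    scores = mu + np.sqrt(var) * eps         # shape (S, n)
    # Numerically stable logistic loss: log(1 + exp(-t)), t = (2y-1) * score.
    t = (2.0 * y - 1.0) * scores
    loss = np.logaddexp(0.0, -t).mean(axis=0)  # average over Gaussian samples
    return loss.mean()                         # average over examples
```

In this sketch the Gaussian is still sampled; the same mean and variance could instead be fed to a deterministic approximation of the expected loss (e.g., numerical integration over the one-dimensional Gaussian), which is the route that avoids Monte Carlo noise altogether.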

[1] Jeffrey K. Uhlmann et al. New extension of the Kalman filter to nonlinear systems, 1997, Defense, Security, and Sensing.

[2] E. Lehmann. Elements of large-sample theory, 1998.

[3] Michael I. Jordan et al. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes, 2001, NIPS.

[4] Bo Pang et al. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts, 2004, ACL.

[5] Bing Liu et al. Mining and summarizing customer reviews, 2004, KDD.

[6] Claire Cardie et al. Annotating Expressions of Opinions and Emotions in Language, 2005, Lang. Resour. Evaluation.

[7] Bo Pang et al. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales, 2005, ACL.

[8] Andrew McCallum et al. Reducing Weight Undertraining in Structured Discriminative Learning, 2006, NAACL.

[9] Kentaro Inui et al. Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables, 2010, NAACL.

[10] Christopher Potts et al. Learning Word Vectors for Sentiment Analysis, 2011, ACL.

[11] Jeffrey Pennington et al. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions, 2011, EMNLP.

[12] Nitish Srivastava et al. Improving neural networks by preventing co-adaptation of feature detectors, 2012, arXiv.

[13] Christopher D. Manning et al. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, 2012, ACL.