Dropout training for SVMs with data augmentation

Dropout and other feature noising schemes have shown promise for controlling over-fitting by artificially corrupting the training data. Although such schemes have been studied extensively for generalized linear models, little has been done for support vector machines (SVMs), one of the most successful approaches to supervised learning. This paper presents dropout training for both linear SVMs and their nonlinear extension with latent representation learning. For linear SVMs, to deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least squares (IRLS) algorithm by exploiting data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least squares problem, where the re-weights are updated analytically. For nonlinear latent SVMs, we consider learning one layer of latent representations in SVMs and extend the data augmentation technique, in conjunction with a first-order Taylor expansion, to deal with the intractable expected hinge loss and the nonlinearity of the latent representations. Finally, we apply similar data augmentation ideas to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions, and we further develop a nonlinear extension of logistic regression by incorporating one layer of latent representations. Our algorithms offer insights into the connections and differences between the hinge loss and the logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of both linear and nonlinear SVMs.
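
To make the linear-SVM case concrete, the sketch below illustrates one possible IRLS loop for the expected hinge loss under dropout. It is a minimal sketch, not the paper's exact formulation: it assumes unbiased dropout noise (each feature dropped with probability p and rescaled by 1/(1-p)), a quadratic regularizer with trade-off constant C, no bias term, and a re-weight update of the form gamma_n = 1/sqrt(E[zeta_n^2]) with zeta_n = 1 - y_n w^T x_tilde_n; the function name and these modeling constants are illustrative assumptions.

import numpy as np

def dropout_svm_irls(X, y, p=0.5, C=1.0, n_iters=50, eps=1e-8):
    """Sketch of IRLS training of a linear SVM on the expected hinge loss
    under unbiased dropout noise (feature kept with prob. 1-p, rescaled by
    1/(1-p)).  X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    # For unbiased dropout, E[x_tilde] = x and Var(x_tilde_j) = x_j^2 * p/(1-p).
    S = (p / (1.0 - p)) * X ** 2            # per-example diagonal variances, shape (n, d)
    w = np.zeros(d)
    for _ in range(n_iters):
        # E-step: analytic re-weights gamma_n = 1 / sqrt(E[zeta_n^2]),
        # where zeta_n = 1 - y_n * w^T x_tilde_n under the corrupting distribution.
        mean_margin = 1.0 - y * (X @ w)                    # E[zeta_n]
        second_moment = mean_margin ** 2 + S @ (w ** 2)    # E[zeta_n^2]
        gamma = 1.0 / np.sqrt(second_moment + eps)
        # M-step: the expected re-weighted least-squares objective is quadratic
        # in w, so its minimizer is obtained by a single linear solve.
        A = np.eye(d) + C * (X.T * gamma) @ X              # sum_n gamma_n x_n x_n^T
        A += C * np.diag((gamma[:, None] * S).sum(axis=0)) # dropout variance term
        b = C * (((1.0 + gamma) * y) @ X)                  # sum_n (1 + gamma_n) y_n x_n
        w = np.linalg.solve(A, b)
    return w

Calling w = dropout_svm_irls(X, y, p=0.5) would then return a weight vector trained against the marginalized dropout objective; the closed-form linear solve in each M-step, with analytically updated re-weights in each E-step, is what distinguishes this data-augmentation scheme from purely gradient-based dropout training.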
