Marginalizing Corrupted Features

The goal of machine learning is to develop predictors that generalize well to test data. Ideally, this is achieved by training on an almost infinitely large training data set that captures all variations in the data distribution. In practical learning settings, however, we do not have infinite data and our predictors may overfit. Overfitting may be combated, for example, by adding a regularizer to the training objective or by defining a prior over the model parameters and performing Bayesian inference. In this paper, we propose a third, alternative approach to combat overfitting: we extend the training set with infinitely many artificial training examples that are obtained by corrupting the original training data. We show that this approach is practical and efficient for a range of predictors and corruption models. Our approach, called marginalized corrupted features (MCF), trains robust predictors by minimizing the expected value of the loss function under the corruption model. We show empirically on a variety of data sets that MCF classifiers can be trained efficiently, may generalize substantially better to test data, and are also more robust to feature deletion at test time.
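
To make the idea concrete, the sketch below (a minimal illustration under stated assumptions, not the authors' reference implementation) trains an MCF linear regressor with a quadratic loss under blankout corruption, where each feature is independently zeroed out with probability q. Under this corruption model the expected loss has a closed form, so marginalizing over infinitely many corrupted copies reduces to a single regularized least-squares solve; the function name mcf_quadratic_blankout and the small ridge term lam are illustrative choices, not names taken from the paper.

import numpy as np

def mcf_quadratic_blankout(X, y, q, lam=1e-6):
    """Sketch of MCF with a quadratic loss under blankout corruption.

    Each feature is zeroed independently with probability q, so for a
    corrupted example x_tilde:
        E[x_tilde]       = (1 - q) * x
        Var[x_tilde_d]   = q * (1 - q) * x_d^2
    and the expected squared error of a linear predictor w is
        ((1 - q) w^T x - y)^2 + q (1 - q) * sum_d w_d^2 x_d^2,
    which is minimized in closed form without sampling any corruptions.

    X : (n, d) design matrix, y : (n,) targets, q : corruption rate,
    lam : tiny ridge term added only for numerical stability (assumption).
    """
    mean_term = (1.0 - q) ** 2 * X.T @ X
    var_term = q * (1.0 - q) * np.diag((X ** 2).sum(axis=0))
    A = mean_term + var_term + lam * np.eye(X.shape[1])
    b = (1.0 - q) * X.T @ y
    return np.linalg.solve(A, b)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)
w_mcf = mcf_quadratic_blankout(X, y, q=0.3)

Note that the variance term acts like a data-dependent ridge penalty, which is how marginalizing the corruption regularizes the predictor rather than enlarging the training set explicitly.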
