Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions

ABSTRACT This article examines the role and efficiency of nonconvex loss functions for binary classification problems. In particular, we investigate how to design adaptive and effective boosting algorithms that are robust to outliers in the data or to errors in the observed labels. We demonstrate that nonconvex losses play an important role for prediction accuracy because of their diminishing gradient property: the ability of the losses to adapt efficiently to outlying data. We propose a new boosting framework, called ArchBoost, that exploits the diminishing gradient property directly and leads to boosting algorithms that are provably robust. Along with the ArchBoost framework, a family of nonconvex losses is proposed, which yields new robust boosting algorithms, named adaptive robust boosting (ARB). Furthermore, we develop a new breakdown point analysis and a new influence function analysis that demonstrate gains in robustness. Moreover, based only on local curvatures, we establish statistical and optimization properties of the proposed ArchBoost algorithms with highly nonconvex losses. Extensive numerical and real-data examples illustrate the theoretical properties and reveal advantages over existing boosting methods when the data are perturbed by an adversary or otherwise. Supplementary materials for this article are available online.
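The diminishing gradient property referred to above can be illustrated with a minimal sketch. In gradient-based boosting, the weight an observation receives at each round is proportional to the magnitude of the loss gradient at its current margin. The snippet below (an illustrative comparison, not the authors' ArchBoost or ARB implementation) contrasts the convex exponential loss of AdaBoost with a nonconvex Savage-type loss: the former gives unbounded weight to badly misclassified points, while the latter's gradient vanishes for large negative margins, so outliers and mislabeled points are effectively down-weighted.

import numpy as np

# Margin v = y * F(x): large negative values correspond to badly
# misclassified points (e.g., outliers or label errors).
v = np.linspace(-6, 6, 201)

# Convex exponential loss (AdaBoost): gradient magnitude exp(-v)
# grows without bound as v -> -infinity, so outliers dominate the
# reweighting at every boosting round.
exp_weight = np.exp(-v)

# Nonconvex Savage-type loss (1 + e^v)^(-2): its gradient magnitude
# 2*exp(v) / (1 + exp(v))**3 tends to 0 as v -> -infinity, i.e. the
# diminishing gradient property that down-weights outlying points.
savage_weight = 2 * np.exp(v) / (1 + np.exp(v)) ** 3

print("weight at v = -6 (outlier):  ", exp_weight[0], savage_weight[0])
print("weight at v =  0 (boundary): ", exp_weight[100], savage_weight[100])

Running the sketch shows the exponential-loss weight at v = -6 is roughly 400 while the Savage-type weight is near 0.005, whereas both assign comparable weight to points near the decision boundary; this is the mechanism by which nonconvex losses confer robustness to adversarial or mislabeled data.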
