A view of margin losses as regularizers of probability estimates

Regularization is commonly used in classifier design to ensure good generalization. Classical regularization penalizes classifier complexity by constraining parameters, and is usually combined with a margin loss, which favors large-margin decision rules. A novel and unified view of this architecture is proposed, by showing that margin losses act as regularizers of posterior class probabilities in a way that amplifies classical parameter regularization. The problem of controlling the regularization strength of a margin loss is then considered, using a decomposition of the loss in terms of a link function and a binding function. The link function is shown to determine the regularization strength of the loss, while the binding function determines its outlier robustness. A large class of losses is then categorized into equivalence classes of identical regularization strength or outlier robustness, and it is shown that losses in the same regularization class can be parameterized so as to have tunable regularization strength. This parameterization is finally used to derive boosting algorithms with loss regularization (BoostLR). Three classes of tunable regularization losses are considered in detail. Canonical losses can implement all regularization behaviors but offer no flexibility in outlier modeling. Shrinkage losses support equally parameterized link and binding functions, leading to boosting algorithms that implement the popular shrinkage procedure; this offers a new explanation of shrinkage as a special case of loss-based regularization. Finally, α-tunable losses enable independent parameterization of the link and binding functions, leading to boosting algorithms of great flexibility. This is illustrated by the derivation of an algorithm that generalizes both AdaBoost and LogitBoost, behaving as either one when that best suits the data to classify. Various experiments provide evidence of the benefits of probability regularization for both classification and the estimation of posterior class probabilities.
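
To make the role of the link function concrete, the following is a minimal worked example in terms of two familiar margin losses. These are standard results from the statistical view of boosting rather than results specific to this paper, and the notation is assumed rather than taken from the abstract: $\phi(v)$ denotes a margin loss evaluated at the margin $v = yf(x)$, $\eta = P(y = 1 \mid x)$ the posterior class probability, $f^{*}_{\phi}(\eta)$ the optimal link (the risk-minimizing predictor expressed as a function of $\eta$), and $\hat{\eta}(v)$ the inverse link used to read a posterior probability estimate off the classifier output.

\begin{align*}
\text{exponential loss (AdaBoost):} \quad
  \phi(v) &= e^{-v}, &
  f^{*}_{\phi}(\eta) &= \tfrac{1}{2}\log\tfrac{\eta}{1-\eta}, &
  \hat{\eta}(v) &= \frac{1}{1 + e^{-2v}},\\
\text{logistic loss (LogitBoost):} \quad
  \phi(v) &= \log\!\bigl(1 + e^{-v}\bigr), &
  f^{*}_{\phi}(\eta) &= \log\tfrac{\eta}{1-\eta}, &
  \hat{\eta}(v) &= \frac{1}{1 + e^{-v}}.
\end{align*}

Both links are logit functions up to a scale factor, which is consistent with the claim above that a single, suitably parameterized loss family can interpolate between AdaBoost-like and LogitBoost-like behavior.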
