What killed the Convex Booster?

A landmark negative result of Long and Servedio established a spectacular worst-case failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call on the more than half-century-old founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results, and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate a more general standpoint on the problem, arguing that the source of the negative result lies in the dark side of a pervasive (and otherwise prized) aspect of ML: parameterisation.
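A purely illustrative aside (this sketch and its function names are ours, not the paper's): operationally, a loss for class probability estimation is proper when the true conditional class probability eta = P(Y = +1 | x) minimises its pointwise conditional risk. The short Python check below illustrates this for the log and Brier losses, which are proper, and contrasts them with a linear loss whose risk minimiser collapses to 0 or 1.

```python
# Minimal sketch (not code from the paper): "properness" of a loss for class
# probability estimation means the true conditional probability
# eta = P(Y = +1 | x) minimises the pointwise conditional risk
#     eta * loss(+1, u) + (1 - eta) * loss(-1, u)
# over predicted probabilities u in (0, 1).
import numpy as np

def conditional_risk(loss, eta, u):
    """Expected loss of reporting probability u when P(Y = +1 | x) = eta."""
    return eta * loss(+1, u) + (1.0 - eta) * loss(-1, u)

def log_loss(y, u):       # proper: risk is minimised at u = eta
    return -np.log(u) if y == +1 else -np.log(1.0 - u)

def brier_loss(y, u):     # proper: risk is minimised at u = eta
    return (1.0 - u) ** 2 if y == +1 else u ** 2

def linear_loss(y, u):    # not proper: risk is minimised at u in {0, 1}
    return 1.0 - u if y == +1 else u

grid = np.linspace(1e-4, 1.0 - 1e-4, 9999)   # candidate probability estimates
for eta in (0.2, 0.65, 0.9):
    for name, loss in (("log", log_loss), ("Brier", brier_loss), ("linear", linear_loss)):
        u_star = grid[np.argmin(conditional_risk(loss, eta, grid))]
        print(f"eta = {eta:.2f}  {name:>6} loss: argmin u* = {u_star:.3f}")
```

Properness is a property of the loss alone; the argument above concerns what happens once such losses are fit through a (linear) model class.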

[1] Sheng-Jun Huang, et al. CCMN: A General Framework for Learning With Class-Conditional Multi-Label Noise, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] R. Nock, et al. Being Properly Improper, 2021, ICML.

[3] Christos Tzamos, et al. Boosting in the Presence of Massart Noise, 2021, COLT.

[4] Yang Liu, et al. A Second-Order Approach to Learning with Instance-Dependent Label Noise, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] N. Alon, et al. Boosting simple learners, 2020, STOC.

[6] Kunal Talwar. On the Error Resistance of Hinge Loss Minimization, 2020, NeurIPS.

[7] Masashi Sugiyama, et al. Calibrated Surrogate Losses for Adversarially Robust Classification, 2020, COLT.

[8] Sebastian Pokutta, et al. IPBoost - Non-Convex Boosting via Integer Programming, 2020, ICML.

[9] Richard Nock, et al. Supervised Learning: No Loss No Cry, 2020, ICML.

[10] Mark Bun, et al. Efficient, Noise-Tolerant, and Private Learning via Boosting, 2020, COLT.

[11] Kotagiri Ramamohanarao, et al. Learning with Bounded Instance- and Label-dependent Label Noise, 2017, ICML.

[12] Ross B. Girshick, et al. Focal Loss for Dense Object Detection, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] Eric Eaton, et al. Building more accurate decision trees with the additive tree, 2019, Proceedings of the National Academy of Sciences.

[14] Manfred K. Warmuth, et al. Robust Bi-Tempered Logistic Loss Based on Bregman Divergences, 2019, NeurIPS.

[15] Richard Nock, et al. Lossless or Quantized Boosting with Integer Arithmetic, 2019, ICML.

[16] Masashi Sugiyama, et al. On Symmetric Losses for Learning from Corrupted Labels, 2019, ICML.

[17] Sandhya Tripathi, et al. Cost Sensitive Learning in the Presence of Symmetric Label Noise, 2019, PAKDD.

[18] Manfred K. Warmuth, et al. Two-temperature logistic regression based on the Tsallis divergence, 2017, AISTATS.

[19] Percy Liang, et al. Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss, 2018, NeurIPS.

[20] Aditya Krishna Menon, et al. The risk of trivial solutions in bipartite top ranking, 2018, Machine Learning.

[21] Nagarajan Natarajan, et al. Learning from binary labels with instance-dependent noise, 2018, Machine Learning.

[22] Aritra Ghosh, et al. Robust Loss Functions under Label Noise for Deep Neural Networks, 2017, AAAI.

[23] Aritra Ghosh, et al. On the Robustness of Decision Tree Learning Under Label Noise, 2017, PAKDD.

[24] Lior Wolf, et al. Learning to Count with CNN Boosting, 2016, ECCV.

[25] Gilles Louppe, et al. Lectures on Machine Learning, 2016.

[26] Lu Wang, et al. Risk Minimization in the Presence of Label Noise, 2016, AAAI.

[27] Maria-Florina Balcan, et al. Communication Efficient Distributed Agnostic Boosting, 2015, AISTATS.

[28] Mark D. Reid, et al. Composite Multiclass Losses, 2011, J. Mach. Learn. Res.

[29] J. Eichel. Comparison Of Statistical Experiments, 2016.

[30] Alexander Hanbo Li, et al. Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions, 2015, ArXiv.

[31] Mark D. Reid, et al. Fast rates in statistical and online learning, 2015, J. Mach. Learn. Res.

[32] Aditya Krishna Menon, et al. An Average Classification Algorithm, 2015, ArXiv.

[33] Aditya Krishna Menon, et al. Learning with Symmetric Label Noise: The Importance of Being Unhinged, 2015, NIPS.

[34] Matthieu Geist, et al. Soft-max boosting, 2015, Machine Learning.

[35] Frank Nielsen, et al. Gentle Nearest Neighbors Boosting over Proper Scoring Rules, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36] Koby Crammer, et al. Robust Forward Algorithms via PAC-Bayes and Laplace Distributions, 2014, AISTATS.

[37] Nagarajan Natarajan, et al. Learning with Noisy Labels, 2013, NIPS.

[38] Scott Sanner, et al. Algorithms for Direct 0-1 Loss Optimization in Binary Classification, 2013, ICML.

[39] Gilles Blanchard, et al. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising, 2013, COLT.

[40] Robert E. Schapire, et al. A theory of multiclass boosting, 2010, J. Mach. Learn. Res.

[41] Robert E. Schapire, et al. Explaining AdaBoost, 2013, Empirical Inference.

[42] Tibério S. Caetano, et al. Learning as MAP Inference in Discrete Graphical Models, 2012, NIPS.

[43] Nathan Srebro, et al. Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss, 2012, ICML.

[44] Matus Telgarsky, et al. A Primal-Dual Convergence Analysis of Boosting, 2011, J. Mach. Learn. Res.

[45] Mark D. Reid, et al. Mixability is Bayes Risk Curvature Relative to Log Loss, 2011, COLT.

[46] Rocco A. Servedio, et al. Learning large-margin halfspaces with more malicious noise, 2011, NIPS.

[47] Mark D. Reid, et al. Information, Divergence and Risk for Binary Experiments, 2009, J. Mach. Learn. Res.

[48] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[49] S. V. N. Vishwanathan, et al. T-logistic Regression, 2010, NIPS.

[50] Horst Bischof, et al. Online multi-class LPBoost, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51] Adam Tauman Kalai, et al. Potential-Based Agnostic Boosting, 2009, NIPS.

[52] Nachum Dershowitz, et al. When are Two Algorithms the Same?, 2008, The Bulletin of Symbolic Logic.

[53] Rocco A. Servedio, et al. Adaptive Martingale Boosting, 2008, NIPS.

[54] Frank Nielsen, et al. On the Efficient Minimization of Classification Calibrated Surrogates, 2008, NIPS.

[55] Rocco A. Servedio, et al. Random classification noise defeats all convex potential boosters, 2008, ICML.

[56] Barry Mazur. Proof and other Dilemmas: When is One Thing Equal to Some Other Thing?, 2008.

[57] Frank Nielsen, et al. A Real generalization of discrete AdaBoost, 2006, Artif. Intell.

[58] Rocco A. Servedio, et al. Boosting in the presence of noise, 2003, STOC.

[59] Yishay Mansour, et al. Boosting Using Branching Programs, 2000, J. Comput. Syst. Sci.

[60] J. Friedman. Special Invited Paper - Additive logistic regression: A statistical view of boosting, 2000.

[61] Shun-ichi Amari, et al. Methods of information geometry, 2000.

[62] Yoav Freund, et al. The Alternating Decision Tree Learning Algorithm, 1999, ICML.

[63] Yoram Singer, et al. Improved Boosting Algorithms Using Confidence-rated Predictions, 1998, COLT.

[64] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[65] Yishay Mansour, et al. On the boosting ability of top-down decision tree learning algorithms, 1996, STOC.

[66] Robert C. Williamson, et al. Local minima and attractors at infinity for gradient descent learning algorithms, 1996.

[67] Vladimir Vovk, et al. A game of prediction with expert advice, 1995, COLT.

[68] Uwe Helmke, et al. Existence and uniqueness results for neural network approximations, 1995, IEEE Trans. Neural Networks.

[69] J. Ross Quinlan. C4.5: Programs for Machine Learning, 1992.

[70] Donald E. Knuth. Two notes on notation, 1992.

[71] J. Berger. Statistical Decision Theory and Bayesian Analysis, 1988.

[72] Leslie G. Valiant, et al. On the learnability of Boolean formulae, 1987, STOC.

[73] L. J. Savage. Elicitation of Personal Probabilities and Expectations, 1971.

[74] Peter E. Hart, et al. Nearest neighbor pattern classification, 1967, IEEE Trans. Inf. Theory.

[75] E. H. Shuford, et al. Admissible probability measurement procedures, 1966, Psychometrika.

[76] Abraham Wald. Statistical Decision Functions, 1951.