What killed the Convex Booster?

A landmark negative result of Long and Servedio established a spectacular worst-case failure of a supervised learning trio (loss, algorithm, model) otherwise praised for its high-precision machinery. Hundreds of papers followed up on the two suspected culprits: the loss (for being convex) and/or the algorithm (for fitting a classical boosting blueprint). Here, we call on the more than half-century-old founding theory of losses for class probability estimation (properness), an extension of Long and Servedio's results, and a new general boosting algorithm to demonstrate that the real culprit in their specific context was in fact the (linear) model class. We advocate a more general standpoint on the problem, arguing that the source of the negative result lies in the dark side of a pervasive (and otherwise prized) aspect of ML: parameterisation.
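A purely illustrative aside (this sketch and its function names are ours, not the paper's): operationally, a loss for class probability estimation is proper when the true conditional class probability eta = P(Y = +1 | x) minimises its pointwise conditional risk. The short Python check below illustrates this for the log and Brier losses, which are proper, and contrasts them with a linear loss whose risk minimiser collapses to 0 or 1.

```python
# Minimal sketch (not code from the paper): "properness" of a loss for class
# probability estimation means the true conditional probability
# eta = P(Y = +1 | x) minimises the pointwise conditional risk
#     eta * loss(+1, u) + (1 - eta) * loss(-1, u)
# over predicted probabilities u in (0, 1).
import numpy as np

def conditional_risk(loss, eta, u):
    """Expected loss of reporting probability u when P(Y = +1 | x) = eta."""
    return eta * loss(+1, u) + (1.0 - eta) * loss(-1, u)

def log_loss(y, u):       # proper: risk is minimised at u = eta
    return -np.log(u) if y == +1 else -np.log(1.0 - u)

def brier_loss(y, u):     # proper: risk is minimised at u = eta
    return (1.0 - u) ** 2 if y == +1 else u ** 2

def linear_loss(y, u):    # not proper: risk is minimised at u in {0, 1}
    return 1.0 - u if y == +1 else u

grid = np.linspace(1e-4, 1.0 - 1e-4, 9999)   # candidate probability estimates
for eta in (0.2, 0.65, 0.9):
    for name, loss in (("log", log_loss), ("Brier", brier_loss), ("linear", linear_loss)):
        u_star = grid[np.argmin(conditional_risk(loss, eta, grid))]
        print(f"eta = {eta:.2f}  {name:>6} loss: argmin u* = {u_star:.3f}")
```

Properness is a property of the loss alone; the argument above concerns what happens once such losses are fit through a (linear) model class.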

[1] Sheng-Jun Huang, et al. CCMN: A General Framework for Learning With Class-Conditional Multi-Label Noise, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] R. Nock, et al. Being Properly Improper, 2021, ICML.

[3] Christos Tzamos, et al. Boosting in the Presence of Massart Noise, 2021, COLT.

[4] Yang Liu, et al. A Second-Order Approach to Learning with Instance-Dependent Label Noise, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] N. Alon, et al. Boosting simple learners, 2020, STOC.

[6] Kunal Talwar. On the Error Resistance of Hinge Loss Minimization, 2020, NeurIPS.

[7] Masashi Sugiyama, et al. Calibrated Surrogate Losses for Adversarially Robust Classification, 2020, COLT.

[8] Sebastian Pokutta, et al. IPBoost - Non-Convex Boosting via Integer Programming, 2020, ICML.

[9] Richard Nock, et al. Supervised Learning: No Loss No Cry, 2020, ICML.

[10] Mark Bun, et al. Efficient, Noise-Tolerant, and Private Learning via Boosting, 2020, COLT.

[11] Kotagiri Ramamohanarao, et al. Learning with Bounded Instance- and Label-dependent Label Noise, 2017, ICML.

[12] Ross B. Girshick, et al. Focal Loss for Dense Object Detection, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13] Eric Eaton, et al. Building more accurate decision trees with the additive tree, 2019, Proceedings of the National Academy of Sciences.

[14] Manfred K. Warmuth, et al. Robust Bi-Tempered Logistic Loss Based on Bregman Divergences, 2019, NeurIPS.

[15] Richard Nock, et al. Lossless or Quantized Boosting with Integer Arithmetic, 2019, ICML.

[16] Masashi Sugiyama, et al. On Symmetric Losses for Learning from Corrupted Labels, 2019, ICML.

[17] Sandhya Tripathi, et al. Cost Sensitive Learning in the Presence of Symmetric Label Noise, 2019, PAKDD.

[18] Manfred K. Warmuth, et al. Two-temperature logistic regression based on the Tsallis divergence, 2017, AISTATS.

[19] Percy Liang, et al. Uncertainty Sampling is Preconditioned Stochastic Gradient Descent on Zero-One Loss, 2018, NeurIPS.

[20] Aditya Krishna Menon, et al. The risk of trivial solutions in bipartite top ranking, 2018, Machine Learning.

[21] Nagarajan Natarajan, et al. Learning from binary labels with instance-dependent noise, 2018, Machine Learning.

[22] Aritra Ghosh, et al. Robust Loss Functions under Label Noise for Deep Neural Networks, 2017, AAAI.

[23] Aritra Ghosh, et al. On the Robustness of Decision Tree Learning Under Label Noise, 2017, PAKDD.

[24] Lior Wolf, et al. Learning to Count with CNN Boosting, 2016, ECCV.

[25] Gilles Louppe, et al. Lectures on Machine Learning, 2016.

[26] Lu Wang, et al. Risk Minimization in the Presence of Label Noise, 2016, AAAI.

[27] Maria-Florina Balcan, et al. Communication Efficient Distributed Agnostic Boosting, 2015, AISTATS.

[28] Mark D. Reid, et al. Composite Multiclass Losses, 2011, J. Mach. Learn. Res.

[29] J. Eichel. Comparison Of Statistical Experiments, 2016.

[30] Alexander Hanbo Li, et al. Boosting in the Presence of Outliers: Adaptive Classification With Nonconvex Loss Functions, 2015, ArXiv.

[31] Mark D. Reid, et al. Fast rates in statistical and online learning, 2015, J. Mach. Learn. Res.

[32] Aditya Krishna Menon, et al. An Average Classification Algorithm, 2015, ArXiv.

[33] Aditya Krishna Menon, et al. Learning with Symmetric Label Noise: The Importance of Being Unhinged, 2015, NIPS.

[34] Matthieu Geist, et al. Soft-max boosting, 2015, Machine Learning.

[35] Frank Nielsen, et al. Gentle Nearest Neighbors Boosting over Proper Scoring Rules, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36] Koby Crammer, et al. Robust Forward Algorithms via PAC-Bayes and Laplace Distributions, 2014, AISTATS.

[37] Nagarajan Natarajan, et al. Learning with Noisy Labels, 2013, NIPS.

[38] Scott Sanner, et al. Algorithms for Direct 0-1 Loss Optimization in Binary Classification, 2013, ICML.

[39] Gilles Blanchard, et al. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising, 2013, COLT.

[40] Robert E. Schapire, et al. A theory of multiclass boosting, 2010, J. Mach. Learn. Res.

[41] Robert E. Schapire, et al. Explaining AdaBoost, 2013, Empirical Inference.

[42] Tibério S. Caetano, et al. Learning as MAP Inference in Discrete Graphical Models, 2012, NIPS.

[43] Nathan Srebro, et al. Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss, 2012, ICML.

[44] Matus Telgarsky, et al. A Primal-Dual Convergence Analysis of Boosting, 2011, J. Mach. Learn. Res.

[45] Mark D. Reid, et al. Mixability is Bayes Risk Curvature Relative to Log Loss, 2011, COLT.

[46] Rocco A. Servedio, et al. Learning large-margin halfspaces with more malicious noise, 2011, NIPS.

[47] Mark D. Reid, et al. Information, Divergence and Risk for Binary Experiments, 2009, J. Mach. Learn. Res.

[48] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[49] S. V. N. Vishwanathan, et al. T-logistic Regression, 2010, NIPS.

[50] Horst Bischof, et al. Online multi-class LPBoost, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51] Adam Tauman Kalai, et al. Potential-Based Agnostic Boosting, 2009, NIPS.

[52] Nachum Dershowitz, et al. When are Two Algorithms the Same?, 2008, The Bulletin of Symbolic Logic.

[53] Rocco A. Servedio, et al. Adaptive Martingale Boosting, 2008, NIPS.

[54] Frank Nielsen, et al. On the Efficient Minimization of Classification Calibrated Surrogates, 2008, NIPS.

[55] Rocco A. Servedio, et al. Random classification noise defeats all convex potential boosters, 2008, ICML.

[56] Barry Mazur. Proof and other Dilemmas: When is One Thing Equal to Some Other Thing?, 2008.

[57] Frank Nielsen, et al. A Real generalization of discrete AdaBoost, 2006, Artif. Intell.

[58] Rocco A. Servedio, et al. Boosting in the presence of noise, 2003, STOC.

[59] Yishay Mansour, et al. Boosting Using Branching Programs, 2000, J. Comput. Syst. Sci.

[60] J. Friedman. Special Invited Paper - Additive logistic regression: A statistical view of boosting, 2000.

[61] Shun-ichi Amari, et al. Methods of information geometry, 2000.

[62] Yoav Freund, et al. The Alternating Decision Tree Learning Algorithm, 1999, ICML.

[63] Yoram Singer, et al. Improved Boosting Algorithms Using Confidence-rated Predictions, 1998, COLT.

[64] Yoav Freund, et al. Boosting the margin: A new explanation for the effectiveness of voting methods, 1997, ICML.

[65] Yishay Mansour, et al. On the boosting ability of top-down decision tree learning algorithms, 1996, STOC.

[66] Robert C. Williamson, et al. Local minima and attractors at infinity for gradient descent learning algorithms, 1996.

[67] Vladimir Vovk, et al. A game of prediction with expert advice, 1995, COLT.

[68] Uwe Helmke, et al. Existence and uniqueness results for neural network approximations, 1995, IEEE Trans. Neural Networks.

[69] J. Ross Quinlan. C4.5: Programs for Machine Learning, 1992.

[70] Donald E. Knuth. Two notes on notation, 1992.

[71] J. Berger. Statistical Decision Theory and Bayesian Analysis, 1988.

[72] Leslie G. Valiant, et al. On the learnability of Boolean formulae, 1987, STOC.

[73] L. J. Savage. Elicitation of Personal Probabilities and Expectations, 1971.

[74] Peter E. Hart, et al. Nearest neighbor pattern classification, 1967, IEEE Trans. Inf. Theory.

[75] E. H. Shuford, et al. Admissible probability measurement procedures, 1966, Psychometrika.

[76] Abraham Wald. Statistical Decision Functions, 1951.