A Discussion of:
"Process Consistency for AdaBoost" by Wenxin Jiang
"On the Bayes-risk consistency of regularized boosting methods" by Gábor Lugosi and Nicolas Vayatis
"Statistical Behavior and Consistency of Classification Methods based on Convex Risk Minimization" by Tong Zhang

The notion of a boosting algorithm was originally introduced by Valiant in the context of the “probably approximately correct” (PAC) model of learnability [19]. In this context, boosting is a method for provably improving the accuracy of any “weak” classification learning algorithm. The first boosting algorithm was invented by Schapire [16] and the second by Freund [2]. These two algorithms were introduced for a specific theoretical purpose, but since the introduction of AdaBoost [5], quite a number of perspectives on boosting have emerged. For instance, AdaBoost can be understood as a method for maximizing the “margins” or “confidences” of the training examples [17]; as a technique for playing repeated matrix games [4, 6]; as a linear or convex programming method [15]; as a functional gradient-descent technique [8, 13, 14, 3]; as a technique for Bregman-distance optimization in a broader framework that includes logistic regression [1, 10, 12]; and finally as a stepwise model-fitting method for minimizing the exponential loss function, an approximation of the negative log binomial likelihood [7]. The current papers add to this list of perspectives, giving a view of boosting that is very different from its original interpretation and analysis as an algorithm for improving the accuracy of a weak learner. These many points of view add to the richness of the theory of boosting, and they are enormously helpful in the practical design of new or better algorithms for machine learning and statistical inference.

Originally, boosting algorithms were designed expressly for classification. The goal in this setting is to accurately predict the classification of a new example: either the prediction is correct or it is not, and no attempt is made to estimate the conditional probability of each class. In practice, this is sometimes not enough, since we may want some sense of how likely our prediction is to be correct, or we may want to incorporate numbers that look like probabilities into a larger system. Friedman, Hastie and Tibshirani [7] later showed that AdaBoost can in fact be used to estimate such probabilities, arguing that AdaBoost approximates a form of logistic regression. They and others [1] subsequently modified AdaBoost to explicitly minimize the loss function associated with logistic regression, with the intention of computing such estimated probabilities. In one of the current papers, Zhang vastly generalizes this approach, showing that conditional probability estimates P{y | x} can be obtained when minimizing any smooth convex loss function.
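To make this probability-estimation view concrete, here is a brief sketch in our own notation, which is not taken from any of the three papers under discussion: write η(x) = P{y = 1 | x}, let F be a real-valued prediction function, and let φ be a differentiable convex margin loss. The conditional φ-risk at x is

\[
C_\phi(F; x) \;=\; \eta(x)\,\phi\bigl(F(x)\bigr) \;+\; \bigl(1-\eta(x)\bigr)\,\phi\bigl(-F(x)\bigr),
\]

and setting its derivative with respect to F(x) to zero at the population minimizer F_\phi^* gives

\[
\eta(x) \;=\; \frac{\phi'\bigl(-F_\phi^*(x)\bigr)}{\phi'\bigl(F_\phi^*(x)\bigr) + \phi'\bigl(-F_\phi^*(x)\bigr)}.
\]

For the exponential loss φ(v) = e^{-v} used by AdaBoost this link becomes η(x) = 1/(1 + e^{-2F^*(x)}), and for the logistic loss φ(v) = log(1 + e^{-v}) it becomes η(x) = 1/(1 + e^{-F^*(x)}). Roughly speaking, this is the kind of inversion that underlies the recovery of conditional probabilities from the minimizer of a general smooth convex loss.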

[1] Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM 27 1134–1142.

[2] Freund, Y. (1990). Boosting a weak learning algorithm by majority. In Proceedings of the Third Annual Workshop on Computational Learning Theory.

[3] Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197–227.

[4] Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Second European Conference (EuroCOLT '95).

[5] Freund, Y. and Schapire, R. E. (1996). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory.

[6] Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 119–139.

[7] Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1997). Boosting the margin: a new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference.

[8] Freund, Y. and Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 79–103.

[9] Mason, L., Baxter, J., Bartlett, P. L. and Frean, M. (1999). Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12.

[10] Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.

[11] Kivinen, J. and Warmuth, M. K. (1999). Boosting as entropy projection. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.

[12] Rätsch, G., et al. (2000). Barrier boosting. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

[13] Freund, Y. and Schapire, R. E. (2000). Discussion of "Additive logistic regression: a statistical view of boosting" by J. Friedman, T. Hastie and R. Tibshirani. The Annals of Statistics 28.

[14] Vapnik, V. N. (2000). The Nature of Statistical Learning Theory, 2nd ed. Springer, New York.

[15] Mason, L., Baxter, J., Bartlett, P. L. and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, Cambridge, MA.

[16] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29 1189–1232.

[17] Lebanon, G. and Lafferty, J. (2001). Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems 14.

[18] Shafer, G. and Vovk, V. (2001). Probability and Finance: It's Only a Game! Wiley, New York.

[19] Koltchinskii, V. and Panchenko, D. (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics 30 1–50.

[20] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.

[21] Collins, M., Schapire, R. E. and Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning 48 253–285.