The Rate of Convergence of AdaBoost

The AdaBoost algorithm was designed to combine many "weak" hypotheses that perform slightly better than random guessing into a "strong" hypothesis that has very low error. We study the rate at which AdaBoost iteratively converges to the minimum of the "exponential loss." Unlike previous work, our proofs do not require a weak-learning assumption, nor do they require that minimizers of the exponential loss are finite. Our first result shows that the exponential loss of AdaBoost's computed parameter vector will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B in a number of rounds that is at most a polynomial in B and 1/ε. We also provide lower bounds showing that a polynomial dependence is necessary. Our second result is that within C/ε iterations, AdaBoost achieves a value of the exponential loss that is at most ε more than the best possible value, where C depends on the data set. We show that this dependence of the rate on ε is optimal up to constant factors, that is, at least Ω(1/ε) rounds are necessary to achieve within ε of the optimal exponential loss.
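To make the quantities in these results concrete, the following is a minimal sketch of the objective and the form of the two guarantees. The notation (m examples, N weak hypotheses h_j, combination weights λ, accuracy parameter ε) is assumed here, since the abstract itself does not fix it.

% Exponential loss of a linear combination \lambda of weak hypotheses
% h_1,\dots,h_N on examples (x_1,y_1),\dots,(x_m,y_m) with y_i \in \{-1,+1\}
% (notation assumed; not fixed in the abstract).
\[
  L(\lambda) \;=\; \frac{1}{m} \sum_{i=1}^{m}
    \exp\!\Bigl(-y_i \sum_{j=1}^{N} \lambda_j h_j(x_i)\Bigr).
\]
% First result: after T = \mathrm{poly}(B, 1/\varepsilon) rounds, the iterate
% \lambda_T computed by AdaBoost satisfies
\[
  L(\lambda_T) \;\le\; \min_{\|\lambda\|_1 \le B} L(\lambda) \;+\; \varepsilon .
\]
% Second result: after T = C/\varepsilon rounds, with C depending on the data set,
\[
  L(\lambda_T) \;\le\; \inf_{\lambda} L(\lambda) \;+\; \varepsilon .
\]

The first guarantee is relative to the best bounded-norm competitor, which is why the rate may depend polynomially on B; the second is relative to the unconstrained infimum, which need not be attained by any finite λ.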
