Boosting with the L2-Loss: Regression and Classification
Peter Bühlmann, Bin Yu

This paper investigates a computationally simple variant of boosting, L2Boost, which is constructed from a functional gradient descent algorithm with the L2-loss function. Like other boosting algorithms, L2Boost applies a pre-chosen fitting method, called the learner, repeatedly in an iterative fashion. Based on the explicit expression for the refitting of residuals in L2Boost, the case of (symmetric) linear learners is studied in detail for both regression and classification. In particular, with the boosting iteration m acting as the smoothing or regularization parameter, a new exponential bias-variance trade-off is found, with the variance (complexity) term increasing very slowly as m tends to infinity. When the learner is a smoothing spline, an optimal rate of convergence result holds for both regression and classification, and the boosted smoothing spline even adapts to higher-order, unknown smoothness. Moreover, a simple expansion of a (smoothed) 0-1 loss function is derived to reveal the importance of the decision boundary, bias reduction, and the impossibility of an additive bias-variance decomposition in classification. Finally, simulation and real data set results are presented to demonstrate the attractiveness of L2Boost. In particular, we demonstrate that L2Boosting with a novel component-wise cubic smoothing spline is both practical and effective in the presence of high-dimensional predictors.
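The core of L2Boost is simply repeated refitting of residuals: start from an initial fit, and at each step fit the base learner to the current residuals and add that fit to the ensemble, with the number of iterations m playing the role of the regularization parameter. The following is a minimal sketch of this idea, under the assumption of a generic scikit-learn-style regressor as the base learner; the paper itself studies (symmetric) linear learners such as component-wise cubic smoothing splines, for which a shallow regression tree is swapped in here purely for illustration.

```python
# Minimal sketch of L2Boosting: iteratively refit the base learner to residuals.
# Assumption: a shallow regression tree stands in for the paper's smoothing-spline learner.
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def l2_boost(X, y, n_iter=50, make_learner=lambda: DecisionTreeRegressor(max_depth=2)):
    """Fit an L2Boost ensemble; n_iter acts as the smoothing/regularization parameter."""
    fitted = np.zeros_like(y, dtype=float)
    learners = []
    for _ in range(n_iter):
        u = y - fitted                    # current residuals
        g = make_learner().fit(X, u)      # refit the base learner to the residuals
        fitted += g.predict(X)            # update the boosted fit
        learners.append(g)

    def predict(X_new):
        return sum(g.predict(X_new) for g in learners)

    return predict


# Toy usage on a one-dimensional regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.3 * rng.standard_normal(200)
f_hat = l2_boost(X, y, n_iter=30)
```

For classification, the same scheme can be applied to the (recoded) 0-1 labels with the L2 loss, with the sign of the boosted fit used as the classifier; stopping the iteration early is what controls the bias-variance trade-off discussed above.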
