Optimization by gradient boosting

Gradient boosting is a state-of-the-art prediction technique that sequentially builds a model as a linear combination of simple predictors (typically decision trees) by solving an infinite-dimensional convex optimization problem. In the present paper we provide a thorough analysis of two widespread versions of gradient boosting and introduce a general framework for studying these algorithms from the point of view of functional optimization. We prove their convergence as the number of iterations tends to infinity and highlight the importance of having a strongly convex risk functional to minimize. We also present a reasonable statistical context ensuring consistency properties of the boosting predictors as the sample size grows. In our approach, the optimization procedures are run forever (that is, without resorting to an early stopping strategy), and statistical regularization is achieved essentially via an appropriate $L^2$ penalization of the loss together with strong convexity arguments.

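To make the functional-optimization viewpoint concrete, here is a minimal sketch (not the authors' exact algorithm) of gradient boosting read as functional gradient descent on an $L^2$-penalized squared-error risk, with shallow regression trees as base learners and a fixed step size rather than early stopping. The function and parameter names (boost, nu, lam, n_rounds) are illustrative assumptions, not taken from the paper.

```python
# Sketch: functional gradient descent on a penalized squared-error risk
#   C_n(F) = (1/n) * sum_i (y_i - F(x_i))^2 + lam * (1/n) * sum_i F(x_i)^2
# The negative gradient at a sample point is proportional to
#   (y_i - F(x_i)) - lam * F(x_i),
# and each round fits a shallow tree to this direction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=200, nu=0.1, lam=0.01, max_depth=2):
    """Return a list of (tree, step) pairs whose sum approximates the
    minimizer of the penalized empirical risk C_n."""
    n = len(y)
    F = np.zeros(n)               # current predictor evaluated on the sample
    ensemble = []
    for _ in range(n_rounds):
        # Negative functional gradient direction (constant 2/n absorbed in the step size).
        residual = (y - F) - lam * F
        # Weak learner: a shallow regression tree fitted to the gradient direction.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        F += nu * tree.predict(X)  # fixed small step, no early stopping
        ensemble.append((tree, nu))
    return ensemble

def predict(ensemble, X):
    return sum(step * tree.predict(X) for tree, step in ensemble)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(500, 2))
    y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500)
    model = boost(X, y)
    print("training MSE:", np.mean((y - predict(model, X)) ** 2))
```

In this reading, the $L^2$ penalty (lam) shrinks the gradient toward zero at every round, which is what keeps the iterates bounded even though the procedure is run for arbitrarily many rounds instead of being stopped early.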