Boosting with early stopping: Convergence and consistency

Boosting is one of the most significant advances in machine learning for classification and regression. In its original and computationally flexible version, boosting seeks to minimize an empirical loss function in a greedy fashion. The resulting estimator takes an additive function form and is built iteratively by applying a base estimator (or learner) to updated samples that depend on the previous iterations. An unusual regularization technique, early stopping, is employed based on cross-validation or a test set. This paper studies the numerical convergence, consistency and statistical rates of convergence of boosting with early stopping, when boosting is carried out over the linear span of a family of basis functions. For general loss functions, we prove the convergence of boosting's greedy optimization to the infimum of the loss function over the linear span. Using this numerical convergence result, we find early-stopping strategies under which boosting is shown to be consistent based on i.i.d. samples, and we obtain bounds on the rates of convergence for boosting estimators. Simulation studies are also presented to illustrate the relevance of our theoretical results for providing insights into practical aspects of boosting. As a side product, these results also reveal the importance of restricting the greedy search step sizes, as known in practice through the work of Friedman and others. Moreover, our results lead to a rigorous proof that for a linearly separable problem, AdaBoost with ε → 0 step size becomes an L1-margin maximizer when left to run to convergence.
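
The abstract describes boosting as greedy stagewise minimization of an empirical loss over the linear span of a base class, with a small restricted step size and early stopping chosen on a held-out or cross-validation set. The following is a minimal sketch of that recipe, assuming squared-error loss, decision stumps as the base class, and a held-out validation split; the names fit_stump, epsilon, max_rounds and patience are illustrative and not from the paper. Stopping at the round with the best validation loss plays the role of the early-stopping rule, and the small fixed ε corresponds to the restricted step sizes the results highlight.

```python
# Sketch only: L2-boosting over the span of sign stumps, with a restricted
# step size and early stopping on a validation set. Illustrative, not the
# paper's algorithm or notation.

import numpy as np

def fit_stump(x, residual):
    """Greedily pick the threshold/sign stump that best fits the residual."""
    best = None
    for t in np.unique(x):
        for sign in (1.0, -1.0):
            pred = sign * np.where(x <= t, 1.0, -1.0)
            loss = np.mean((residual - pred) ** 2)
            if best is None or loss < best[0]:
                best = (loss, t, sign)
    _, t, sign = best
    return lambda z, t=t, sign=sign: sign * np.where(z <= t, 1.0, -1.0)

def boost(x_train, y_train, x_val, y_val, epsilon=0.1, max_rounds=500, patience=20):
    """Greedy boosting on residuals; stop when validation loss stops improving."""
    F_train = np.zeros_like(y_train)
    F_val = np.zeros_like(y_val)
    ensemble = []
    best_val, best_len, since_best = np.inf, 0, 0
    for _ in range(max_rounds):
        h = fit_stump(x_train, y_train - F_train)   # greedy step on residuals
        F_train = F_train + epsilon * h(x_train)    # restricted step size
        F_val = F_val + epsilon * h(x_val)
        ensemble.append(h)
        val_loss = np.mean((y_val - F_val) ** 2)
        if val_loss < best_val:
            best_val, best_len, since_best = val_loss, len(ensemble), 0
        else:
            since_best += 1
            if since_best >= patience:              # early stopping
                break
    return ensemble[:best_len], epsilon

# Usage: learn a noisy step function and evaluate the stopped ensemble.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 300)
y = np.where(x > 0, 1.0, -1.0) + 0.3 * rng.normal(size=300)
ensemble, eps = boost(x[:200], y[:200], x[200:], y[200:])
predict = lambda z: eps * sum(h(z) for h in ensemble)
print(len(ensemble), np.mean((y[200:] - predict(x[200:])) ** 2))
```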
