Early Stopping for Kernel Boosting Algorithms: A General Analysis With Localized Complexities

Early stopping of iterative algorithms is a widely used form of regularization in statistics, commonly applied in conjunction with boosting and related gradient-type algorithms. Although consistency results have been established in some settings, such estimators are less well-understood than their analogues based on penalized regularization. In this paper, for a relatively broad class of loss functions and boosting algorithms (including $L^{2}$-boost, LogitBoost, and AdaBoost, among others), we exhibit a direct connection between the performance of a stopped iterate and the localized Gaussian complexity of the associated function class. This connection allows us to show that the local fixed point analysis of Gaussian or Rademacher complexities, now standard in the analysis of penalized estimators, can be used to derive optimal stopping rules. We derive such stopping rules in detail for various kernel classes and illustrate the correspondence of our theory with practice for Sobolev kernel classes.
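
The sketch below is a minimal illustration of the kind of stopping rule described in the abstract, specialized to $L^{2}$-boosting (kernel gradient descent on the squared loss). It stops after roughly $1/(\text{step}\cdot\delta_n^2)$ iterations, where $\delta_n$ is an empirical critical radius computed from the eigenvalues of the kernel matrix, in the spirit of a localized-complexity fixed point. The Gaussian kernel, the search grid, and the constant $2\sigma$ in the critical inequality are illustrative assumptions, not the paper's exact prescription.

```python
# Illustrative sketch (assumptions noted above): L2-boosting in an RKHS with an
# early-stopping rule driven by an empirical critical radius delta_n.
import numpy as np

def kernel_matrix(X, bandwidth=1.0):
    """Gaussian kernel Gram matrix (illustrative choice of kernel)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def critical_radius(K, sigma, grid=np.logspace(-3, 0, 200)):
    """Smallest delta on the grid with
    sqrt(mean(min(delta^2, mu_i))) <= delta^2 / (2*sigma),
    where mu_i are the eigenvalues of K/n (empirical localized kernel complexity).
    The factor 2*sigma is an illustrative constant."""
    n = K.shape[0]
    mu = np.maximum(np.linalg.eigvalsh(K / n), 0.0)
    for delta in grid:
        lhs = np.sqrt(np.mean(np.minimum(delta ** 2, mu)))
        if lhs <= delta ** 2 / (2.0 * sigma):
            return delta
    return grid[-1]

def l2_boost_early_stop(X, y, sigma, step=1.0, bandwidth=1.0):
    """Kernel gradient descent on the empirical squared loss, stopped after
    T ~ 1 / (step * delta_n^2) iterations."""
    n = len(y)
    K = kernel_matrix(X, bandwidth)
    delta_n = critical_radius(K, sigma)
    T = int(np.ceil(1.0 / (step * delta_n ** 2)))
    f = np.zeros(n)  # fitted values at the design points
    for _ in range(T):
        # functional gradient step for the squared loss, in fitted-value form
        f = f - (step / n) * K @ (f - y)
    return f, T, delta_n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(100, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(100)
    f_hat, T, delta_n = l2_boost_early_stop(X, y, sigma=0.3)
    print(f"stopped after T={T} iterations (critical radius {delta_n:.3f})")
```

For other losses covered by the theory (e.g., logistic or exponential loss), the same recipe applies with the squared-loss gradient replaced by the corresponding functional gradient; only the illustrative constants would change.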
