Early stopping for non-parametric regression: An optimal data-dependent stopping rule

The goal of non-parametric regression is to estimate an unknown function $f^*$ based on $n$ i.i.d. observations of the form $y_i = f^*(x_i) + w_i$, where $\{w_i\}_{i=1}^n$ are additive noise variables. Simply choosing a function to minimize the least-squares loss $\frac{1}{2n}\sum_{i=1}^n (y_i - f(x_i))^2$ will lead to "overfitting", so various estimators are based on different types of regularization. The early stopping strategy is to run an iterative algorithm such as gradient descent for a fixed but finite number of iterations. Early stopping is known to yield estimates with better prediction accuracy than those obtained by running the algorithm to convergence. Although bounds on this prediction error are known for certain function classes and step-size choices, the bias-variance tradeoffs for arbitrary reproducing kernel Hilbert spaces (RKHSs) and arbitrary choices of step-sizes have not been well understood to date. In this paper, we derive upper bounds on both the $L^2(P_n)$ and $L^2(P)$ error for arbitrary RKHSs, and provide an explicit and easily computable data-dependent stopping rule. In particular, it depends only on the sum of the step-sizes and the eigenvalues of the empirical kernel matrix for the RKHS. For Sobolev spaces and finite-rank kernel classes, we show that our stopping rule yields estimates that achieve the statistically optimal rates in a minimax sense.
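The abstract describes gradient descent on the least-squares loss over an RKHS, halted by a rule that depends only on the running sum of step-sizes and the eigenvalues of the empirical kernel matrix. The sketch below is a minimal illustration of that idea, not the paper's exact rule: it assumes a Gaussian kernel, a constant step-size, a known noise level `noise_sd`, and a hypothetical threshold of the form $1/(2e\,\sigma\,\eta_t)$ applied to a localized sum of the empirical eigenvalues; the names `rbf_kernel`, `local_kernel_complexity`, and `early_stopped_kernel_gd` are illustrative.

```python
import numpy as np

def rbf_kernel(X, Z, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and Z."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def local_kernel_complexity(eigvals, eps, n):
    """Localized empirical kernel complexity: sqrt((1/n) * sum_i min(lambda_i, eps^2))."""
    return np.sqrt(np.minimum(eigvals, eps ** 2).sum() / n)

def early_stopped_kernel_gd(X, y, noise_sd, step=0.25, bandwidth=1.0, max_iter=10_000):
    """Gradient descent on the least-squares loss over an RKHS, stopped early.

    The stopping test is a hypothetical instantiation of a data-dependent rule:
    it halts once the localized complexity of the empirical kernel eigenvalues,
    evaluated at radius 1/sqrt(eta_t) with eta_t the running sum of step-sizes,
    exceeds a noise-dependent threshold proportional to 1/eta_t.
    """
    n = len(y)
    K = rbf_kernel(X, X, bandwidth)
    # eigenvalues of the normalized empirical kernel matrix K/n
    eigvals = np.clip(np.linalg.eigvalsh(K) / n, 0.0, None)

    f_hat = np.zeros(n)   # fitted values at the design points
    eta = 0.0             # running sum of step-sizes
    for t in range(max_iter):
        eta_next = eta + step
        radius = 1.0 / np.sqrt(eta_next)
        threshold = 1.0 / (2.0 * np.e * noise_sd * eta_next)
        if local_kernel_complexity(eigvals, radius, n) > threshold:
            break  # stop before crossing the bias-variance threshold
        # gradient step on (1/2n) * sum_i (y_i - f(x_i))^2 in the RKHS
        f_hat = f_hat - step * (K / n) @ (f_hat - y)
        eta = eta_next
    return f_hat, t
```

As a rough sanity check, one could generate data such as $y_i = \sin(2\pi x_i) + w_i$ with Gaussian noise and verify that the selected number of iterations grows with the sample size $n$, consistent with the bias-variance tradeoff the abstract describes.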
