Risk of penalized least squares, greedy selection and ℓ1-penalization for flexible function libraries

For function estimation using penalized squared error criteria, we derive generally applicable risk bounds, showing the balance of accuracy of approximation and penalty relative to the sample size. Attention is given to linear combinations of terms from a given class (such as those used in neural network models, projection pursuit regression, function aggregation and multiple linear regression). The risk bounds apply to forward stepwise selection and other relaxed greedy algorithms with a penalty on the number of terms, and to ℓ1-penalized least squares, for which we develop a fast algorithm.

1. Introduction. Flexible regression models are built by combining simple functional forms. When fitting such models to data in a training sample, empirical performance criteria such as penalized squared error play a role in selecting components of the function from a given library of candidate terms. With a suitable penalty, optimizing the criterion adapts the total weights of combination or the number of components, as well as the subset of terms to include. The aim is to produce function estimates that accurately predict responses for new input values with the same distribution as the sample. This generalization capability is characterized by the mean squared error as the statistical risk. In this context, our paper has several interwoven objectives:

1. To analyze the performance of penalized least squares estimators through a theory of acceptable penalties, such that the estimator optimizing the empirical criterion has risk characterized by a corresponding population tradeoff of approximation and penalty relative to the sample size (see the display below).

2. To allow for flexible function fitting using linear combinations of terms selected from various large or even infinite libraries of functions.

3. To establish that greedy term selection solves the ℓ1-penalized squared error problem, with accuracy bounds that compare favorably with competing convex optimization algorithms for large libraries (see the code sketch following the display).

4. To demonstrate that two different estimators, one based on forward stepwise selection with a penalty on the number of terms and the other with a penalty on the ℓ1 norm of coefficients, achieve approximately the same risk, both for target functions with controlled ℓ1 norm of coefficients and for functions in the interpolation classes between these and all of L2.
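To fix notation, here is a minimal reading of the criterion described above; the symbols $\hat f$, $\mathrm{pen}$, $\lambda$ and the form of the risk bound are our paraphrase rather than quotations from the paper. The estimator minimizes squared error plus a penalty over linear combinations of terms $h_j$ from the library $\mathcal{H}$:

$$
\hat f \;=\; \arg\min_{f} \Big\{ \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \frac{\mathrm{pen}(f)}{n} \Big\},
\qquad
f(x) = \sum_{j} \beta_j h_j(x), \quad h_j \in \mathcal{H},
$$

with $\mathrm{pen}(f)$ proportional to the number of terms for forward stepwise selection, or $\mathrm{pen}(f) = \lambda \sum_j |\beta_j|$ for ℓ1 penalization. A penalty is acceptable, roughly, when the risk of $\hat f$ is bounded by the corresponding population tradeoff

$$
\mathbb{E}\,\|\hat f - f^{\ast}\|^{2} \;\lesssim\; \inf_{f} \Big\{ \|f - f^{\ast}\|^{2} + \frac{\mathrm{pen}(f)}{n} \Big\},
$$

where $f^{\ast}$ is the target regression function: the balance of approximation accuracy and penalty relative to the sample size referred to above.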

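As a concrete illustration of objective 3, the following is a minimal sketch, not the paper's own algorithm: a greedy coordinate-selection scheme for the ℓ1-penalized criterion above, in the spirit of the relaxed greedy methods the paper analyzes. Each step scores all candidate terms and then updates only the single coordinate whose exact one-dimensional move most decreases the penalized objective. The function names, the NumPy formulation, and the matrix `H` of library terms evaluated at the sample points are our assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    # sign(z) * max(|z| - t, 0): the shrinkage induced by the l1 penalty
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def greedy_l1_least_squares(H, y, lam, n_steps=200, tol=1e-10):
    """Greedy coordinate selection for
        (1/n) * ||y - H @ beta||^2 + lam * ||beta||_1,
    where H is the n-by-p matrix of library terms evaluated at the data
    (columns assumed nonzero). Each step moves the one coordinate whose
    exact one-dimensional update gives the largest objective decrease.
    """
    n, p = H.shape
    beta = np.zeros(p)
    resid = y.astype(float)               # resid = y - H @ beta
    a = (H ** 2).sum(axis=0) / n          # a_j = (1/n) * ||h_j||^2
    for _ in range(n_steps):
        # Along coordinate j the objective is a_j*x^2 - 2*c_j*x + lam*|x| + const.
        c = H.T @ resid / n + a * beta
        b_new = soft_threshold(c, lam / 2.0) / a
        # Exact decrease of the objective for each candidate 1-d move.
        drop = (a * (beta ** 2 - b_new ** 2)
                - 2.0 * c * (beta - b_new)
                + lam * (np.abs(beta) - np.abs(b_new)))
        j = int(np.argmax(drop))
        if drop[j] <= tol:                # no coordinate improves: done
            break
        resid += H[:, j] * (beta[j] - b_new[j])
        beta[j] = b_new[j]
    return beta
```

On a toy problem, say `H = rng.standard_normal((100, 500))` with a sparse 3-term target, a few hundred steps recover a sparse `beta`; the per-step cost is dominated by the single matrix-vector product `H.T @ resid`, which is what keeps the scheme workable when the library is large.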