Regularization in statistics

This paper is a selective review of the regularization methods scattered across the statistics literature. We introduce a general conceptual approach to regularization and fit most existing methods into it. We focus on the importance of regularization when dealing with today's high-dimensional objects: data and models. A wide range of examples is discussed, including nonparametric regression, boosting, covariance matrix estimation, principal component estimation, and subsampling.
