Regularization in statistics

This paper is a selective review of the regularization methods scattered across the statistics literature. We introduce a general conceptual approach to regularization and fit most existing methods into it. We focus on the importance of regularization when dealing with today’s high-dimensional objects: data and models. A wide range of examples is discussed, including nonparametric regression, boosting, covariance matrix estimation, principal component estimation, and subsampling.
