The Cost-free Nature of Optimally Tuning Tikhonov Regularizers and Other Ordered Smoothers

We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, of selecting a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $C\sigma^2$, where $\sigma^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, as long as the estimators share the same penalty matrix, the error term depends neither on that matrix nor on the number of estimators: it applies to any grid of tuning parameters, no matter how large its cardinality. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators, where one typically pays a cost of $\sigma^2\log(M)$, with $M$ the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers, which encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.
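To make the guarantee concrete: writing $\hat\mu_1,\dots,\hat\mu_M$ for the Tikhonov estimators on the grid and $\hat\mu_{\hat\theta}$ for the aggregate, the statement above is an oracle inequality of the schematic form $\mathbb{E}\|\hat\mu_{\hat\theta}-\mu\|^2 \le \min_{1\le j\le M}\mathbb{E}\|\hat\mu_j-\mu\|^2 + C\sigma^2$, with $C$ independent of $M$ and of the shared penalty matrix. The Python sketch below (assuming numpy and scipy are available) illustrates one way such a convex aggregation could be set up in a toy denoising problem; the penalty matrix, the grid of tuning parameters, and the exact constants in the $Q$-aggregation-style criterion are illustrative assumptions, not the paper's exact procedure.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy Gaussian denoising setup: y = mu + noise, noise level sigma.
n, sigma = 200, 1.0
mu = np.sin(np.linspace(0, 4 * np.pi, n))          # unknown signal
y = mu + sigma * rng.standard_normal(n)

# Family of Tikhonov smoothers sharing one penalty matrix K:
# mu_hat_lambda = A_lambda y with A_lambda = (I + lambda K)^{-1}.
# K is a discrete second-difference penalty; the lambda grid is arbitrary.
D = np.diff(np.eye(n), n=2, axis=0)
K = D.T @ D
lambdas = np.geomspace(1e-3, 1e3, 25)
A_list = [np.linalg.inv(np.eye(n) + lam * K) for lam in lambdas]
M = np.column_stack([A @ y for A in A_list])       # n x m matrix of estimates
df = np.array([np.trace(A) for A in A_list])       # tr(A_lambda): degrees of freedom

def q_criterion(theta):
    """Convex Q-aggregation-style criterion (nu = 1/2) with a Mallows-type
    2*sigma^2*tr(A) correction for linear smoothers; constants are illustrative."""
    fit = M @ theta
    rss = np.sum((y - fit) ** 2)
    spread = theta @ np.sum((M - fit[:, None]) ** 2, axis=0)
    return rss + 0.5 * spread + 2 * sigma**2 * (theta @ df)

# Minimize over the simplex {theta >= 0, sum(theta) = 1}.
m = len(lambdas)
cons = [{"type": "eq", "fun": lambda t: np.sum(t) - 1.0}]
res = minimize(q_criterion, np.full(m, 1.0 / m), bounds=[(0.0, 1.0)] * m,
               constraints=cons, method="SLSQP")
theta_hat = res.x

agg = M @ theta_hat                                # aggregated estimator
errs = np.sum((M - mu[:, None]) ** 2, axis=0)      # per-lambda squared errors (mu known here only for this check)
print("squared error of aggregate      :", np.sum((agg - mu) ** 2))
print("squared error of best single lambda:", errs.min())

In this toy example the squared error of the aggregate is expected to stay close to that of the best single-$\lambda$ smoother, and, per the result above, the gap should remain of order $\sigma^2$ even if the 25-point grid of tuning parameters is refined arbitrarily.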
