Adaptive minimax regression estimation over sparse lq-hulls

Given a dictionary of Mn predictors in a random-design regression setting with n observations, we construct estimators that target the best performance among all linear combinations of the predictors under a sparse lq-norm (0 ≤ q ≤ 1) constraint on the linear coefficients. Besides identifying the optimal rates of convergence, our universal aggregation strategies by model mixing achieve these rates simultaneously over the full range 0 ≤ q ≤ 1, for any Mn, and without knowledge of the lq-norm of the best linear coefficients representing the regression function. To allow model misspecification, the upper bounds are obtained in a framework of aggregation of estimates. A striking feature is that no specific relationship among the predictors is needed to achieve the upper rates of convergence, hence permitting essentially arbitrary correlations between the predictors. Therefore, whatever the true regression function (assumed to be uniformly bounded), our estimators automatically exploit any sparse representation of it, to the best extent possible within the lq-constrained linear combinations for any 0 ≤ q ≤ 1. A sparse approximation result for lq-hulls turns out to be crucial for adaptively achieving minimax-rate optimal aggregation: it precisely characterizes the number of terms needed to attain a prescribed accuracy of approximation to the best linear combination in an lq-hull for 0 ≤ q ≤ 1. It also offers the insight that the minimax rate of lq-aggregation is essentially determined by an effective model size, a sparsity index that depends on q, Mn, n, and the lq-norm bound in an easily interpretable way, in light of classical model selection theory for dealing with a large number of models.
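
For intuition only, the following is a minimal sketch of aggregation by exponential model mixing in the spirit described above. It is not the paper's actual procedure: the function name, the data split, and the temperature parameter beta are assumptions made for illustration, and the sketch mixes raw predictors rather than estimates fitted on candidate models.

```python
# A minimal sketch (assumed, illustrative) of aggregation by exponential
# model mixing: Mn predictors are combined with weights that decay
# exponentially in their cumulative squared error on a held-out split.
import numpy as np

def exponential_mixing(F_fit, y_fit, F_new, beta=0.05):
    """Aggregate M predictors by exponential weighting.

    F_fit : (n_fit, M) predictor values on the weighting split
    y_fit : (n_fit,)  responses on the weighting split
    F_new : (n_new, M) predictor values where predictions are wanted
    beta  : temperature; larger beta concentrates weight on the best predictor
    """
    # Cumulative squared error of each of the M predictors.
    losses = ((F_fit - y_fit[:, None]) ** 2).sum(axis=0)
    # Exponential weights; subtracting the minimum avoids underflow.
    w = np.exp(-beta * (losses - losses.min()))
    w /= w.sum()
    # The aggregate is a convex combination of the predictors (weights sum to 1).
    return F_new @ w

# Toy usage: Mn = 50 predictors, only 3 of which are informative.
rng = np.random.default_rng(0)
n, M = 200, 50
X = rng.normal(size=(n, M))                      # dictionary of predictors
f_true = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2]   # sparse underlying signal
y = f_true + rng.normal(scale=0.5, size=n)

half = n // 2
y_hat = exponential_mixing(X[:half], y[:half], X[half:])
print("test MSE:", np.mean((y_hat - f_true[half:]) ** 2))
```

Exponential weighting of this kind is the classical model-mixing device; the strategies in the paper instead mix estimates from a large collection of candidate models, which is what yields adaptation over the full range 0 ≤ q ≤ 1 without knowledge of the lq-norm bound.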
