PIVOTAL ESTIMATION OF NONPARAMETRIC FUNCTIONS VIA SQUARE-ROOT LASSO

In a nonparametric linear regression model we study a variant of LASSO, called square-root LASSO (√LASSO), which does not require knowledge of the scaling parameter σ of the noise or bounds for it. This work derives new finite-sample upper bounds for the prediction-norm rate of convergence, the ℓ1-rate of convergence, the ℓ∞-rate of convergence, and the sparsity of the √LASSO estimator. A lower bound for the prediction-norm rate of convergence is also established. In many non-Gaussian noise cases, we rely on moderate deviation theory for self-normalized sums and on new data-dependent empirical process inequalities to achieve Gaussian-like results provided log p = o(n^{1/3}), improving upon results derived in the parametric case that required log p ≲ log n. In addition, we derive finite-sample bounds on the performance of ordinary least squares (OLS) applied to the model selected by √LASSO, accounting for possible misspecification of the selected model. In particular, we provide mild conditions under which the rate of convergence of OLS post √LASSO is not worse than that of √LASSO. We also study two extreme cases: parametric noiseless and nonparametric unbounded variance. √LASSO has interesting theoretical guarantees in both. In the parametric noiseless case, unlike LASSO, √LASSO is capable of exact recovery. In the unbounded variance case it can still be consistent, since its penalty choice does not depend on σ. Finally, we conduct Monte Carlo experiments which show that the empirical performance of √LASSO is very similar to that of LASSO when σ is known. We also emphasize that √LASSO can be formulated as a convex programming problem and that its computational burden is similar to that of LASSO.
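To illustrate the convex programming formulation referenced above, the following minimal sketch solves the √LASSO problem, minimize over β of sqrt((1/n) Σ_i (y_i − x_i'β)^2) + (λ/n)·||β||_1, as a second-order cone program using the generic solver library cvxpy. The choice of solver library, the constants c and α, and the pivotal penalty rule λ = c·√n·Φ^{-1}(1 − α/(2p)) used here are illustrative assumptions drawn from the square-root LASSO literature, not the authors' own implementation.

    # Minimal sketch (assumed setup, not the paper's code): square-root LASSO
    # as a convex (second-order cone) program via cvxpy.
    import numpy as np
    import cvxpy as cp
    from scipy.stats import norm

    def sqrt_lasso(X, y, c=1.1, alpha=0.05):
        n, p = X.shape
        # Pivotal penalty level: does not involve the noise scale sigma.
        lam = c * np.sqrt(n) * norm.ppf(1.0 - alpha / (2.0 * p))
        beta = cp.Variable(p)
        # Objective: sqrt((1/n)||y - X beta||_2^2) + (lam/n)||beta||_1
        objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + (lam / n) * cp.norm(beta, 1)
        problem = cp.Problem(cp.Minimize(objective))
        problem.solve()  # conic solve; computational burden comparable to LASSO
        return beta.value

    # Usage on synthetic sparse data:
    # rng = np.random.default_rng(0)
    # X = rng.standard_normal((100, 200))
    # beta0 = np.zeros(200); beta0[:5] = 1.0
    # y = X @ beta0 + rng.standard_normal(100)
    # beta_hat = sqrt_lasso(X, y)

Note that, unlike the plain LASSO penalty, the rule for λ above involves only n, p, and a confidence level, which is what makes the estimator pivotal with respect to σ.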
