Variance estimation in high-dimensional linear models

The residual variance and the proportion of explained variation are key quantities in many statistical models and model-fitting procedures. They are central to regression diagnostics and model selection, and they determine performance limits in many problems. In this paper we propose new method-of-moments estimators for the residual variance, the proportion of explained variation and related quantities such as the ℓ2 signal strength. The proposed estimators are consistent and asymptotically normal in high-dimensional linear models with Gaussian predictors and errors, where the number of predictors d is proportional to the number of observations n; in fact, consistency holds even in settings where d/n → ∞. Existing results on residual variance estimation in high-dimensional linear models depend on sparsity of the underlying signal. Our results require no sparsity assumptions and imply that the residual variance and the proportion of explained variation can be consistently estimated even when d > n and the underlying signal itself is nonestimable. Numerical work suggests that some of our distributional assumptions may be relaxed. A real-data analysis involving gene expression data and single nucleotide polymorphism data illustrates the performance of the proposed methods.
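To make the method-of-moments idea concrete, the following is a minimal sketch under a simplifying assumption that the abstract does not state: the rows of X are i.i.d. N(0, I_d), that is, the predictor covariance is the identity. Under that assumption, with y = Xβ + ε, τ² = ||β||² and σ² the residual variance, standard Wishart moment identities give E||y||² = n(τ² + σ²) and E||Xᵀy||² = n(n + d + 1)τ² + ndσ². Solving these two equations for σ² and τ² yields unbiased estimators that never estimate β itself. The function names `mom_estimates` and `simulate` are illustrative, and the paper's own estimators may differ in detail (for example, in how a non-identity covariance is handled).

```python
import numpy as np


def mom_estimates(X, y):
    """Moment-based estimates of sigma^2, tau^2 = ||beta||^2 and
    r^2 = tau^2 / (tau^2 + sigma^2).

    ASSUMPTION (not from the abstract): rows of X are i.i.d. N(0, I_d).
    """
    n, d = X.shape
    y_sq = float(y @ y)            # ||y||^2
    xty = X.T @ y
    xty_sq = float(xty @ xty)      # ||X^T y||^2
    denom = n * (n + 1)
    # Solve the two moment equations for sigma^2 and tau^2.
    sigma2_hat = ((d + n + 1) * y_sq - xty_sq) / denom
    tau2_hat = (xty_sq - d * y_sq) / denom
    # sigma2_hat + tau2_hat equals ||y||^2 / n, positive unless y = 0.
    r2_hat = tau2_hat / (sigma2_hat + tau2_hat)
    return sigma2_hat, tau2_hat, r2_hat


def simulate(n, d, sigma2, tau2, seed=0):
    """Draw (X, y) from a dense (non-sparse) Gaussian linear model."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d)
    beta *= np.sqrt(tau2) / np.linalg.norm(beta)   # force ||beta||^2 = tau2
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y


# d > n and a dense signal: beta is not estimable, but sigma^2 still is.
X, y = simulate(n=1000, d=2000, sigma2=1.0, tau2=1.0)
sigma2_hat, tau2_hat, r2_hat = mom_estimates(X, y)
print(f"sigma2_hat={sigma2_hat:.3f}  tau2_hat={tau2_hat:.3f}  r2_hat={r2_hat:.3f}")
```

Because the estimators are unbiased but unconstrained, individual estimates can fall outside their natural ranges in finite samples (for instance, a negative `tau2_hat`); truncating them is a practical choice rather than part of the derivation.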
