Measurement error in Lasso: impact and likelihood bias correction

Regression with the lasso penalty is a popular tool for performing dimension reduction when the number of covariates is large. In many applications of the lasso, such as genomics, covariates are subject to measurement error. We study the impact of measurement error on linear regression with the lasso penalty, both analytically and in simulation experiments. A simple method of correction for measurement error in the lasso is then considered. In the large sample limit, the corrected lasso yields sign-consistent covariate selection under conditions very similar to those for the lasso with perfect measurements, whereas the uncorrected lasso requires much more stringent conditions on the covariance structure of the data. Finally, we suggest methods to correct for measurement error in generalized linear models with the lasso penalty, which we study empirically in simulation experiments with logistic regression and also apply to a classification problem with microarray data. We see that the corrected lasso selects fewer false positives than the standard lasso, at a similar level of true positives. The corrected lasso can therefore be used to obtain more conservative covariate selection in genomic analysis.
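
The abstract does not spell out the form of the correction, so the following is only a minimal sketch of one common way such a correction is set up, not necessarily the authors' exact method. It assumes an additive error model W = X + U with a known error covariance `sigma_u`, replaces the Gram matrix W'W/n by the bias-corrected W'W/n - sigma_u, and minimizes the resulting quadratic loss over an l1-ball by projected gradient descent. The names `project_l1_ball`, `corrected_lasso`, `sigma_u`, and `radius`, as well as the simulated data, are hypothetical and chosen for illustration.

```python
import numpy as np


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]           # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    # Soft-threshold the magnitudes and restore the signs.
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)


def corrected_lasso(W, y, sigma_u, radius, n_iter=500):
    """Projected gradient descent on 0.5 * b'(W'W/n - sigma_u)b - (W'y/n)'b
    subject to ||b||_1 <= radius. Assumes sigma_u is known."""
    n, p = W.shape
    gram = W.T @ W / n - sigma_u            # bias-corrected Gram matrix
    rho = W.T @ y / n
    step = 1.0 / np.linalg.norm(gram, 2)    # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = gram @ beta - rho
        beta = project_l1_ball(beta - step * grad, radius)
    return beta


# Simulated example (all settings are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 500, 20
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, p))                 # true, unobserved covariates
U = rng.normal(scale=0.5, size=(n, p))      # measurement error
W = X + U                                   # observed, error-prone covariates
y = X @ beta_true + rng.normal(scale=0.5, size=n)
sigma_u = 0.25 * np.eye(p)                  # error covariance, assumed known
radius = np.abs(beta_true).sum()            # oracle radius, for illustration only
beta_hat = corrected_lasso(W, y, sigma_u, radius)
```

In practice the radius would not be known and would have to be tuned, for example by cross-validation, and the error covariance would typically have to be estimated from replicate measurements or prior knowledge. The l1-ball constraint keeps the problem bounded even when the corrected Gram matrix is not positive semidefinite, which can happen when the number of covariates exceeds the sample size.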
