High‐Dimensional Heteroscedastic Regression with an Application to eQTL Data Analysis

We consider the problem of high-dimensional regression under non-constant error variances. Although heteroscedasticity is common in biological applications, it has so far been largely ignored in the high-dimensional analysis of genomic data sets. We propose a new methodology that allows non-constant error variances in high-dimensional estimation and model selection. Our method incorporates heteroscedasticity by simultaneously modeling the mean and variance components via a novel doubly regularized approach. Extensive Monte Carlo simulations indicate that the proposed procedure can yield better estimation and variable selection than existing methods when heteroscedasticity arises either from predictors that explain the error variances or from outliers. Further, we demonstrate the presence of heteroscedasticity in an expression quantitative trait loci (eQTL) study of 112 yeast segregants and apply our method to it. The new procedure automatically accounts for heteroscedasticity when identifying the eQTLs associated with gene expression variations, and it leads to smaller prediction errors. These results demonstrate the importance of considering heteroscedasticity in eQTL data analysis.
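
To make the idea of a doubly regularized mean-and-variance model concrete, the sketch below evaluates one plausible objective of that kind. The abstract does not state the objective, so everything here is an assumption made for illustration: a Gaussian likelihood, a log-linear variance model that uses the same predictors as the mean, separate L1 penalties on the two coefficient vectors, and the hypothetical names `doubly_penalized_nll`, `lam_mean`, and `lam_var`.

```python
import math


def doubly_penalized_nll(beta, alpha, X, y, lam_mean, lam_var):
    """Penalized Gaussian negative log-likelihood, up to an additive constant.

    Assumed model (illustrative, not taken from the abstract):
        y_i = x_i' beta + sigma_i * eps_i,   log(sigma_i^2) = x_i' alpha
    with one L1 penalty on beta (mean) and another on alpha (variance).
    """
    nll = 0.0
    for x_i, y_i in zip(X, y):
        mean_i = sum(b * x for b, x in zip(beta, x_i))
        log_var_i = sum(a * x for a, x in zip(alpha, x_i))
        resid = y_i - mean_i
        # 0.5 * log(sigma_i^2) + 0.5 * resid^2 / sigma_i^2
        nll += 0.5 * log_var_i + 0.5 * resid ** 2 * math.exp(-log_var_i)
    penalty = lam_mean * sum(abs(b) for b in beta)
    penalty += lam_var * sum(abs(a) for a in alpha)
    return nll + penalty


# Tiny example: two observations, two predictors.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
value = doubly_penalized_nll([0.0, 0.0], [0.0, 0.0], X, y, 0.1, 0.1)
```

Under these assumptions, setting `alpha` to zero reduces the first term to ordinary least squares, which is why a constant-variance lasso can be seen as a special case of such an objective. A fitting routine would minimize this function over both `beta` and `alpha`; that step is omitted here.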
