Statistical Applications in Genetics and Molecular Biology A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics

Inferring large-scale covariance matrices from sparse genomic data is an ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As statistically efficient and computationally fast alternative we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity.Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches both in simulations as well as in application to real expression data.

[1]  H. Hotelling New Light on the Correlation Coefficient and its Transforms , 1953 .

[2]  C. Stein Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution , 1956 .

[3]  A. E. Hoerl,et al.  Ridge Regression: Applications to Nonorthogonal Problems , 1970 .

[4]  B. Efron,et al.  Data Analysis Using Stein's Estimator and its Generalizations , 1975 .

[5]  B. Efron Biased Versus Unbiased Estimation , 1975 .

[6]  B. Efron,et al.  Stein's Paradox in Statistics , 1977 .

[7]  Bradley Efron,et al.  Maximum Likelihood and Decision Theory , 1982 .

[8]  C. Morris Parametric Empirical Bayes Inference: Theory and Applications , 1983 .

[9]  Tom Leonard,et al.  Parametric Empirical Bayes Inference: Theory and Applications: Comment , 1983 .

[10]  N. Higham COMPUTING A NEAREST SYMMETRIC POSITIVE SEMIDEFINITE MATRIX , 1988 .

[11]  J. Friedman Regularized Discriminant Analysis , 1989 .

[12]  J. N. R. Jeffers,et al.  Graphical Models in Applied Multivariate Statistics. , 1990 .

[13]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[14]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[15]  Pui Lam Leung,et al.  Estimation of the Scale Matrix and its Eigenvalues in the Wishart and the Multivariate F Distributions , 1998 .

[16]  D. Botstein,et al.  Cluster analysis and display of genome-wide expression patterns. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[17]  Rosario N. Mantegna,et al.  Book Review: An Introduction to Econophysics, Correlations, and Complexity in Finance, N. Rosario, H. Mantegna, and H. E. Stanley, Cambridge University Press, Cambridge, 2000. , 2000 .

[18]  A. Butte,et al.  Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[19]  Arthur E. Hoerl,et al.  Ridge Regression: Biased Estimation for Nonorthogonal Problems , 2000, Technometrics.

[20]  S Greenland,et al.  Principles of multilevel modelling. , 2000, International journal of epidemiology.

[21]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[22]  R. Kass,et al.  Shrinkage Estimators for Covariance Matrices , 2001, Biometrics.

[23]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[24]  Hiroyuki Toh,et al.  Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling , 2002, Bioinform..

[25]  John D. Storey A direct approach to false discovery rates , 2002 .

[26]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[27]  B. Efron Large-Scale Simultaneous Hypothesis Testing , 2004 .

[28]  Olivier Ledoit,et al.  Honey, I Shrunk the Sample Covariance Matrix , 2003 .

[29]  Olivier Ledoit,et al.  Improved estimation of the covariance matrix of stock returns with an application to portfolio selection , 2003 .

[30]  Paul M. Magwene,et al.  Estimating genomic coexpression networks using first-order conditional independence , 2004, Genome Biology.

[31]  T. Hastie,et al.  Classification of gene microarrays by penalized logistic regression. , 2004, Biostatistics.

[32]  W. Schmidt-Heck,et al.  Reverse Engineering of the Stress Response during Expression of a Recombinant Protein , 2004 .

[33]  Alberto de la Fuente,et al.  Discovery of meaningful associations in genomic data using partial correlation coefficients , 2004, Bioinform..

[34]  A. Barabasi,et al.  Network biology: understanding the cell's functional organization , 2004, Nature Reviews Genetics.

[35]  Olivier Ledoit,et al.  A well-conditioned estimator for large-dimensional covariance matrices , 2004 .

[36]  D. Cox,et al.  A note on pseudolikelihood constructed from marginal densities , 2004 .

[37]  Trevor Hastie,et al.  Regularized Discriminant Analysis and Its Application in Microarrays , 2004 .

[38]  Gordon K Smyth,et al.  Statistical Applications in Genetics and Molecular Biology Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2011 .

[39]  M. West,et al.  Sparse graphical models for exploring gene expression data , 2004 .

[40]  P. Bühlmann,et al.  Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana , 2004, Genome Biology.

[41]  B. Efron Correlation and Large-Scale Simultaneous Significance Testing , 2007 .

[42]  X. Cui,et al.  Improved statistical tests for differential gene expression by shrinking variance components estimates. , 2005, Biostatistics.

[43]  Bradley Efron,et al.  Local False Discovery Rates , 2005 .

[44]  Panos M. Pardalos,et al.  Statistical analysis of financial networks , 2005, Comput. Stat. Data Anal..

[45]  Korbinian Strimmer,et al.  An empirical Bayes approach to inferring large-scale gene association networks , 2005, Bioinform..

[46]  Korbinian Strimmer,et al.  Learning Large‐Scale Graphical Gaussian Models from Genomic Data , 2005 .

[47]  N. Meinshausen,et al.  High-dimensional graphs and variable selection with the Lasso , 2006, math/0608017.