Hub Discovery in Partial Correlation Graphs

One of the most important problems in large-scale inference is the identification of variables that are strongly dependent on many other variables. When dependence is measured by partial correlations, these variables correspond to rows of the partial correlation matrix that have several large-magnitude entries, i.e., hubs in the associated partial correlation graph. This paper develops theory and algorithms for discovering such hubs from a small number of observations of these variables. We introduce a hub screening framework in which the user specifies both a minimum (partial) correlation $\rho$ and a minimum degree $\delta$ to screen the vertices. The choice of $\rho$ and $\delta$ can be guided by our mathematical expressions for the phase transition correlation threshold $\rho_c$ governing the average number of discoveries. It can also be guided by our asymptotic expressions for familywise discovery rates in the regime of a large number $p$ of variables, a fixed number $n$ of multivariate samples, and weak dependence. Under the null hypothesis that the dispersion (covariance) matrix is sparse, these limiting expressions can be used to enforce familywise error constraints and to rank the discoveries in order of increasing statistical significance. For $n \ll p$, the computational complexity of the proposed partial correlation screening method is low, so the method is highly scalable and can be applied to significantly larger problems than previous approaches. The theory is applied to discovering hubs in a high-dimensional gene microarray dataset.
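To make the screening step concrete, the sketch below illustrates the basic idea of thresholding a partial correlation matrix at a user-specified $\rho$ and reporting vertices of degree at least $\delta$ as hub discoveries. It is a minimal illustration, not the authors' implementation: the function name, the use of a Moore-Penrose pseudo-inverse of the sample correlation matrix to handle $n \ll p$, and the parameter defaults are assumptions made for this example.

```python
import numpy as np

def partial_correlation_hub_screen(X, rho=0.5, delta=2):
    """Illustrative hub screen: flag variables whose partial correlation
    with at least `delta` other variables exceeds `rho` in magnitude.

    X : (n, p) array of n multivariate samples of p variables (n may be << p).
    """
    n, p = X.shape
    # Standardize each column (z-scores with sample standard deviation).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Sample correlation matrix; use a pseudo-inverse since it is singular when n << p.
    R = Z.T @ Z / (n - 1)
    K = np.linalg.pinv(R)
    # Standard conversion of a (pseudo-)precision matrix to partial correlations.
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    # Adjacency of the thresholded partial correlation graph and vertex degrees.
    A = np.abs(P) >= rho
    np.fill_diagonal(A, False)
    degrees = A.sum(axis=1)
    hubs = np.flatnonzero(degrees >= delta)
    return hubs, degrees

# Example usage on synthetic data with n = 30 samples and p = 200 variables.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 200))
hubs, deg = partial_correlation_hub_screen(X, rho=0.9, delta=3)
print(hubs, deg[hubs])
```

In practice, the choice of $\rho$ and $\delta$ in such a screen would be guided by the phase transition threshold $\rho_c$ and the familywise error expressions described in the abstract, rather than set arbitrarily as in this example.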
