From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data

Background: The use of correlation networks is widespread in the analysis of gene expression and proteomics data, even though it is known that correlations not only confound direct and indirect associations but also provide no means to distinguish between cause and effect. A "causal" analysis typically requires the inference of a directed graphical model. However, this is rather difficult due to the curse of dimensionality.

Results: We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial correlation graph. Subsequently, a partial ordering of the nodes is established by multiple testing of the log-ratio of standardized partial variances. This allows a directed acyclic causal network to be identified as a subgraph of the partial correlation network. We illustrate the approach by analyzing a large Arabidopsis thaliana expression data set.

Conclusion: The proposed approach is a heuristic algorithm that is based on a number of approximations, such as substituting lower-order partial correlations by full-order partial correlations. Nevertheless, for small samples and for sparse networks the algorithm not only yields sensible first-order approximations of the causal structure in high-dimensional genomic data but is also computationally highly efficient.

Availability and Requirements: The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from http://strimmerlab.org/software/genets/. The software includes an R script for reproducing the network analysis of the Arabidopsis thaliana data.
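The two steps named in the abstract (correlation to partial correlation, then ordering nodes by the log-ratio of standardized partial variances) can be illustrated with a minimal NumPy sketch. This is not the GeneNet implementation. The function names, the fixed cutoffs standing in for the multiple-testing step, and the convention that an edge points from the node with the larger standardized partial variance to the node with the smaller one are all assumptions made for illustration. The sketch also inverts the correlation matrix directly, which only works when it is well conditioned.

```python
import numpy as np

def partial_correlations(corr):
    """Return full-order partial correlations and standardized partial variances.

    Assumes `corr` is an invertible correlation matrix (illustrative only;
    high-dimensional data would need a regularized estimate instead).
    """
    omega = np.linalg.inv(corr)            # inverse correlation matrix
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)         # partial correlation of i and j given the rest
    np.fill_diagonal(pcor, 1.0)
    spv = 1.0 / np.diag(omega)             # partial variance divided by variance
    return pcor, spv

def directed_edges(corr, pcor_cutoff=0.2, log_ratio_cutoff=0.1):
    """Hypothetical helper: orient edges of the partial correlation graph.

    Fixed cutoffs replace the multiple-testing step described in the abstract.
    Assumed convention: the arrow points from larger to smaller standardized
    partial variance. Edges whose log-ratio is near zero stay undirected and
    are omitted from the result.
    """
    pcor, spv = partial_correlations(corr)
    p = corr.shape[0]
    edges = []
    for i in range(p):
        for j in range(i + 1, p):
            if abs(pcor[i, j]) < pcor_cutoff:
                continue                   # no edge in the partial correlation graph
            log_ratio = np.log(spv[i] / spv[j])
            if log_ratio > log_ratio_cutoff:
                edges.append((i, j))       # i -> j
            elif log_ratio < -log_ratio_cutoff:
                edges.append((j, i))       # j -> i
    return edges

# Toy population correlation matrix: variables 0 and 1 are uncorrelated,
# and both are correlated (0.6) with variable 2.
corr = np.array([[1.0, 0.0, 0.6],
                 [0.0, 1.0, 0.6],
                 [0.6, 0.6, 1.0]])
print(directed_edges(corr))
```

In this toy matrix the two uncorrelated variables have equal standardized partial variances, so the edge between them is left undirected, while both are oriented towards the third variable.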
