Inferring Correlation Networks from Genomic Survey Data

High-throughput sequencing-based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities, be they from the world's oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that analyzing the type of data these techniques generate (referred to as compositional data) can produce unreliable results, because the observed data take the form of relative fractions of genes or species rather than their absolute abundances. Using simulated data and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with the opposite sign. Additionally, we show that community diversity is the key factor that modulates the severity of such compositional effects, and we develop a new approach, called SparCC (available at https://bitbucket.org/yonatanf/sparcc), which is capable of estimating correlation values from compositional data. To illustrate a potential application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimate that the standard approach yields 3 spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity.
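The compositional effect described above can be reproduced with a small simulation. The following is a minimal sketch and not the SparCC algorithm: it assumes independent log-normal absolute abundances (an illustrative choice, not taken from the paper), converts them to relative fractions, and compares the mean pairwise Pearson correlation before and after normalization. The helper names `closure`, `mean_offdiag_corr`, and `simulate` are hypothetical.

```python
import numpy as np


def closure(abundances):
    """Convert absolute abundances to relative fractions (each row sums to 1)."""
    return abundances / abundances.sum(axis=1, keepdims=True)


def mean_offdiag_corr(x):
    """Mean Pearson correlation over all distinct pairs of columns (taxa)."""
    corr = np.corrcoef(x, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()


def simulate(n_taxa, n_samples=2000, seed=0):
    """Return mean pairwise correlation of absolute and relative abundances.

    Assumption for illustration only: taxa are truly independent and
    log-normally distributed, so any correlation seen in the fractions
    is produced by the normalization step alone.
    """
    rng = np.random.default_rng(seed)
    absolute = rng.lognormal(mean=0.0, sigma=1.0, size=(n_samples, n_taxa))
    return mean_offdiag_corr(absolute), mean_offdiag_corr(closure(absolute))


# Low-diversity community: only a few taxa share the total.
low_abs, low_rel = simulate(n_taxa=3)
# High-diversity community: many taxa share the total.
high_abs, high_rel = simulate(n_taxa=100)

print(f"3 taxa:   absolute {low_abs:+.3f}, fractions {low_rel:+.3f}")
print(f"100 taxa: absolute {high_abs:+.3f}, fractions {high_rel:+.3f}")
```

Under these assumptions the absolute abundances are uncorrelated, while the fractions of the 3-taxon community are strongly negatively correlated. For D exchangeable components constrained to a constant sum, the expected pairwise correlation is -1/(D-1), so the artifact shrinks as the number of comparably abundant taxa grows, in line with the role of diversity noted above.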
