Statistical Methods for Detecting Differentially Abundant Features in Clinical Metagenomic Samples

Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. Important prerequisites for such discoveries are computational tools that can rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them. We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g., as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level.
While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.
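
As a rough illustration of the two ingredients named above, the following minimal Python sketch computes a two-sided Fisher's exact test on a pooled 2x2 count table for a sparsely sampled feature and applies a false discovery rate adjustment across features. It is not the Metastats implementation: the function names, the layout of the 2x2 table (feature versus all other features, population 1 versus population 2), and the use of the Benjamini-Hochberg step-up procedure as the FDR adjustment are assumptions made for this example, and the rule for deciding which features count as sparse is left to the caller.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration only):
      a = pooled count of the feature in population 1
      b = pooled count of the feature in population 2
      c = pooled count of all other features in population 1
      d = pooled count of all other features in population 2
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of every table that is no more likely than the
    # observed one; the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one would compute a p-value per feature (for example with `fisher_exact_two_sided` for the sparse ones), pass the full list to `benjamini_hochberg`, and report the features whose adjusted value falls below a chosen FDR threshold.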
