Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible. October 1, 2013

Current practice in the normalization of microbiome count data is inefficient in the statistical sense. For apparently historical reasons, the common approach is either to use simple proportions (which does not address heteroscedasticity) or to use rarefying of counts, even though both of these approaches are inappropriate for detection of differentially abundant species. Well-established statistical theory is available that simultaneously accounts for library size differences and biological variability using an appropriate mixture model. Moreover, specific implementations for DNA sequencing read count data (based on a Negative Binomial model for instance) are already available in RNA-Seq focused R packages such as edgeR and DESeq. Here we summarize the supporting statistical theory and use simulations and empirical data to demonstrate substantial improvements provided by a relevant mixture model framework over simple proportions or rarefying. We show how both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes. Regarding microbiome sample-wise clustering, we also show that the rarefying procedure often discards samples that can be accurately clustered by alternative methods. We further compare different Negative Binomial methods with a recently-described zero-inflated Gaussian mixture, implemented in a package called metagenomeSeq. We find that metagenomeSeq performs well when there is an adequate number of biological replicates, but it nevertheless tends toward a higher false positive rate. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.
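To make the rarefying procedure concrete, the following is a minimal Python sketch and not the authors' implementation (their tools are the R packages named above). It assumes that rarefying means subsampling each sample's reads without replacement down to a common depth and dropping any sample whose library size is below that depth. The function name `rarefy`, the toy count vectors, and the chosen depth are all hypothetical and used only for illustration.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample per-taxon counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which
    mimics the sample being discarded (an assumption of this sketch).
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        return None
    # Drawing `depth` reads without replacement from the pool of observed
    # reads is a multivariate hypergeometric draw over the taxa.
    return rng.multivariate_hypergeometric(counts, depth)


rng = np.random.default_rng(0)

# Hypothetical toy data: three samples, three taxa, unequal library sizes.
samples = {
    "A": [900, 90, 10],     # 1,000 reads
    "B": [9000, 900, 100],  # 10,000 reads, same proportions as A
    "C": [40, 5, 5],        # 50 reads, below the chosen depth
}
depth = 1000  # hypothetical common depth

rarefied = {name: rarefy(c, depth, rng) for name, c in samples.items()}
kept = {name: v for name, v in rarefied.items() if v is not None}

reads_before = sum(sum(c) for c in samples.values())
reads_after = sum(int(v.sum()) for v in kept.values())

print("samples kept:", sorted(kept))
print("reads before:", reads_before, "reads after:", reads_after)
```

In this toy setting, sample C is dropped entirely and most of sample B's reads are thrown away, which illustrates the loss of samples and counts that the abstract objects to. The sketch says nothing about false positive rates; those depend on the simulations and models described in the paper.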
