phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data

Background: The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high-throughput microbiome census data.

Results: Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open-source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research.

Conclusions: The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.
