Clustering and Classification of Human Microbiome Data: Evaluating the Impact of Different Settings in Bioinformatics Workflows

Microbiome studies are attracting increasing interest, especially in human health applications, where their use in disease prognostics, diagnostics and treatment can have an immense effect on quality of life. Settings chosen in the microbiome data preprocessing stage can lead to great variability in the generated operational taxonomic unit (OTU) tables, reflected in the size and sparseness of the data matrix. As there are still no solid guidelines on best practices, it is valuable to assess which machine learning algorithms give more stable results under varying preprocessing settings. In this study, we generated OTU tables from the data of the Moving Pictures of the Human Microbiome study, using two reference databases (Greengenes and Silva) and four levels of the similarity threshold (ranging from 90 to 99%), processed in the QIIME bioinformatics package. The results of the best-performing classification algorithm and the best-performing clustering algorithm are presented in detail: the Random Forest classifier (RF) and Spectral clustering (SC), respectively. The Random Forest classifier outperformed Spectral clustering in terms of accuracy. As the rate of data generation increases while the cost of labeling remains high, further improvement of clustering performance and ensemble approaches should be explored.
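
To make "size and sparseness" concrete, the following is a minimal sketch of how the two properties could be measured for a samples-by-OTUs count table. The function name `otu_table_summary` and both toy tables are hypothetical and invented for illustration; they are not taken from the study's data or from QIIME.

```python
def otu_table_summary(table):
    """Return (n_samples, n_otus, sparsity) for a samples-by-OTUs count table.

    `table` is a list of rows, one row per sample, one count per OTU.
    Sparsity is the fraction of entries equal to zero.
    """
    n_samples = len(table)
    n_otus = len(table[0]) if table else 0
    total = n_samples * n_otus
    zeros = sum(1 for row in table for count in row if count == 0)
    sparsity = zeros / total if total else 0.0
    return n_samples, n_otus, sparsity


# Toy counts (assumed, illustration only): the same two samples binned into
# fewer OTUs versus more OTUs.
coarse = [[12, 0, 5],
          [0, 7, 3]]
fine = [[9, 3, 0, 0, 5, 0],
        [0, 0, 4, 3, 0, 3]]

for name, table in (("coarse", coarse), ("fine", fine)):
    n_samples, n_otus, sparsity = otu_table_summary(table)
    print(f"{name}: {n_samples} samples x {n_otus} OTUs, sparsity {sparsity:.2f}")
```

In this toy case the table with more OTUs is both wider and sparser, which is the kind of variation a downstream classifier or clustering algorithm would have to tolerate.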
