DMSC: A Dynamic Multi-Seeds Method for Clustering 16S rRNA Sequences Into OTUs

16S rRNA sequencing that combines PCR amplification with next-generation sequencing (NGS) is a cost-effective technique that has been used successfully to study the phylogeny and taxonomy of samples from complex microbiomes or environments. Clustering 16S rRNA sequences into operational taxonomic units (OTUs) is often the first step of many downstream analyses. Heuristic clustering is one of the most widely used approaches for generating OTUs. However, most heuristic OTU clustering methods select a single seed sequence to represent each cluster, so their results suffer from either overestimation of the number of OTUs or sensitivity to sequencing errors. In this paper, we present a novel dynamic multi-seeds clustering method, DMSC, for picking OTUs. DMSC first generates clusters heuristically according to a distance threshold. When a cluster reaches the pre-defined minimum size, DMSC selects multi-core sequences (MCS) as its seeds; the MCS are n core sequences (n ≥ 3) in which the distance between any two sequences is less than the distance threshold. A new sequence is assigned to a cluster depending on its average distance to the MCS and the standard deviation of the distances within the MCS. Whenever a new sequence is added to a cluster, DMSC dynamically updates the MCS, and it repeats this until no more sequences are merged into the cluster. DMSC was tested on several simulated and real-life sequence datasets and compared with traditional heuristic methods such as CD-HIT, UCLUST, and DBH. Experimental results in terms of the inferred number of OTUs, normalized mutual information (NMI), and the Matthews correlation coefficient (MCC) show that DMSC produces higher-quality clusters with low memory usage and reduces OTU overestimation. DMSC is also robust to sequencing errors. The DMSC software can be freely downloaded from https://github.com/NWPU-903PR/DMSC.
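
The multi-seed idea described in the abstract can be illustrated with a minimal Python sketch. It is not the DMSC implementation. The toy mismatch distance, the function names (`distance`, `pick_mcs`, `accepts`, `cluster`), and in particular the acceptance rule (mean distance to the seeds must not exceed the threshold minus the standard deviation of the pairwise seed distances) are assumptions made for illustration, because the abstract does not state the exact criterion.

```python
from itertools import combinations
from statistics import mean, pstdev


def distance(a, b):
    """Toy distance (assumption): fraction of mismatched positions.

    A real pipeline would use an alignment- or k-mer-based distance.
    """
    if len(a) != len(b):
        raise ValueError("toy distance expects equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pick_mcs(members, threshold, n=3):
    """Return the first n members that are pairwise closer than threshold, else None."""
    for combo in combinations(members, n):
        if all(distance(x, y) < threshold for x, y in combinations(combo, 2)):
            return list(combo)
    return None


def accepts(seq, mcs, threshold):
    """Hypothetical rule: mean distance to the seeds, tightened by the seed spread."""
    avg = mean([distance(seq, s) for s in mcs])
    spread = pstdev([distance(x, y) for x, y in combinations(mcs, 2)])
    return avg <= threshold - spread


def cluster(seqs, threshold, min_size=3, n=3):
    """Greedy clustering: single seed at first, multi-core seeds once a cluster is large enough."""
    clusters = []  # each cluster: {"members": [...], "mcs": list or None}
    for seq in seqs:
        for c in clusters:
            if c["mcs"] is not None:
                ok = accepts(seq, c["mcs"], threshold)
            else:
                # No MCS yet: compare against the first member as a single seed.
                ok = distance(seq, c["members"][0]) < threshold
            if ok:
                c["members"].append(seq)
                if len(c["members"]) >= min_size:
                    # Re-select the seeds after every merge (the "dynamic" update).
                    c["mcs"] = pick_mcs(c["members"], threshold, n)
                break
        else:
            clusters.append({"members": [seq], "mcs": None})
    return clusters


reads = ["AAAAAAAA", "AAAAAAAT", "AAAAAATA", "CCCCCCCC", "AAAAATAA"]
result = cluster(reads, threshold=0.3)
for c in result:
    print(len(c["members"]), c["mcs"])
```

In this toy run the first three reads form a cluster and become its seeds; the fourth read is too far from all of them and starts a new cluster, while the fifth is accepted on its average distance to the three seeds rather than on its distance to any single one.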
