ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time

The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, hierarchical clustering of extremely large sequence datasets is currently computationally expensive because of its quadratic time and space complexity. In this paper, we develop a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity while maintaining a clustering accuracy comparable to that of the standard method. The basic idea is to organize sequences into a pseudo-metric-based partitioning tree for sub-linear-time nearest-neighbor search, and then to use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the Human Microbiome Project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiments demonstrate that, with the power of parallel computing, it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
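
To make the idea of merging several cluster pairs per round concrete, the following Python sketch may help. It is not the ESPRIT-Forest implementation, and the abstract does not specify the merging criterion. The sketch assumes one plausible rule: in each round, every pair of clusters that are each other's nearest neighbor is merged, provided their distance does not exceed a threshold. The names `hamming_fraction`, `average_linkage`, `nearest_neighbor`, and `cluster_multi_pair` are hypothetical. So are the choices of a mismatch fraction on equal-length sequences and of average linkage. A brute-force scan stands in for the partitioning tree, so the sketch is quadratic per round rather than subquadratic.

```python
def hamming_fraction(a: str, b: str) -> float:
    """Fraction of mismatched positions (assumed distance; equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1, c2, seqs):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming_fraction(seqs[i], seqs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def nearest_neighbor(idx, clusters, seqs):
    """Brute-force nearest cluster; a tree-based search would replace this scan."""
    best, best_d = None, float("inf")
    for j, other in enumerate(clusters):
        if j == idx:
            continue
        d = average_linkage(clusters[idx], other, seqs)
        if d < best_d:  # strict '<' keeps the lowest index on ties
            best, best_d = j, d
    return best, best_d


def cluster_multi_pair(seqs, threshold):
    """Merge all reciprocal nearest-neighbor pairs within `threshold` each round."""
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > 1:
        # The searches in one round are independent of each other, so they
        # could be spread across threads; they run sequentially here.
        nn = [nearest_neighbor(i, clusters, seqs) for i in range(len(clusters))]
        pairs = [
            (i, j)
            for i, (j, d) in enumerate(nn)
            if i < j and nn[j][0] == i and d <= threshold
        ]
        if not pairs:
            break  # no reciprocal pair is close enough to merge
        merged = set()
        next_clusters = []
        for i, j in pairs:  # reciprocal pairs never share a cluster
            next_clusters.append(sorted(clusters[i] + clusters[j]))
            merged.update((i, j))
        next_clusters.extend(c for k, c in enumerate(clusters) if k not in merged)
        clusters = next_clusters
    return sorted(clusters)


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG", "GGGGGGGG"]
    print(cluster_multi_pair(reads, threshold=0.2))
```

Because reciprocal nearest-neighbor pairs are disjoint, all merges in a round can be applied without conflicts, which is what makes a multiple-pair rule of this kind amenable to parallel execution.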
