Gclust: A Parallel Clustering Tool for Microbial Genomic Data

The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses these resources. Building databases of non-redundant reference sequences from massive microbial genomic data by clustering analysis is therefore essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences. Clustering is accelerated by a novel parallelization strategy and a fast sequence comparison algorithm based on sparse suffix arrays (SSAs), and genome identity between two sequences is calculated from their maximal exact matches (MEMs). We demonstrate the high speed and clustering quality of Gclust on four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.
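To make the idea of a MEM-based identity concrete, the following is a minimal Python sketch, not Gclust's implementation. It rests on several assumptions that the text above does not specify: MEMs are found with a naive quadratic dynamic-programming scan rather than a sparse suffix array, identity is taken to be the fraction of query positions covered by at least one MEM, and sequences are grouped with a CD-HIT-style greedy incremental scheme. The function names (`maximal_exact_matches`, `mem_identity`, `greedy_cluster`) and the parameters `min_len` and `threshold` are hypothetical and chosen for illustration only.

```python
def maximal_exact_matches(a, b, min_len):
    """Return (start_a, start_b, length) for each maximal exact match >= min_len.

    Naive O(len(a) * len(b)) scan; a stand-in for the SSA-based search.
    """
    mems = []
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Length of the longest common suffix ending at a[i-1], b[j-1];
                # this makes the match left-maximal by construction.
                cur[j] = prev[j - 1] + 1
                # Right-maximal if either sequence ends or the next characters differ.
                at_end = i == len(a) or j == len(b)
                if cur[j] >= min_len and (at_end or a[i] != b[j]):
                    mems.append((i - cur[j], j - cur[j], cur[j]))
        prev = cur
    return mems


def mem_identity(query, ref, min_len=4):
    """Assumed identity measure: fraction of query positions covered by MEMs."""
    if not query:
        return 0.0
    covered = [False] * len(query)
    for start_q, _start_r, length in maximal_exact_matches(query, ref, min_len):
        for k in range(start_q, start_q + length):
            covered[k] = True
    return sum(covered) / len(query)


def greedy_cluster(seqs, threshold=0.9, min_len=4):
    """Greedy incremental clustering (assumed scheme): longest sequences first.

    Each cluster is a list of indices into seqs; the first index is the
    representative that later sequences are compared against.
    """
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    clusters = []
    for i in order:
        for cluster in clusters:
            if mem_identity(seqs[i], seqs[cluster[0]], min_len) >= threshold:
                cluster.append(i)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append([i])
    return clusters
```

For example, `greedy_cluster(["ACGTACGTACGT", "ACGTACGTACG", "TTTTGGGGTTTT"])` groups the first two sequences under one representative and leaves the third in its own cluster. On genome-scale input the quadratic scan is impractical, which is the motivation for a suffix-array index and for distributing comparisons across threads.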
