SEED: efficient clustering of next-generation sequences

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.

Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters whose members can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When SEED was used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60–85% and 21–41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12–27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.

Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/seed.

Contact: thomas.girke@ucr.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
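The hash-table clustering idea summarized in the Results can be illustrated with a small sketch. It is not SEED's implementation: the abstract does not specify the block spaced seed design, so the code below swaps in a simpler pigeonhole filter (two equal-length reads with at most k mismatches must agree exactly on at least one of k + 1 contiguous blocks). It also ignores overhanging residues and uses the first unassigned read as the cluster center instead of a virtual center. The names `hamming`, `block_keys`, and `cluster_reads` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(seq, n_blocks):
    """Yield (block_index, block_text) keys for n_blocks contiguous blocks."""
    size = len(seq) // n_blocks
    for i in range(n_blocks):
        start = i * size
        # The last block absorbs any leftover residues.
        end = len(seq) if i == n_blocks - 1 else start + size
        yield i, seq[start:end]


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering sketch (assumption: not SEED's actual procedure).

    Returns a list of clusters, each a list of read indices whose first
    element is the read used as the center.
    """
    n_blocks = max_mismatches + 1

    # Hash table: block key -> indices of reads containing that block.
    index = defaultdict(list)
    for rid, seq in enumerate(reads):
        for key in block_keys(seq, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for center, seq in enumerate(reads):
        if assigned[center]:
            continue
        assigned[center] = True
        members = [center]

        # Candidates share at least one exact block with the center.
        candidates = set()
        for key in block_keys(seq, n_blocks):
            candidates.update(index[key])

        # Verify each candidate against the mismatch threshold.
        for rid in sorted(candidates):
            other = reads[rid]
            if (not assigned[rid]
                    and len(other) == len(seq)
                    and hamming(seq, other) <= max_mismatches):
                assigned[rid] = True
                members.append(rid)
        clusters.append(members)
    return clusters
```

For example, `cluster_reads(["ACGTACGTACGT", "ACGTACGTACGA", "TTTTTTTTTTTT", "TTTTTTTTTTAA"])` groups the first two reads together and the last two together. The hash lookup limits full comparisons to reads that share a block with the center, which is the general reason seed-based filters scale better than all-against-all comparison.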
