KABOOM! A new suffix array based algorithm for clustering expression data

MOTIVATION Second-generation sequencing technology has reinvigorated research using expression data. Clustering such data remains a significant challenge, because the datasets are much larger and have different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets.

RESULTS We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive with respect to both quality and run time.

AVAILABILITY Source code and binaries are available under the GPL at http://code.google.com/p/wcdest. The tool runs on Linux and MacOS X.

CONTACT scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
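The filter idea described under RESULTS can be illustrated with a small sketch. This is not the wcd-express implementation, which uses modified suffix arrays; it is a naive dictionary-based version that only shows the shape of the test. The function names `kmer_positions` and `passes_filter`, the parameters `k` (match length) and `min_gap` (minimum distance between matches), and the choice to require the gap in both strings are all assumptions made for illustration.

```python
def kmer_positions(s, k):
    """Map each length-k substring of s to the list of its start positions."""
    index = {}
    for i in range(len(s) - k + 1):
        index.setdefault(s[i:i + k], []).append(i)
    return index


def passes_filter(s, t, k=4, min_gap=6):
    """Return True if s and t share two exact matches of length k whose
    start positions differ by at least min_gap in both strings.

    Hypothetical parameters: k and min_gap are illustrative values,
    not the thresholds used by wcd-express.
    """
    index_t = kmer_positions(t, k)
    # Every pair (i, j) such that s[i:i+k] == t[j:j+k].
    hits = [(i, j)
            for i in range(len(s) - k + 1)
            for j in index_t.get(s[i:i + k], [])]
    # Look for two hits that are far enough apart in both strings.
    for a, (i1, j1) in enumerate(hits):
        for i2, j2 in hits[a + 1:]:
            if abs(i1 - i2) >= min_gap and abs(j1 - j2) >= min_gap:
                return True
    return False


if __name__ == "__main__":
    # Two sequences sharing "ACGT" at the start and "TTCAGG" near the end.
    print(passes_filter("ACGTACGGTTCAGGCT", "ACGTTTTTTTCAGGAA"))
    # A repeated match that maps to a single region of the second string.
    print(passes_filter("ACGTACGT", "ACGTTTTT"))
```

Only pairs that pass such a test would need a full comparison. The quadratic scan over hits is kept for readability; a suffix array based implementation, as the abstract describes, would avoid enumerating all pairs.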
