Suffix trees for very large inputs

A suffix tree is a fundamental data structure for string searching algorithms. In real-life applications, however, current methods for constructing suffix trees do not scale to large inputs. Because suffix trees are larger than their input sequences and quickly outgrow main memory, the first half of this work focuses on designing a practical algorithm that avoids massive random access to the trees being built. The result is a new algorithm, DiGeST, which improves significantly on previous work by reducing random access to the suffix tree and performing only two passes over the disk data. As a result, this algorithm scales to larger genomic data than was managed before. All existing practical algorithms perform random access to the input string, which in essence requires the input to be small enough to fit in main memory. The ever-increasing amount of genomic data, however, requires the ability to build suffix trees for much larger strings. In the second half of this work we present another suffix tree construction algorithm, B2ST, which is able to construct suffix trees for input sequences significantly larger than the available main memory. Both the input string and the suffix tree are kept on disk, and the algorithm is designed to avoid multiple random I/Os to either of them. As a proof of concept, we show that B2ST can build a suffix tree for 12 GB of real DNA sequences in 26 hours on a single machine with 2 GB of RAM. This input is four times the size of the human genome. The construction of suffix trees for inputs of such magnitude had not been reported before. Finally, we show that, once the off-line suffix tree construction is complete, search queries on entire sequenced genomes can be performed very efficiently. This high query performance is due to a special disk layout of the suffix trees produced by our algorithms.
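For readers unfamiliar with the structure, the following is a minimal in-memory sketch of the idea behind suffix-based indexing. It is not DiGeST or B2ST: it builds an uncompressed suffix trie (a simpler relative of the suffix tree) with a naive quadratic insertion loop, and all names in it (`Node`, `build_suffix_trie`, `find_all`, `count_nodes`) are hypothetical and chosen only for illustration. It shows two points made above: a pattern lookup walks from the root one character at a time, and the index holds more nodes than the input has characters.

```python
class Node:
    """One trie node: outgoing edges plus the suffix start positions passing through it."""
    __slots__ = ("children", "starts")

    def __init__(self):
        self.children = {}   # character -> child Node
        self.starts = []     # start indices of all suffixes that reach this node


def build_suffix_trie(text):
    """Naive construction: insert every suffix of text + '$' character by character."""
    text += "$"              # terminator so that no suffix is a prefix of another
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(i)
    return root


def find_all(root, pattern):
    """Return the sorted start positions of pattern (empty list if absent or pattern is empty)."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return sorted(node.starts)


def count_nodes(root):
    """Count all nodes, including the root, with an explicit stack."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += 1
        stack.extend(node.children.values())
    return total


if __name__ == "__main__":
    dna = "GATTACA"
    trie = build_suffix_trie(dna)
    print(find_all(trie, "A"))
    print(find_all(trie, "TA"))
    print(count_nodes(trie) > len(dna))
```

A real suffix tree compresses single-child paths into edges labelled by substrings, which keeps the node count linear in the input length. Even so, the tree remains several times larger than the input, which is why construction for genome-scale data has to be organised around disk access patterns.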
