Efficient implementation of lazy suffix trees

We present an efficient implementation of a write-only top-down construction for suffix trees. Our implementation is based on a new, space-efficient representation of suffix trees that requires only 12 bytes per input character in the worst case, and 8.5 bytes per input character on average for a collection of files of different types. We show how to efficiently implement the lazy evaluation of suffix trees such that a subtree is evaluated only when it is traversed for the first time. Our experiments show that for the problem of searching many exact patterns in a fixed input string, the lazy top-down construction is often faster and more space-efficient than other methods. Copyright © 2003 John Wiley & Sons, Ltd.
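The idea of lazy evaluation described above can be illustrated with a minimal sketch. It is not the paper's implementation and does not use its compact representation: the class `LazyNode`, the functions `contains` and `evaluated_count`, and the dictionary-based child table are all hypothetical names and choices made for illustration. The sketch assumes the text ends with a unique sentinel character such as `$`. Each node stores only the start positions of the suffixes below it and splits them by first character the first time a search reaches it.

```python
class LazyNode:
    """Suffix tree node whose children are built on first traversal (illustrative)."""

    def __init__(self, text, suffixes):
        self.text = text
        self.suffixes = suffixes  # start positions of the remaining suffixes
        self.children = None      # None means "not yet evaluated"

    def expand(self):
        """Evaluate this node once: group suffixes by first character."""
        if self.children is not None:
            return
        text = self.text
        groups = {}
        for s in self.suffixes:
            if s < len(text):
                groups.setdefault(text[s], []).append(s)
        self.children = {}
        for ch, group in groups.items():
            if len(group) == 1:
                # A single suffix becomes a leaf edge running to the end of the text.
                lcp = len(text) - group[0]
            else:
                # Extend the edge label while all suffixes in the group agree.
                lcp = 1
                while (all(s + lcp < len(text) for s in group)
                       and len({text[s + lcp] for s in group}) == 1):
                    lcp += 1
            child = LazyNode(text, [s + lcp for s in group])
            # Edge label is text[group[0] : group[0] + lcp].
            self.children[ch] = (group[0], lcp, child)
        self.suffixes = None  # no longer needed once the node is evaluated


def contains(root, pattern):
    """Return True if pattern occurs in the text, expanding nodes only as needed."""
    node, i = root, 0
    while i < len(pattern):
        node.expand()
        edge = node.children.get(pattern[i])
        if edge is None:
            return False
        start, length, child = edge
        k = min(length, len(pattern) - i)
        if node.text[start:start + k] != pattern[i:i + k]:
            return False
        i += k
        node = child
    return True


def evaluated_count(node):
    """Count the nodes that have been evaluated so far."""
    if node.children is None:
        return 0
    return 1 + sum(evaluated_count(c) for _, _, c in node.children.values())


if __name__ == "__main__":
    text = "banana$"
    root = LazyNode(text, list(range(len(text))))
    print(contains(root, "nan"), contains(root, "nab"))
    print("evaluated nodes:", evaluated_count(root))
```

Only the nodes on the paths followed by actual queries are ever evaluated, which is why this style of construction can pay off when the number of patterns is small relative to the text. The sketch trades the paper's space efficiency for readability.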
