On-Line Linear-Time Construction of Word Suffix Trees

Suffix trees are the key data structure for text string matching and are used in a wide range of application areas, such as bioinformatics and data compression. Sparse suffix trees are a kind of suffix tree that represents only a subset of the suffixes of the input string. In this paper we study word suffix trees, which are one variation of sparse suffix trees. Let D be a dictionary of words and let w be a string in D+, namely, w is a sequence w1 ⋯ wk of k words in D. The word suffix tree of w w.r.t. D is a path-compressed trie that represents only the k suffixes of the form wi ⋯ wk, for 1 ≤ i ≤ k. A typical application is word- and phrase-level search on natural language documents. Andersson et al. proposed an algorithm that builds word suffix trees in O(n) expected time with O(k) space, where n is the length of w in characters. In this paper we present a new word suffix tree construction algorithm with O(n) running time and O(k) space in the worst case. Our algorithm is on-line, which means that it processes the characters of the input sequentially, one by one, from left to right.
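To make the definition concrete, the following is a minimal Python sketch of a word suffix tree built by naive insertion. It is not the on-line linear-time algorithm of this paper, nor the algorithm of Andersson et al.: it inserts each of the k word-aligned suffixes one at a time, which can take quadratic time in the worst case. The delimiter "#" after every word, the end marker "$", and the names Node, build_word_suffix_tree and find_phrase are assumptions made for this illustration only. Edge labels are stored as index pairs into the text, so the tree itself has O(k) nodes.

```python
class Node:
    def __init__(self, start=0, end=0, word_index=None):
        self.start = start            # incoming edge label is text[start:end]
        self.end = end
        self.children = {}            # first character of edge label -> child Node
        self.word_index = word_index  # i if this leaf represents w_i ... w_k, else None


def build_word_suffix_tree(words, sep="#", end="$"):
    """Naive construction: insert only the suffixes that start at a word boundary.

    Assumes sep and end do not occur inside any word (illustrative convention).
    """
    text = "".join(w + sep for w in words) + end
    root = Node()
    pos = 0
    for i, w in enumerate(words):
        _insert(root, text, pos, i)
        pos += len(w) + len(sep)
    return root, text


def _insert(root, text, pos, word_index):
    node = root
    while True:
        child = node.children.get(text[pos])
        if child is None:
            # No edge starts with this character: hang a new leaf here.
            node.children[text[pos]] = Node(pos, len(text), word_index)
            return
        # Walk along the edge while the characters agree.
        length = child.end - child.start
        m = 0
        while m < length and text[child.start + m] == text[pos + m]:
            m += 1
        if m == length:
            node, pos = child, pos + m
            continue
        # Mismatch inside the edge: split it (this keeps the trie path-compressed).
        mid = Node(child.start, child.start + m)
        node.children[text[pos]] = mid
        child.start += m
        mid.children[text[child.start]] = child
        mid.children[text[pos + m]] = Node(pos + m, len(text), word_index)
        return


def find_phrase(root, text, phrase, sep="#"):
    """Return the sorted word indices i at which the word sequence `phrase` starts."""
    pattern = "".join(w + sep for w in phrase)
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return []
        step = min(child.end - child.start, len(pattern) - i)
        if text[child.start:child.start + step] != pattern[i:i + step]:
            return []
        node, i = child, i + step
    # Every leaf below the reached node is an occurrence.
    hits, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.word_index is not None:
            hits.append(cur.word_index)
        stack.extend(cur.children.values())
    return sorted(hits)


words = "to be or not to be".split()
root, text = build_word_suffix_tree(words)
print(find_phrase(root, text, ["to", "be"]))  # word positions of the phrase "to be"
print(find_phrase(root, text, ["o"]))         # "o" is not a word, so nothing matches
```

Because only word-aligned suffixes are inserted, a query such as "o" does not match inside "to" or "or", which is the word-level search behaviour the abstract describes.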

[1] Ayumi Shinohara, et al. Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts, 2002, SPIRE.

[2] Juha Kärkkäinen, et al. Sparse Suffix Trees, 1996, COCOON.

[3] Ayumi Shinohara, et al. Efficiently Finding Regulatory Elements Using Correlation with Gene Expression, 2004, J. Bioinform. Comput. Biol.

[4] Ayumi Shinohara, et al. Finding Optimal Pairs of Cooperative and Competing Patterns with Bounded Distance, 2004, Discovery Science.

[5] Joong Chae Na, et al. Truncated suffix trees and their application to data compression, 2003, Theor. Comput. Sci.

[6] Marek J. Sergot, et al. Distributed and Paged Suffix Trees for Large Genetic Databases, 2003, CPM.

[7] Shunsuke Inenaga, et al. Finding Missing Patterns, 2004, WABI.

[8] Edward M. McCreight, et al. A Space-Economical Suffix Tree Construction Algorithm, 1976, JACM.

[9] Gaston H. Gonnet, et al. Efficient Text Searching of Regular Expressions, 1989, WADS.

[10] Alberto Apostolico, et al. The Myriad Virtues of Subword Trees, 1985.

[11] Ayumi Shinohara, et al. Simple Linear-Time Off-Line Text Compression by Longest-First Substitution, 2007, 2007 Data Compression Conference (DCC'07).

[12] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[13] Peter Weiner, et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[14] Esko Ukkonen, et al. On-line construction of suffix trees, 1995, Algorithmica.

[15] Marie-France Sagot, et al. Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification, 2000, RECOMB '00.

[16] N. Jesper Larsson. Extended application of suffix trees to data compression, 1996, Proceedings of Data Compression Conference - DCC '96.

[17] Bogdan Dorohonceanu, et al. Accelerating Protein Classification Using Suffix Trees, 2000, ISMB.

[18] Gaston H. Gonnet, et al. Efficient Text Searching of Regular Expressions (Extended Abstract), 1989, ICALP.

[19] Alfred V. Aho, et al. Efficient string matching, 1975, Commun. ACM.

[20] Arne Andersson, et al. Suffix Trees on Words, 1996, Algorithmica.