Matching statistics: efficient computation and a new practical algorithm for the multiple common substring problem

We present new algorithms for computing matching statistics with suffix arrays. We show how the Multiple Common Substring Problem can be solved efficiently in practice with a new approach using matching statistics. This problem consists of finding the common substrings of a set of strings. For the computation of matching statistics we compare seven different methods based on suffix trees and suffix arrays. Most of the suffix array algorithms have an inferior asymptotic worst case running time but a very low memory overhead and small constants in the running time complexity. Our experiments show a good performance in practice. Copyright © 2005 John Wiley & Sons, Ltd.

[1]  Wojciech Szpankowski,et al.  A Generalized Suffix Tree and its (Un)expected Asymptotic Behaviors , 1993, SIAM J. Comput..

[2]  Roberto Grossi,et al.  Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching , 2005, SIAM J. Comput..

[3]  Enno Ohlebusch,et al.  Optimal Exact Strring Matching Based on Suffix Arrays , 2002, SPIRE.

[4]  Enno Ohlebusch,et al.  The Enhanced Suffix Array and Its Applications to Genome Analysis , 2002, WABI.

[5]  Esko Ukkonen,et al.  On-line construction of suffix trees , 1995, Algorithmica.

[6]  Peter Sanders,et al.  Simple Linear Work Suffix Array Construction , 2003, ICALP.

[7]  Enno Ohlebusch,et al.  Replacing suffix trees with enhanced suffix arrays , 2004, J. Discrete Algorithms.

[8]  Robert Giegerich,et al.  From Ukkonen to McCreight and Weiner: A Unifying View of Linear-Time Suffix Tree Construction , 1997, Algorithmica.

[9]  Kunihiko Sadakane,et al.  Faster suffix sorting , 2007, Theoretical Computer Science.

[10]  Srinivas Aluru,et al.  Space efficient linear time construction of suffix arrays , 2003, J. Discrete Algorithms.

[11]  Moritz G. Maaß A Fast Algorithm for the Inexact Characteristic String Problem , 2003 .

[12]  David Haussler,et al.  Average sizes of suffix trees and DAWGs , 1989, Discret. Appl. Math..

[13]  Lucas Chi Kwong Hui,et al.  Color Set Size Problem with Application to String Matching , 1992, CPM.

[14]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[15]  Peter Sanders,et al.  Linear work suffix array construction , 2006, JACM.

[16]  Roberto Grossi,et al.  Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract) , 2000, STOC '00.

[17]  W. Szpankowski Average Case Analysis of Algorithms on Sequences , 2001 .

[18]  Aaron D. Wyner,et al.  Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression , 1989, IEEE Trans. Inf. Theory.

[19]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[20]  Eugene L. Lawler,et al.  Sublinear approximate string matching and biological applications , 1994, Algorithmica.

[21]  Wojciech Szpankowski,et al.  Asymptotic properties of data compression and suffix trees , 1993, IEEE Trans. Inf. Theory.

[22]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[23]  Juha Kärkkäinen,et al.  Fast Lightweight Suffix Array Construction and Checking , 2003, CPM.

[24]  Giovanni Manzini,et al.  Engineering a Lightweight Suffix Array Construction Algorithm , 2004, Algorithmica.

[25]  Enno Ohlebusch,et al.  Efficient multiple genome alignment , 2002, ISMB.

[26]  Dong Kyue Kim,et al.  An Efficient Index Data Structure with the Capabilities of Suffix Trees and Suffix Arrays for Alphabets of Non-negligible Size , 2004, SPIRE.

[27]  Kunihiko Sadakane,et al.  Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array , 2000, ISAAC.

[28]  Uzi Vishkin,et al.  On Finding Lowest Common Ancestors: Simplification and Parallelization , 1988, AWOC.

[29]  Stefan Kurtz,et al.  Reducing the space requirement of suffix trees , 1999 .

[30]  Dong Kyue Kim,et al.  A Fast Algorithm for Constructing Suffix Arrays for Fixed-Size Alphabets , 2004, WEA.

[31]  Hiroki Arimura,et al.  Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications , 2001, CPM.

[32]  Dong Kyue Kim,et al.  Linear-Time Construction of Suffix Arrays , 2003, CPM.

[33]  Robert Giegerich,et al.  Efficient implementation of lazy suffix trees , 2003, Softw. Pract. Exp..