Efficient Computation of Substring Equivalence Classes with Suffix Arrays

This paper considers the enumeration of the substring equivalence classes introduced by Blumer et al. [1], who used them to define an index structure called the compact directed acyclic word graph (CDAWG). In text analysis, these equivalence classes are useful because they group together redundant substrings with essentially identical occurrences. We show how to enumerate the equivalence classes using suffix arrays. Our algorithm uses the rank and lcp arrays to traverse the corresponding suffix tree and needs no additional data structure. It runs in time linear in the length of the input string. We report experimental results comparing the running time and space consumption of our algorithm with those of suffix tree and CDAWG-based approaches.
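As a rough illustration of the idea, and not the algorithm of this paper, the sketch below enumerates the repeated substrings that cannot be extended to the left or to the right without losing occurrences (often called maximal repeats). Under the usual reading of the definition, each of these is the longest member of an equivalence class of substrings occurring at least twice. The function names are hypothetical, the suffix array is built by naive sorting rather than in linear time, and the left-extension check rescans each interval, so the sketch does not achieve the linear running time claimed above.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; for illustration only,
    # not one of the linear-time constructions from the literature.
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # Kasai-style computation: lcp[i] is the length of the longest common
    # prefix of the suffixes at sa[i-1] and sa[i]; lcp[0] is 0.
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for p in range(n):
        r = rank[p]
        if r == 0:
            h = 0
            continue
        q = sa[r - 1]
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1
    return lcp

def lcp_intervals(lcp):
    # Stack-based bottom-up enumeration of lcp intervals (h, lb, rb),
    # which correspond to the internal nodes of the suffix tree.
    n = len(lcp)
    stack = [(0, 0)]
    out = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1  # -1 flushes the stack at the end
        left = i - 1
        while stack and stack[-1][0] > cur:
            h, lb = stack.pop()
            out.append((h, lb, i - 1))
            left = lb
        if not stack or stack[-1][0] < cur:
            stack.append((cur, left))
    return out

def maximal_repeats(s):
    # Keep intervals whose occurrences are preceded by at least two
    # different characters (the start of the string counts as its own).
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    result = []
    for h, lb, rb in lcp_intervals(lcp):
        if h == 0:
            continue  # the root interval stands for the empty string
        left = {s[sa[k] - 1] if sa[k] > 0 else None for k in range(lb, rb + 1)}
        if len(left) > 1:
            result.append((s[sa[lb]:sa[lb] + h], rb - lb + 1))
    return result
```

For the input "banana", the sketch reports "ana" with 2 occurrences and "a" with 3. The substring "na" is dropped because both of its occurrences are preceded by "a", so it has essentially the same occurrences as "ana".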

[1] Peter Weiner et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[2] Ian H. Witten et al. Protein is incompressible, 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[3] David Haussler et al. Complete inverted files for efficient text retrieval and analysis, 1987, JACM.

[4] Kunihiko Sadakane et al. Faster suffix sorting, 2007, Theoretical Computer Science.

[5] Esko Ukkonen et al. On-line construction of suffix trees, 1995, Algorithmica.

[6] David Haussler et al. The Smallest Automaton Recognizing the Subwords of a Text, 1985, Theor. Comput. Sci.

[7] Hiroki Arimura et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, 2001, CPM.

[8] Dong Kyue Kim et al. Linear-Time Construction of Suffix Arrays, 2003, CPM.

[9] Srinivas Aluru et al. Space efficient linear time construction of suffix arrays, 2005, J. Discrete Algorithms.

[10] S. Muthukrishnan et al. On the sorting-complexity of suffix tree construction, 2000, JACM.

[11] Dominique Revuz et al. Minimisation of Acyclic Deterministic Automata in Linear Time, 1992, Theor. Comput. Sci.

[12] Timothy C. Bell et al. A corpus for the evaluation of lossless compression algorithms, 1997, Proceedings DCC '97. Data Compression Conference.

[13] Peter Sanders et al. Simple Linear Work Suffix Array Construction, 2003, ICALP.

[14] Dong Kyue Kim et al. Constructing suffix arrays in linear time, 2005, J. Discrete Algorithms.

[15] Edward M. McCreight et al. A Space-Economical Suffix Tree Construction Algorithm, 1976, JACM.

[16] Giancarlo Mauri et al. On-line construction of compact directed acyclic word graphs, 2005, Discret. Appl. Math.

[17] Peter Sanders et al. Linear work suffix array construction, 2006, JACM.

[18] Hideo Bannai et al. Unsupervised Spam Detection Based on String Alienness Measures, 2007, Discovery Science.

[19] Maxime Crochemore et al. Direct Construction of Compact Directed Acyclic Word Graphs, 1997, CPM.

[20] Enno Ohlebusch et al. Replacing suffix trees with enhanced suffix arrays, 2004, J. Discrete Algorithms.

[21] Eugene W. Myers et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[22] Masayuki Takeda et al. Discovering characteristic expressions in literary works, 2003, Theor. Comput. Sci.

[23] William F. Smyth et al. Fast Optimal Algorithms for Computing All the Repeats in a String, 2008, Stringology.