(Un)expected behavior of typical suffix trees

Suffix tree is a data structure widely used in algorithms on words and data compression. Despite this, very little is known about its typical behavior. Recently, Chang and Lawler have designed a sublinear expected time algorithm for approximate string matching using simple estimates of some parameters of suffix trees. It seems that any further advances in such an endover are subject to better understanding of suffix trees behavior. In this paper, we use a novel technique called string ruler approach to provide a characterization of several basic parameters of suffix trees (dependency among symbols are allowed !). These findings are used to :(i) settle in the negative the conjecture of Wyner and Ziv regarding the typical behavior of the universal data compression scheme of Lampel and Ziv; (ii) prove an open problem regarding the length of a block in the Lampel-Ziv parsing algorithm; (iii) provide new insights and generalizations of string matching algorithms, particularly the one by Chang and Lawler.

[1]  Aaron D. Wyner,et al.  Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression , 1989, IEEE Trans. Inf. Theory.

[2]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[3]  Donald E. Knuth,et al.  The art of computer programming: sorting and searching (volume 3) , 1973 .

[4]  Philippe Jacquet,et al.  Analysis of digital tries with Markovian dependency , 1991, IEEE Trans. Inf. Theory.

[5]  Samuel Karlin,et al.  Counts of long aligned word matches among random letter sequences , 1987, Advances in Applied Probability.

[6]  Zvi Galil,et al.  An Improved Algorithm for Approximate String Matching , 1990, SIAM J. Comput..

[7]  Wojciech Szpankowski,et al.  Asymptotic properties of data compression and suffix trees , 1993, IEEE Trans. Inf. Theory.

[8]  K. Chung,et al.  On the application of the borel-cantelli lemma , 1952 .

[9]  Wojciech Szpankowski,et al.  Suffi x Trees Revisited: (Un)Expected Asymptotic Behaviors , 1991 .

[10]  David Haussler,et al.  Average sizes of suffix trees and DAWGs , 1989, Discret. Appl. Math..

[11]  Eugene L. Lawler,et al.  Approximate string matching in sublinear expected time , 1990, Proceedings [1990] 31st Annual Symposium on Foundations of Computer Science.

[12]  Michael Rodeh,et al.  Linear Algorithm for Data Compression via String Matching , 1981, JACM.

[13]  Andrew Odlyzko,et al.  Long repetitive patterns in random sequences , 1980 .

[14]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[15]  Uzi Vishkin,et al.  Fast String Matching with k Differences , 1988, J. Comput. Syst. Sci..

[16]  P Erd,et al.  On the application of the borel-cantelli lemma , 1952 .

[17]  Uzi Vishkin,et al.  Deterministic sampling—a new technique for fast pattern matching , 1990, STOC '90.

[18]  I. Good,et al.  Ergodic theory and information , 1966 .

[19]  Peter Grassberger,et al.  Estimating the information content of symbol sequences and efficient codes , 1989, IEEE Trans. Inf. Theory.

[20]  Gad M. Landau,et al.  Fast Parallel and Serial Approximate String Matching , 1989, J. Algorithms.

[21]  B. Pittel Asymptotical Growth of a Class of Random Trees , 1985 .

[22]  Donald Ervin Knuth,et al.  The Art of Computer Programming , 1968 .

[23]  Wojciech Szpankowski,et al.  A Note on the Height of Suffix Trees , 1992, SIAM J. Comput..

[24]  Franco P. Preparata,et al.  Optimal Off-Line Detection of Repetitions in a String , 1983, Theor. Comput. Sci..

[25]  Philippe Jacquet,et al.  Autocorrelation on Words and Its Applications - Analysis of Suffix Trees by String-Ruler Approach , 1994, J. Comb. Theory, Ser. A.

[26]  Alfred V. Aho,et al.  The Design and Analysis of Computer Algorithms , 1974 .

[27]  Abraham Lempel,et al.  On the Complexity of Finite Sequences , 1976, IEEE Trans. Inf. Theory.

[28]  Leonidas J. Guibas,et al.  String Overlaps, Pattern Matching, and Nontransitive Games , 1981, J. Comb. Theory, Ser. A.

[29]  Wojciech Szpankowski,et al.  Self-Alignments in Words and Their Applications , 1992, J. Algorithms.

[30]  B. Pittel Paths in a random digital tree: limiting distributions , 1986, Advances in Applied Probability.