Self-Alignments in Words and Their Applications

Abstract: Some quantities associated with periodicities in words are analyzed within the Bernoulli probabilistic model. In particular, the following problem is addressed. Assume that a string X is given, with symbols emitted randomly and independently according to some known probability distribution. Then, for each pair (W, Z) of distinct suffixes of X, the expected length of the longest common prefix of W and Z is sought. These lengths, called here self-alignments, play a crucial role in several algorithmic problems on words, such as building suffix trees or inverted files, detecting squares and other regularities, and computing substring statistics. The asymptotically best algorithms for these problems are quite complex and thus risk being impractical. The present analysis of self-alignments and related measures suggests that, in a variety of cases, more straightforward algorithmic solutions may yield comparable or even better performance.
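To make the quantity in the abstract concrete, the following is a minimal Python sketch, not the paper's method. The function names (`lcp_length`, `self_alignments`, `expected_self_alignment`) and the 0-based suffix positions are illustrative assumptions. The expectation routine relies on one observation that is assumed here rather than taken from the abstract: the suffixes starting at positions i < j agree on their first k symbols exactly when the substring of length (j - i) + k starting at i has period j - i, so under the Bernoulli model the probability factors over residue classes modulo j - i.

```python
from itertools import combinations


def lcp_length(w, z):
    """Length of the longest common prefix of the sequences w and z."""
    k = 0
    for a, b in zip(w, z):
        if a != b:
            break
        k += 1
    return k


def self_alignments(x):
    """Brute force: map each pair (i, j), i < j, of suffix start
    positions of x to the longest common prefix length of x[i:] and x[j:]."""
    return {
        (i, j): lcp_length(x[i:], x[j:])
        for i, j in combinations(range(len(x)), 2)
    }


def expected_self_alignment(probs, n, i, j):
    """Expected LCP length of the suffixes starting at i < j in a random
    string of length n whose symbols are drawn independently from probs
    (a dict symbol -> probability). Assumption of this sketch: the
    expectation is computed as the sum over k of P(LCP >= k)."""
    d = j - i
    total = 0.0
    for k in range(1, n - j + 1):
        length = d + k  # this window must have period d
        p_ge_k = 1.0
        for r in range(d):
            # number of window positions congruent to r modulo d;
            # all of them must carry the same symbol
            m = (length - r + d - 1) // d
            p_ge_k *= sum(p ** m for p in probs.values())
        total += p_ge_k
    return total


if __name__ == "__main__":
    print(self_alignments("abab"))
    print(expected_self_alignment({"0": 0.5, "1": 0.5}, 3, 0, 1))
```

The brute-force routine takes quadratic time per pair in the worst case and is meant only for checking small inputs against the closed-form expectation.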
