A Heuristic For Computing Repeats With A Factor Oracle: Application To Biological Sequences

We present a linear time and space method for computing the length of a repeated suffix for each prefix of a given word p. The method relies on the factor oracle of p, a new and very compact structure introduced in [1] for representing all the factors of p. We exhibit applications in which the method speeds up the computation of repetitions in words.
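To make the two objects named in the abstract concrete, here is a minimal Python sketch under stated assumptions. `build_factor_oracle` follows the standard on-line construction of the factor oracle described in [1]; the function names, the dictionary-based transition table, and the `supply` array layout are illustrative choices, not taken from the article. `naive_repeated_suffix_lengths` is a brute-force reference that only pins down the quantity being computed (for each prefix, the length of its longest suffix that has another occurrence in that prefix). It is not the linear-time heuristic of the article, which is not reproduced here.

```python
def build_factor_oracle(p):
    """Return (trans, supply) for the factor oracle of the word p.

    trans[i] maps a letter to the target state of the transition leaving
    state i; supply[i] is the supply link of state i (-1 for state 0).
    """
    m = len(p)
    trans = [{} for _ in range(m + 1)]
    supply = [-1] * (m + 1)
    for i in range(1, m + 1):
        c = p[i - 1]
        trans[i - 1][c] = i  # internal transition from state i-1 to state i
        k = supply[i - 1]
        # Follow supply links, adding external transitions to state i
        # until a state that already has a transition by c is found.
        while k != -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans, supply


def accepts(trans, w):
    """Return True if w labels a path starting at state 0 of the oracle."""
    state = 0
    for c in w:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


def naive_repeated_suffix_lengths(p):
    """Brute-force reference (not linear time).

    lengths[i] is the length of the longest suffix of p[:i] that also
    occurs in p[:i] ending before position i; overlaps are allowed.
    """
    lengths = [0] * (len(p) + 1)
    for i in range(1, len(p) + 1):
        for l in range(i - 1, 0, -1):
            # Any other occurrence must end before i, so it lies in p[:i-1].
            if p[i - l:i] in p[:i - 1]:
                lengths[i] = l
                break
    return lengths


trans, supply = build_factor_oracle("abbbaab")
print(accepts(trans, "bba"))                     # True: "bba" is a factor
print(accepts(trans, "aba"))                     # True, although "aba" is not a factor
print(naive_repeated_suffix_lengths("abbbaab"))  # [0, 0, 0, 1, 2, 1, 1, 2]
```

The second call illustrates a known property of the structure: the oracle accepts every factor of p but may also accept words that are not factors, which is consistent with the method being presented as a heuristic.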

[1] Maxime Crochemore, et al. Factor Oracle: A New Structure for Pattern Matching, 1999, SOFSEM.

[2] Peter M. Fenwick. The Burrows-Wheeler Transform for Block Sorting Text Compression: Principles and Improvements, 1996, Comput. J.

[3] Alberto Apostolico, et al. The Myriad Virtues of Subword Trees, 1985.

[4] Arnaud Lefebvre, et al. Compror: Compression with a Factor Oracle, 2001, Data Compression Conference.

[5] David Haussler, et al. The Smallest Automaton Recognizing the Subwords of a Text, 1985, Theor. Comput. Sci.

[6] Peter Weiner, et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[7] Stefan Kurtz, et al. REPuter: fast computation of maximal repeats in complete genomes, 1999, Bioinform.

[8] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[9] Donald E. Knuth, et al. Fast Pattern Matching in Strings, 1977, SIAM J. Comput.

[10] Maxime Crochemore, et al. Transducers and Repetitions, 1986, Theor. Comput. Sci.

[11] Wojciech Rytter. A Correct Preprocessing Algorithm for Boyer-Moore String-Searching, 1980, SIAM J. Comput.

[12] E. Myers, et al. Basic local alignment search tool, 1990, Journal of molecular biology.

[13] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[14] Robert S. Boyer, et al. A fast string searching algorithm, 1977, CACM.

[15] M. Cotton, et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana, 1999, Nature.

[16] William F. Smyth, et al. Repetitive perhaps, but certainly not boring, 2000, Theor. Comput. Sci.