Finding Approximate Repetitions under Hamming Distance

The problem of computing periodicities with K possible mismatches is studied. Two main definitions are considered, and for both of them an O(nK log K + S) algorithm is proposed, where n is the word length and S is the size of the output. In particular, this improves the bound obtained by G. Landau and J. Schmidt in 1993 (Proceedings of the Fourth Annual Symposium on Combinatorial Pattern Matching, Padova, Italy, Lecture Notes in Computer Science, Vol. 684, Springer, Berlin, pp. 120-133). Finally, other possible definitions are briefly analyzed.
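To make the problem statement concrete, the sketch below enumerates approximate tandem repeats by brute force. It assumes one common definition that the abstract itself does not spell out: a K-mismatch tandem repeat is a factor uv with |u| = |v| whose two halves differ in at most K positions (Hamming distance). The function name and the (start, period) output format are illustrative choices, and the quadratic running time is only a baseline for comparison; this is not the O(nK log K + S) algorithm of the paper.

```python
def naive_k_mismatch_tandem_repeats(word, k):
    """Return all (start, period) pairs such that the two halves
    word[start:start+period] and word[start+period:start+2*period]
    differ in at most k positions.

    Assumed definition (not taken from the abstract); O(n^2) time.
    """
    n = len(word)
    found = []
    for p in range(1, n // 2 + 1):
        # Mismatches between the two halves of the window starting at 0.
        mism = sum(word[i] != word[i + p] for i in range(p))
        for start in range(n - 2 * p + 1):
            if mism <= k:
                found.append((start, p))
            # Slide the window one position to the right, if possible.
            if start + 2 * p < n:
                mism -= word[start] != word[start + p]
                mism += word[start + p] != word[start + 2 * p]
    return found


if __name__ == "__main__":
    # "abc" and "abd" differ in one position, so (0, 3) appears for k = 1.
    print(naive_k_mismatch_tandem_repeats("abcabd", 1))
```

The sliding mismatch counter keeps each period at O(n) work, so the whole enumeration costs O(n^2) regardless of K, which is the kind of bound that output-sensitive algorithms aim to beat.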

[1]  Michael G. Main,et al.  An O(n log n) Algorithm for Finding All Repetitions in a String , 1984, J. Algorithms.

[2]  Richard Cole,et al.  Approximate string matching: a simpler faster algorithm , 2002, SODA '98.

[3]  Wojciech Rytter,et al.  Squares, cubes, and time-space efficient string searching , 1995, Algorithmica.

[4]  Michael G. Main,et al.  Detecting leftmost maximal periodicities , 1989, Discret. Appl. Math..

[5]  Zvi Galil,et al.  Time-Space-Optimal String Matching , 1983, J. Comput. Syst. Sci..

[6]  S. Rao Kosaraju,et al.  Computation of Squares in a String (Preliminary Version) , 1994, CPM.

[7]  G. Kucherov,et al.  Maximal Repetitions and Application to DNA sequences , 2000 .

[8]  Gregory Kucherov,et al.  On Maximal Repetitions in Words , 1999, FCT.

[9]  Eugene W. Myers,et al.  Identifying satellites in nucleic acid sequences , 1998, RECOMB '98.

[10]  G. Benson,et al.  Tandem repeats finder: a program to analyze DNA sequences. , 1999, Nucleic acids research.

[11]  Jens Stoye,et al.  Linear time algorithms for finding and representing all the tandem repeats in a string , 2004, J. Comput. Syst. Sci..

[12]  R. Tarjan Amortized Computational Complexity , 1985 .

[13]  Costas S. Iliopoulos,et al.  A Characterization of the Squares in a Fibonacci String , 1997, Theor. Comput. Sci..

[14]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[15]  F. Denoeud,et al.  A tandem repeats database for bacterial genomes: application to the genotyping of Yersinia pestis and Bacillus anthracis , 2001, BMC Microbiology.

[16]  Gad M. Landau,et al.  An Algorithm for Approximate Tandem Repeats , 1993, CPM.

[17]  Gregory Kucherov,et al.  Finding Approximate Repetitions under Hamming Distance , 2001, ESA.

[18]  Michael Rodeh,et al.  Linear Algorithm for Data Compression via String Matching , 1981, JACM.

[19]  Christian Choffrut,et al.  Combinatorics of Words , 1997, Handbook of Formal Languages.

[20]  A. O. Slisenko,et al.  Detection of periodicities and string-matching in real time , 1983 .

[21]  A. van Belkum,et al.  Variable number of tandem repeats in clinical strains of Haemophilus influenzae , 1997 .

[22]  Max Dauchet,et al.  A first step toward chromosome analysis by compression algorithms , 1995, Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS'95.

[23]  Gary Benson,et al.  An algorithm for finding tandem repeats of unspecified pattern size , 1998, RECOMB '98.

[24]  Gregory Kucherov,et al.  Finding maximal repetitions in a word in linear time , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[25]  Maxime Crochemore,et al.  An Optimal Algorithm for Computing the Repetitions in a Word , 1981, Inf. Process. Lett..

[26]  M. Crochemore Recherche linéaire d'un carré dans un mot , 1983 .

[27]  M. Waterman,et al.  A method for fast database search for all k-nucleotide repeats. , 1994, Nucleic acids research.