Speeding up the detection of tandem repeats over the edit distance

A tandem repeat in a string is a sequence of two or more contiguous, approximate copies of a pattern. An efficient, deterministic algorithm that exhaustively searches a string for tandem repeats is of practical importance, because tandem repeats are relevant to several areas, including human identity testing, disease diagnosis, sequence homology, and population studies. In this paper, we present an O(nk log^2 k + Occ) algorithm that finds all approximate tandem repeats within a sequence of length n, allowing at most k insertions, deletions, and mismatches in each repeat. Our algorithm uses the Lempel-Ziv (LZ) factorization, which was previously used in algorithms that locate exact tandem repeats and in algorithms that locate tandem repeats with only mismatches. We combine the LZ framework with speedups for calculating the edit distance, yielding a new and efficient exhaustive search for tandem repeats in a string.
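To make the matching criterion concrete, the following minimal Python sketch tests whether two adjacent substrings are within edit distance k of each other. It is not the algorithm of this paper: it uses a plain banded dynamic program costing O(k · length) time per candidate pair, with no LZ factorization and no incremental speedups. The function names `edit_distance_within` and `is_approximate_tandem_repeat`, and the reading of "at most k errors" as an edit-distance bound between the two copies, are assumptions made for illustration.

```python
def edit_distance_within(a: str, b: str, k: int) -> bool:
    """Return True if the edit distance between a and b is at most k.

    Only cells with |i - j| <= k are evaluated; every value is capped
    at k + 1, since anything larger cannot yield a distance <= k.
    """
    if abs(len(a) - len(b)) > k:
        return False
    inf = k + 1  # cap meaning "more than k edits"
    # prev[j]: capped distance between the empty prefix of a and b[:j]
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i  # delete the first i characters of a
        lo = max(1, i - k)
        hi = min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or mismatch
                prev[j] + 1,         # deletion from a
                cur[j - 1] + 1,      # insertion into a
                inf,
            )
        prev = cur
    return prev[len(b)] <= k


def is_approximate_tandem_repeat(s: str, start: int, mid: int, end: int, k: int) -> bool:
    """Check whether s[start:mid] and s[mid:end] are adjacent copies within k edits."""
    if not (0 <= start < mid < end <= len(s)):
        return False
    return edit_distance_within(s[start:mid], s[mid:end], k)
```

For example, `is_approximate_tandem_repeat("ACGTACCT", 0, 4, 8, 1)` compares `ACGT` with `ACCT`, which differ by a single mismatch. Enumerating every (start, mid, end) triple with this check would be far slower than the bound stated above; the sketch only pins down what counts as a match.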
