Optimal computation of all tandem repeats in a weighted sequence

Background: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats, not only in exact ones. A weighted sequence is a string in which a set of letters may occur at each position, each with its own probability of occurrence. It arises naturally in many biological contexts and provides a way to model the approximation among distinct adjacent occurrences of the same DNA segment.

Results: Crochemore's repetitions algorithm, also referred to as Crochemore's partitioning algorithm, was introduced in 1981 and was the first optimal O(n log n)-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore's partitioning algorithm for weighted sequences, which requires optimal O(n log n) time, thus improving on the best known O(n^2)-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.
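To make the notion of a weighted sequence concrete, the following is a minimal brute-force sketch in Python. It is not the partitioning algorithm described above, and it rests on assumptions the abstract does not state: positions are treated as independent, a factor "occurs" when the product of its letter probabilities reaches a user-chosen threshold `min_prob`, and only squares (two adjacent copies) are reported. The names `WeightedSequence` and `naive_squares` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Assumed representation: one dict per position, mapping letter -> probability.
WeightedSequence = List[Dict[str, float]]


def naive_squares(ws: WeightedSequence, min_prob: float) -> List[Tuple[int, str, float]]:
    """Enumerate (start, factor, probability) for every square factor+factor
    whose occurrence probability at `start` is at least `min_prob`.

    Brute force for illustration only; it makes no attempt to reach the
    O(n log n) bound discussed in the text.
    """
    n = len(ws)
    found: List[Tuple[int, str, float]] = []

    def extend(start: int, period: int, factor: str, prob: float) -> None:
        # `prob` is the probability that `factor` occurs at `start`
        # and again at `start + period` (positions assumed independent).
        if len(factor) == period:
            found.append((start, factor, prob))
            return
        i = len(factor)
        for letter, p_first in ws[start + i].items():
            p_second = ws[start + period + i].get(letter, 0.0)
            q = prob * p_first * p_second
            # Probabilities only shrink as the factor grows, so prune early.
            if q >= min_prob:
                extend(start, period, factor + letter, q)

    for period in range(1, n // 2 + 1):
        for start in range(0, n - 2 * period + 1):
            extend(start, period, "", 1.0)
    return found


# Hypothetical example: positions 1 and 3 are uncertain (C or G, equally likely).
example: WeightedSequence = [
    {"A": 1.0},
    {"C": 0.5, "G": 0.5},
    {"A": 1.0},
    {"C": 0.5, "G": 0.5},
]
squares = naive_squares(example, min_prob=0.25)
```

In this example both "ACAC" and "AGAG" are candidate squares starting at position 0, each with probability 0.25 under the independence assumption; raising `min_prob` above 0.25 rules both out.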
