Identifying Satellites and Periodic Repetitions in Biological Sequences

In this paper we present an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats that number between 30 and as many as 1,000,000 copies, whose lengths vary between 2 and hundreds of base pairs, and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats in which copies may differ by up to epsilon = 15-20% from a consensus model of the repeating unit (implying that individual units may differ from each other by up to 2 epsilon). The algorithm has two parts. The first is a filter that eliminates all regions whose probability of containing a satellite is less than one in 10^4 when epsilon = 10%. The second performs an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repeated unit as well as the span of the satellite. The first phase was designed for efficiency and takes only O(n) time, where n is the length of the sequence. The second phase was designed for sensitivity and takes O(n · N(e, k)) time in the worst case, where k is the length of the repeating unit m, e = [epsilon k] is the number of differences allowed between each repeat unit and the model m, and N(e, k) is the maximum number of words that are at most e differences away from a word of length k. That is, N(e, k) is the maximum size of an e-neighborhood of a string of length k. Experiments show the second phase to be considerably faster in practice than the worst-case complexity bound suggests. Finally, the algorithm is easily adapted to finding tandem repeats in protein sequences, and can be extended to identify mixed direct-inverse tandem repeats.
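To make the quantity N(e, k) and the notion of a consensus model more concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes, for simplicity, that "differences" are substitutions only (Hamming distance) over the DNA alphabet; under that assumption every word of length k has an e-neighborhood of the same size, given by a closed form. The names `neighborhood`, `neighborhood_size`, and `is_tandem_of` are hypothetical and introduced only for illustration.

```python
from itertools import product
from math import comb

ALPHABET = "ACGT"  # assumption: DNA alphabet, substitutions only


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def neighborhood(model, e):
    """Brute-force e-neighborhood: all words of len(model) within e substitutions."""
    return {
        "".join(w)
        for w in product(ALPHABET, repeat=len(model))
        if hamming(model, w) <= e
    }


def neighborhood_size(k, e, sigma=len(ALPHABET)):
    """Closed form for the Hamming case: sum over i <= e of C(k, i) * (sigma - 1)^i."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(e + 1))


def is_tandem_of(model, region, e):
    """True if region splits into consecutive units of len(model),
    each within e substitutions of the model (no insertions or deletions)."""
    k = len(model)
    if len(region) == 0 or len(region) % k != 0:
        return False
    units = [region[i:i + k] for i in range(0, len(region), k)]
    return all(hamming(model, u) <= e for u in units)


if __name__ == "__main__":
    print(len(neighborhood("ACGT", 1)), neighborhood_size(4, 1))
    print(is_tandem_of("ACGT", "ACGTACGAACGT", 1))
```

The brute-force enumeration is exponential in k and is only meant to check the closed form on tiny inputs. If insertions and deletions are also counted as differences, the neighborhood is larger and its size depends on the word, which is why the abstract speaks of a maximum.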
