Flexible Identification of Structural Objects in Nucleic Acid Sequences: Palindromes, Mirror Repeats, Pseudoknots and Triple Helices

This paper presents algorithms for flexibly identifying structural objects in nucleic acid sequences: palindromes, mirror repeats, pseudoknots and triple helices. We further explore the idea of a model against which the words in a sequence are compared in order to find these structural objects [17]. In the present case, models are words defined over the alphabet of nucleotides that have both direct and inverse occurrences in the sequence. Moreover, errors (substitutions, deletions and insertions) are allowed between a model and its inverse occurrences. Helix stems may therefore present bulges or interior loops, and mirror repeats need not be exact. Reasonably efficient performance comes from the fact that the parts composing the structures are kept separate until the end, and that filtering for valid occurrences (occurrences that may form part of such a structure) can be done in $O(n)$ time, where $n$ is the length of the sequence. The time complexity of the searching phase (that is, before the structural parts are put together at the end) of both algorithms presented here (one for palindromes and mirror repeats, the other for pseudoknots and triple helices) is then $O(nk(e+1)(1+\min\{d_{max}-d_{min}+1+e,\ k^{e}|\Sigma|^{e}\}))$, where $d_{max}$ and $d_{min}$ are, respectively, the maximal and minimal length of a hairpin loop, $k$ is the length of a model (either a fixed length or the maximum value $k_{max}$ of a range of lengths), $e$ is the maximum number of errors allowed (substitutions, deletions and insertions) and $|\Sigma|$ is the size of the alphabet of nucleotides.
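To make the notions of a model, its inverse occurrence and the error bound concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper and does not achieve the complexity stated above: it simply takes each word of length k as a model, builds its inverse (the reverse complement for a palindrome, the plain reverse for a mirror repeat), and checks by edit distance whether an occurrence with at most e errors lies between d_min and d_max positions further on. The function names, the restriction to the DNA alphabet A, C, G, T and the choice of a fixed k are assumptions made for illustration only.

```python
from typing import Iterator, Tuple

# Assumption: plain DNA alphabet; other symbols (N, U, ...) are not handled.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word: str) -> str:
    """Inverse of a word for palindromes (helix stems)."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, deletions and insertions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def find_inverse_occurrences(
    seq: str, k: int, d_min: int, d_max: int, e: int, mirror: bool = False
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (i, j, m, errors): the model seq[i:i+k] occurs directly at i,
    and seq[j:j+m] is an inverse occurrence with at most e errors, where
    the gap j - (i + k) lies between d_min and d_max."""
    n = len(seq)
    for i in range(n - k + 1):
        model = seq[i:i + k]
        inverse = model[::-1] if mirror else reverse_complement(model)
        for d in range(d_min, d_max + 1):
            j = i + k + d
            # With up to e insertions or deletions, an occurrence may be
            # shorter or longer than the model by at most e symbols.
            for m in range(max(1, k - e), k + e + 1):
                if j + m > n:
                    break
                errors = edit_distance(inverse, seq[j:j + m])
                if errors <= e:
                    yield (i, j, m, errors)
```

For example, in the sequence ACGTGCTTTTGCACGT the word ACGTGC at position 0 and the word GCACGT at position 10 form an exact stem around a loop of length 4; replacing one base in the second word still gives a hit once e is set to 1. The same occurrence may be reported under several window lengths m when e > 0, which a real implementation would need to consolidate.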

[1] H. M. Martinez, et al. An efficient method for finding repeats in molecular sequences, 1983, Nucleic Acids Res.

[2] N. A. Kolchanov, et al. Chemical and Computer Probing of RNA Structure, 1996, Progress in Nucleic Acid Research and Molecular Biology.

[3] Eugene W. Myers, et al. An O(NP) Sequence Comparison Algorithm, 1990, Inf. Process. Lett.

[4] Alain Viari, et al. A Distance-Based Block Searching Algorithm, 1995, ISMB.

[5] Alain Viari, et al. A Double Combinatorial Approach to Discovering Patterns in Biological Sequences, 1996, CPM.

[6] J. Abrahams, et al. Prediction of RNA secondary structure, including pseudoknotting, by computer simulation, 1990, Nucleic acids research.

[7] David Sankoff, et al. RNA secondary structures and their prediction, 1984.

[8] A. Viari, et al. Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence database, 1996, Nucleic acids research.

[9] Alain Viari, et al. Multiple Sequence Comparison: A Peptide Matching Approach, 1995, CPM.

[10] C. Pleij, et al. RNA pseudoknots: structure, detection, and prediction, 1989, Methods in enzymology.

[11] A. Stewart. Genes V, 1994.

[12] M. Waterman, et al. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli, 1985, Journal of molecular biology.

[13] Alain Viari, et al. Searching for Repeated Words in a Text Allowing for Mismatches and Gaps, 1995.

[14] C. C. Hardin, et al. RNA structure from A to Z, 1987, Cold Spring Harbor symposia on quantitative biology.

[15] M. Brown, et al. RNA pseudoknot modeling using intersections of stochastic context free grammars with applications to database search, 1996, Pacific Symposium on Biocomputing.

[16] Fabrice Lefebvre. An Optimized Parsing Algorithm Well Suited to RNA Folding, 1995, ISMB.

[17] Michael Zuker, et al. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information, 1981, Nucleic Acids Res.

[18] R. C. Underwood, et al. Stochastic context-free grammars for tRNA modeling, 1994, Nucleic acids research.

[19] Jih-Hsiang Chen, et al. A procedure for RNA pseudoknot prediction, 1992, Comput. Appl. Biosci.

[20] David B. Searls, et al. The Linguistics of DNA, 1992.

[21] S. Mirkin, et al. DNA H form requires a homopurine–homopyrimidine mirror repeat, 1987, Nature.

[22] H. M. Martinez. Detecting pseudoknots and other local base-pairing structures in RNA sequences, 1990, Methods in enzymology.