Regular expression constrained sequence alignment

We introduce regular expression constrained sequence alignment as the problem of finding the maximum alignment score between given strings S_1 and S_2 over all alignments that contain a segment in which some substring s_1 of S_1 is aligned to some substring s_2 of S_2, where both s_1 and s_2 match a given regular expression R, i.e. s_1, s_2 ∈ L(R), where L(R) is the regular language described by R. For complexity results we assume, without loss of generality, that n = |S_1| ≥ m = |S_2|. A motivation for the problem is that protein sequences can be aligned so that known motifs guide the alignments. We present an O(nmr) time algorithm for the regular expression constrained sequence alignment problem, where r = O(t^4) and t is the number of states of a nondeterministic finite automaton N that accepts L(R). Our algorithm uses a nondeterministic weighted finite automaton M that we construct from N. M has O(t^2) states; its transition weights are obtained from the given costs of edit operations, and its state weights correspond to optimum alignment scores that we compute using the underlying dynamic programming solution for sequence alignment. If we are given a deterministic finite automaton D accepting L(R) with t_d states, then our construction creates a deterministic finite automaton M_d with t_d^2 states. In this case, our algorithm takes O(t_d^2 nm) time. Using M_d results in faster computation than using M when t_d
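The deterministic case can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes global alignment, a linear gap penalty, and simple match/mismatch scores, none of which are fixed by the text above. The names `constrained_alignment_score`, `delta`, `start`, and `accepting` are hypothetical. The sketch tracks a pair of DFA states (one for the part of s_1 read so far, one for s_2), which is one way to obtain t_d^2 pair states and a running time proportional to t_d^2 nm.

```python
NEG = float("-inf")

def constrained_alignment_score(s1, s2, delta, start, accepting,
                                match=1, mismatch=-1, gap=-1):
    """Best global alignment score of s1 and s2 over alignments containing a
    segment whose s1 side and s2 side are both accepted by the DFA.

    delta is a partial transition map {(state, symbol): state}; a missing
    entry means the DFA rejects. Scoring parameters are assumptions.
    """
    n, m = len(s1), len(s2)
    # table[i][j] maps a layer to the best score for s1[:i] versus s2[:j].
    # Layers: "pre" (segment not opened), (p, q) (inside the segment, with
    # DFA states p for s1 and q for s2), "post" (segment closed).
    table = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    table[0][0]["pre"] = 0

    def relax(i, j, layer, score):
        cell = table[i][j]
        if score > cell.get(layer, NEG):
            cell[layer] = score

    for i in range(n + 1):
        for j in range(m + 1):
            cell = table[i][j]
            # Free moves within a cell: open the segment, then close it
            # whenever both DFA states are accepting.
            if "pre" in cell:
                relax(i, j, (start, start), cell["pre"])
            for layer, score in list(cell.items()):
                if (isinstance(layer, tuple)
                        and layer[0] in accepting and layer[1] in accepting):
                    relax(i, j, "post", score)
            # Push scores to the three neighbouring cells.
            for layer, score in list(cell.items()):
                for di, dj in ((1, 1), (1, 0), (0, 1)):
                    if i + di > n or j + dj > m:
                        continue
                    a = s1[i] if di else None  # symbol consumed from s1
                    b = s2[j] if dj else None  # symbol consumed from s2
                    nxt = layer
                    if isinstance(layer, tuple):
                        p, q = layer
                        if a is not None:
                            p = delta.get((p, a))
                        if b is not None:
                            q = delta.get((q, b))
                        if p is None or q is None:
                            continue  # one side left the language
                        nxt = (p, q)
                    if di and dj:
                        cost = match if a == b else mismatch
                    else:
                        cost = gap
                    relax(i + di, j + dj, nxt, score + cost)
    return table[n][m].get("post", NEG)

# Usage: R = "a", given as a two-state DFA with accepting state 1.
print(constrained_alignment_score("ab", "ba", {(0, "a"): 1}, 0, {1}))
```

The function returns negative infinity when no alignment satisfies the constraint, for example when one of the strings has no substring in L(R).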
