An improved algorithm for the regular expression constrained multiple sequence alignment problem

Constrained sequence alignment has been proposed as a way of incorporating biologists' knowledge about common structures or functions into the alignment process. For the alignment of protein sequences, several studies have suggested using motifs from the PROSITE database, which are expressed as restricted regular expressions, to guide alignments. Regular expression constrained sequence alignment was introduced for this purpose. An alignment satisfies the constraint if part of it matches a given regular expression in each dimension (i.e., in each aligned sequence). One existing method rewards alignments that include a region matching the given regular expression, but it does not guarantee that the constraint is satisfied. Another method constructs a weighted finite automaton from the given regular expression and presents a dynamic programming solution that simulates copies of this automaton to find a maximum-score alignment satisfying the regular expression constraint.

We propose a new algorithm for the regular expression constrained multiple sequence alignment problem. Our algorithm considers two layers, each of which corresponds to part of the dynamic programming matrix for the alignment of the given sequences, and computes each layer differently using dynamic programming. We also propose a modification to the problem definition: the region satisfying the constraint does not contribute to the total score. This modification is not necessary for correctness or performance in certain cases, such as when the constraint involves only one motif or when the motif-matching regions span a short distance in each sequence, but we believe that it achieves the same goal with less work in practice. Our algorithm is much more efficient than a previously proposed algorithm that uses weighted automata, and its performance in practice is comparable to (and under certain conditions even better than) that of the ordinary (unconstrained) multiple sequence alignment algorithm. Our experiments on real biological sequences, with regular expressions each composed of a sequence of motifs, confirm this.
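
The two-layer idea, together with the modified scoring in which the constrained region contributes nothing, can be illustrated with a small pairwise sketch. This is not the algorithm of the paper: it handles only two sequences and a single regular expression, makes no attempt at efficiency, and the function names (`match_starts`, `constrained_score`) and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. Layer 0 holds ordinary global alignment scores of prefixes; layer 1 holds scores of prefixes in which a region matching the regular expression has already been consumed in both sequences.

```python
import re

NEG = float("-inf")


def match_starts(pattern, s):
    """starts[e] lists every b < e such that s[b:e] fully matches pattern."""
    rx = re.compile(pattern)
    starts = [[] for _ in range(len(s) + 1)]
    for b in range(len(s) + 1):
        for e in range(b + 1, len(s) + 1):
            if rx.fullmatch(s[b:e]):
                starts[e].append(b)
    return starts


def constrained_score(a, b, pattern, match=1, mismatch=-1, gap=-1):
    """Best global alignment score in which one region of each sequence
    matches `pattern`; the matched regions add 0 to the score.
    Scoring values are illustrative assumptions, not taken from the paper."""
    n, m = len(a), len(b)
    sa, sb = match_starts(pattern, a), match_starts(pattern, b)

    def step(d, i, j):
        # Ordinary alignment moves within a single layer.
        best = NEG
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = max(best, d[i - 1][j - 1] + sub)
        if i > 0:
            best = max(best, d[i - 1][j] + gap)
        if j > 0:
            best = max(best, d[i][j - 1] + gap)
        return best

    # Layer 0: no constrained region has been used yet.
    d0 = [[NEG] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            d0[i][j] = 0 if i == 0 and j == 0 else step(d0, i, j)

    # Layer 1: a matching region has been consumed in both sequences.
    d1 = [[NEG] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            best = step(d1, i, j)
            # Jump from layer 0 across a matching region at zero cost.
            for bi in sa[i]:
                for bj in sb[j]:
                    best = max(best, d0[bi][bj])
            d1[i][j] = best

    return d1[n][m]  # -inf if the constraint cannot be satisfied
```

For example, `constrained_score("AAGXKCC", "AAGYKCC", "G.K")` scores only the flanking `AA` and `CC` columns, because the regions `GXK` and `GYK` are skipped at zero cost. The call returns negative infinity when either sequence contains no substring matching the pattern.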
