A Heuristic Algorithm for Multiple Sequence Alignment Based on Blocks

Blocked multiple sequence alignment constructs a multiple alignment by first aligning conserved regions into what we call “blocks” and then aligning the regions between successive blocks to form the final alignment. Instead of starting from low-order pairwise alignments, we propose a new way to form blocks: searching for closely related regions across all input sequences, allowing internal spaces in blocks as well as some degree of mismatch. We address the problem of semi-conserved patterns (patterns that do not appear in all input sequences) by introducing two similarity thresholds that are adjusted dynamically according to the input. We also present a method to control the number of blocks, for cases where the input sequences have so many similar regions that forming blocks by trying every combination becomes impractical. BMA is an implementation of this approach, and our experimental results indicate that it is efficient, particularly on large numbers of long sequences with well-conserved regions.
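To make the two-phase idea concrete, the following is a minimal sketch of block-based alignment under strong simplifying assumptions. It is not the BMA algorithm. Blocks here are exact words of a fixed length `k` that occur in every sequence, with no mismatches, no internal spaces, and no similarity thresholds. The regions between blocks are padded with gaps rather than properly aligned. The function names `find_blocks` and `align_with_blocks` and the parameter `k` are hypothetical and chosen only for illustration.

```python
def find_blocks(seqs, k):
    """Greedily chain exact k-letter words shared by all sequences.

    Simplifying assumption: a block is an exact word of length k that
    occurs in every sequence, to the right of the previous block.
    Returns a list of tuples holding one start position per sequence.
    """
    blocks = []
    cursors = [0] * len(seqs)  # search only to the right of the last block
    first = seqs[0]
    i = 0
    while i + k <= len(first):
        word = first[i:i + k]
        starts = [i] + [s.find(word, c) for s, c in zip(seqs[1:], cursors[1:])]
        if all(p >= 0 for p in starts):
            blocks.append(tuple(starts))
            cursors = [p + k for p in starts]
            i += k  # blocks must not overlap
        else:
            i += 1
    return blocks


def align_with_blocks(seqs, k):
    """Build alignment rows: blocks line up in the same columns, and the
    regions between blocks are left-justified and padded with '-'.

    Padding stands in for the real inter-block alignment step.
    """
    blocks = find_blocks(seqs, k)
    rows = [""] * len(seqs)
    prev = [0] * len(seqs)
    # A sentinel "block" at the sequence ends flushes the trailing regions.
    for starts in blocks + [tuple(len(s) for s in seqs)]:
        between = [s[p:q] for s, p, q in zip(seqs, prev, starts)]
        width = max(len(b) for b in between)
        for idx, (s, b, q) in enumerate(zip(seqs, between, starts)):
            rows[idx] += b.ljust(width, "-") + s[q:q + k]
        prev = [q + k for q in starts]
    return rows


if __name__ == "__main__":
    for row in align_with_blocks(["ACGTTTGCA", "GGACGTGCA", "ACGTAGCA"], 3):
        print(row)
```

The greedy left-to-right chaining keeps the sketch short, but it can miss better chains. It also does nothing to limit the number of candidate blocks, which is the situation the block-count control described above is meant to handle.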
