A heuristic algorithm for blocked multiple sequence alignment

Blocked multiple sequence alignment (BMA) constructs a multiple alignment of DNA sequences by first aligning conserved regions into what we call "blocks" and then aligning the regions between successive blocks to form the final alignment. Instead of starting from low-order pairwise alignments, we propose a new way to form blocks: searching for closely related regions across all input sequences, while allowing internal spaces within blocks as well as some degree of mismatch. We address the problem of semi-conserved patterns (patterns that do not appear in every input sequence) by introducing two similarity thresholds that are adjusted dynamically according to the input. We also present a method to control the number of blocks, for the case in which the input sequences have so many similar regions that forming blocks by trying every combination becomes impractical. BMA implements this approach, and our experimental results indicate that it is efficient, particularly on large numbers of long sequences with well-conserved regions.
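To make the two-phase idea concrete, the following is a minimal Python sketch of a block-then-fill pipeline. It is not the BMA algorithm itself. It rests on strong simplifying assumptions: blocks are exact k-mers shared by every sequence (no internal spaces, no mismatches, no similarity thresholds), only the first occurrence of each k-mer is used, and the regions between blocks are merely padded with gaps rather than aligned. The function names `find_blocks`, `chain_blocks`, and `assemble` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Block = Tuple[int, ...]  # start position of the block in each sequence


def find_blocks(seqs: List[str], k: int) -> List[Block]:
    """Collect k-mers of the first sequence that occur in every sequence.

    Simplification: exact matches only, first occurrence per sequence.
    """
    blocks = set()
    first = seqs[0]
    for i in range(len(first) - k + 1):
        kmer = first[i:i + k]
        if all(kmer in s for s in seqs):
            blocks.add(tuple(s.find(kmer) for s in seqs))
    return sorted(blocks)


def chain_blocks(blocks: List[Block], k: int) -> List[Block]:
    """Pick a longest chain of blocks that are non-overlapping and
    appear in the same order in every sequence (quadratic DP)."""
    if not blocks:
        return []
    blocks = sorted(blocks)
    best = [1] * len(blocks)
    prev = [-1] * len(blocks)
    for j in range(len(blocks)):
        for i in range(j):
            ordered = all(a + k <= b for a, b in zip(blocks[i], blocks[j]))
            if ordered and best[i] + 1 > best[j]:
                best[j] = best[i] + 1
                prev[j] = i
    j = max(range(len(blocks)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(blocks[j])
        j = prev[j]
    return chain[::-1]


def assemble(seqs: List[str], chain: List[Block], k: int) -> List[str]:
    """Emit one aligned row per sequence.

    Placeholder for the second phase: inter-block regions are only
    left-justified and padded with '-', not actually aligned.
    """
    rows = [""] * len(seqs)
    cursor = [0] * len(seqs)
    sentinel = tuple(len(s) for s in seqs)  # closes the trailing region
    for block in chain + [sentinel]:
        between = [s[c:b] for s, c, b in zip(seqs, cursor, block)]
        width = max(len(g) for g in between)
        for n, (s, g, b) in enumerate(zip(seqs, between, block)):
            rows[n] += g.ljust(width, "-") + s[b:b + k]
        cursor = [b + k for b in block]
    return rows


if __name__ == "__main__":
    seqs = ["TTACGTAGGCATG", "ACGTACCGCATGA", "GACGTATTTGCATG"]
    chain = chain_blocks(find_blocks(seqs, 4), 4)
    for row in assemble(seqs, chain, 4):
        print(row)
```

In this sketch, every row has the same length and removing the gap characters recovers the original sequence. A fuller implementation would replace `find_blocks` with an approximate search that tolerates spaces and mismatches, and replace the padding step with a real alignment of each inter-block region.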
