Multiple Sequence Alignment Using an Exhaustive and Greedy Algorithm

We describe an exhaustive and greedy algorithm for improving the accuracy of multiple sequence alignment. A simple progressive alignment approach is employed to provide initial alignments. The initial alignment is then iteratively optimized against an objective function. For any working alignment, the optimization involves three operations: insertions, deletions and shuffles of gaps. The optimization is exhaustive since the algorithm applies the above operations to all eligible positions of an alignment. It is also greedy since only the operation that gives the best improving objective score will be accepted. The algorithms have been implemented in the EGMA (Exhaustive and Greedy Multiple Alignment) package using Java programming language, and have been evaluated using the BAliBASE benchmark alignment database. Although EGMA is not guaranteed to produce globally optimized alignment, the tests indicate that EGMA is able to build alignments with high quality consistently, compared with other commonly used iterative and non-iterative alignment programs. It is also useful for refining multiple alignments obtained by other methods.

[1]  Olivier Poch,et al.  A comprehensive comparison of multiple sequence alignment programs , 1999, Nucleic Acids Res..

[2]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[3]  R. Doolittle,et al.  Progressive sequence alignment as a prerequisitetto correct phylogenetic trees , 2007, Journal of Molecular Evolution.

[4]  Smith Rf,et al.  Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling. , 1992 .

[5]  F. Corpet Multiple sequence alignment with hierarchical clustering. , 1988, Nucleic acids research.

[6]  A. Dress,et al.  Multiple DNA and protein sequence alignment based on segment-to-segment comparison. , 1996, Proceedings of the National Academy of Sciences of the United States of America.

[7]  D. Higgins,et al.  SAGA: sequence alignment by genetic algorithm. , 1996, Nucleic acids research.

[8]  Olivier Poch,et al.  BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs , 1999, Bioinform..

[9]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[10]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..

[11]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[12]  C. Notredame,et al.  Recent progress in multiple sequence alignment: a survey. , 2002, Pharmacogenomics.

[13]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.