A simple genetic algorithm for multiple sequence alignment.

Multiple sequence alignment plays an important role in molecular sequence analysis. An alignment is an arrangement of two (pairwise alignment) or more (multiple alignment) sequences of 'residues' (nucleotides or amino acids) that maximizes the similarities between them. Algorithmically, the problem consists of opening and extending gaps in the sequences to maximize an objective function (a measure of similarity). A simple genetic algorithm was developed and implemented in the software MSA-GA. Genetic algorithms, a class of evolutionary algorithms, are well suited to problems of this nature because residues and gaps are discrete units. An evolutionary algorithm cannot compete with progressive alignment methods in terms of speed, but it has the advantage of being able to correct initially misaligned sequences, which is not possible with the progressive method. This was shown using the BAliBASE benchmark, where Clustal-W alignments were used to seed the initial population in MSA-GA, improving the outcome. Alignment scoring functions remain an open field of research, and it is important to develop methods that simplify the testing of new functions. A general evolutionary framework for testing and implementing different scoring functions was therefore developed. The results show that a simple genetic algorithm is capable of optimizing an alignment without the excessively complex operators used in prior studies. The clear separation between the objective function and the genetic algorithm in MSA-GA makes extending or replacing objective functions a trivial task.
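
The separation between objective function and genetic algorithm described above can be illustrated with a minimal sketch. It is not the MSA-GA implementation: the sum-of-pairs scoring weights, the gap-shifting mutation, the row-wise crossover, and all function and parameter names (`sum_of_pairs`, `random_alignment`, `mutate`, `evolve`, `extra`, `pop_size`) are assumptions chosen for illustration. The point is only that the search loop receives the scoring function as an argument, so a different function can be swapped in without touching the algorithm.

```python
import random
from itertools import combinations

GAP = "-"

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Assumed objective: score every residue pair in every column."""
    score = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == GAP and b == GAP:
                continue              # gap-gap pairs contribute nothing
            elif a == GAP or b == GAP:
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score

def random_alignment(seqs, rng, extra=2):
    """Pad every sequence to a common width by inserting gaps at random."""
    width = max(len(s) for s in seqs) + extra
    rows = []
    for s in seqs:
        row = list(s)
        while len(row) < width:
            row.insert(rng.randrange(len(row) + 1), GAP)
        rows.append("".join(row))
    return rows

def mutate(alignment, rng):
    """Move one gap in one randomly chosen row to a new position."""
    rows = list(alignment)
    i = rng.randrange(len(rows))
    row = list(rows[i])
    gaps = [k for k, ch in enumerate(row) if ch == GAP]
    row.pop(rng.choice(gaps))
    row.insert(rng.randrange(len(row) + 1), GAP)
    rows[i] = "".join(row)
    return rows

def evolve(seqs, objective=sum_of_pairs, pop_size=30, generations=200, seed=0):
    """Simple GA: truncation selection with elitism, row-wise crossover, mutation."""
    rng = random.Random(seed)
    population = [random_alignment(seqs, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=objective, reverse=True)
        elite = population[: pop_size // 2]       # best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            # each row of the child is copied from one of the two parents
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            children.append(mutate(child, rng))
        population = elite + children
    return max(population, key=objective)

if __name__ == "__main__":
    best = evolve(["ACGT", "AGT", "ACT"])
    print("\n".join(best), sum_of_pairs(best))
```

Because the best half of each generation is carried over unchanged, the best score never decreases from one generation to the next. Replacing the objective only requires passing another callable, for example `evolve(seqs, objective=my_score)`, where `my_score` is a hypothetical function that maps a list of equal-length gapped strings to a number.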
