Multiple sequence alignment using a genetic algorithm and GLOCSA

Algorithms that directly minimize putative synapomorphy in an alignment cannot be implemented as such, because trivial cases with concatenated sequences would be selected: they imply a minimum number of events to be explained (e.g., a single insertion/deletion would suffice to explain the divergence between two sequences). Therefore, indirect measures that approach parsimony need to be implemented. In this paper, we present in detail a Global Criterion for Sequence Alignment (GLOCSA), which uses a scoring function to rate multiple alignments globally, with the aim of producing matrices that minimize the number of putative synapomorphies. We also present a Genetic Algorithm that uses GLOCSA as its objective function and produces sequence alignments by refining alignments previously generated by other existing alignment tools (we recommend MUSCLE). We show that, in the example cases, our GLOCSA-guided Genetic Algorithm (GGGA) does improve the GLOCSA values, resulting in alignments that imply fewer putative synapomorphies.
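To make the refinement idea concrete, the following is a minimal Python sketch of a genetic algorithm that starts from a seed alignment and keeps the best-scoring variants. It is not the GGGA implementation, and the objective used here is not GLOCSA (the abstract does not give its formula): `placeholder_score` is a hypothetical stand-in that rewards column agreement and penalizes gap blocks. The operators `mutate` and `crossover`, the parameter defaults, and the seed alignment are likewise assumptions made only for illustration.

```python
import random
from collections import Counter

GAP = "-"

def placeholder_score(alignment):
    """Hypothetical stand-in objective (NOT GLOCSA): column agreement minus gap blocks."""
    agreement = 0.0
    for col in zip(*alignment):
        residues = [c for c in col if c != GAP]
        if residues:
            # Fraction of rows that share the most common residue in this column.
            agreement += Counter(residues).most_common(1)[0][1] / len(col)
    # A gap block is a maximal run of gaps; count how many such runs start in each row.
    gap_blocks = sum(
        sum(1 for i, c in enumerate(row) if c == GAP and (i == 0 or row[i - 1] != GAP))
        for row in alignment
    )
    return agreement - 0.5 * gap_blocks

def mutate(alignment, rng):
    """Shift one gap of one row by one column; the ungapped sequence is unchanged."""
    rows = [list(r) for r in alignment]
    row = rng.choice(rows)
    gaps = [i for i, c in enumerate(row) if c == GAP]
    if gaps:
        i = rng.choice(gaps)
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(row):
            row[i], row[j] = row[j], row[i]
    return ["".join(r) for r in rows]

def crossover(a, b, rng):
    """Row-wise crossover: each row is inherited from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]

def refine(seed, generations=200, pop_size=20, rng=None):
    """Refine a seed alignment (e.g. one produced by another tool) with elitist selection."""
    if rng is None:
        rng = random.Random(0)
    population = [list(seed) for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [
            mutate(crossover(rng.choice(population), rng.choice(population), rng), rng)
            for _ in range(pop_size)
        ]
        # Parents compete with offspring, so the best score never decreases.
        population = sorted(population + offspring, key=placeholder_score, reverse=True)[:pop_size]
    return population[0]

seed = ["ACGT-A", "A-GTTA", "ACG-TA"]  # toy seed alignment, invented for this sketch
best = refine(seed)
```

Because selection is elitist and both operators only move gaps, the returned alignment scores at least as well as the seed under the placeholder objective and still contains the original sequences. Swapping in a real objective would only require replacing `placeholder_score`.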
