A genetic algorithm for multiple molecular sequence alignment

MOTIVATION: Multiple molecular sequence alignment is among the most important and most challenging tasks in computational biology. Current alignment techniques are computationally expensive, which limits their wider use. This research aims to develop a new technique for efficient multiple sequence alignment.

APPROACH: The new method is based on genetic algorithms, which are stochastic approaches for efficient and robust searching. By recasting biomolecular sequence alignment as a search for optimal or near-optimal points in an 'alignment space', a genetic algorithm can be used to find good alignments very efficiently.

RESULTS: Experiments on real data sets have shown that the average computing time of this technique may be two or three orders of magnitude lower than that of a technique based on pairwise dynamic programming, while the alignment qualities are very similar.

AVAILABILITY: A C program for UNIX has been written to implement the technique. It is available on request from the authors.
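To make the idea of searching an 'alignment space' concrete, the following is a minimal illustrative sketch in Python. It is not the authors' C program, and it should not be read as their method: the abstract does not specify the encoding, fitness function, or operators. The sum-of-pairs score, the gap-shift mutation, the row-wise crossover, the truncation selection, and all parameter values and names (`sp_score`, `ga_align`, `extra`, `pop_size`) are assumptions chosen for brevity.

```python
import random
from itertools import combinations

GAP = "-"

def random_row(seq, width, rng):
    """Insert gaps at random columns so the row has exactly `width` columns."""
    gap_cols = set(rng.sample(range(width), width - len(seq)))
    residues = iter(seq)
    return "".join(GAP if c in gap_cols else next(residues) for c in range(width))

def sp_score(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score over all columns (assumed fitness; gap-gap pairs score 0)."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == GAP and b == GAP:
                continue
            if a == GAP or b == GAP:
                total += gap
            else:
                total += match if a == b else mismatch
    return total

def mutate(alignment, rng):
    """Swap one gap with a neighbouring symbol in a random row.

    Residue order is preserved, so the row still spells the same sequence.
    """
    rows = list(alignment)
    i = rng.randrange(len(rows))
    row = list(rows[i])
    gaps = [c for c, ch in enumerate(row) if ch == GAP]
    if gaps:
        c = rng.choice(gaps)
        d = c + rng.choice((-1, 1))
        if 0 <= d < len(row):
            row[c], row[d] = row[d], row[c]
    rows[i] = "".join(row)
    return tuple(rows)

def crossover(a, b, rng):
    """Row-wise uniform crossover; valid because every row has the same width."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def ga_align(seqs, extra=2, pop_size=60, generations=200, seed=0):
    """Evolve a population of candidate alignments and return the best one found."""
    rng = random.Random(seed)
    width = max(map(len, seqs)) + extra
    pop = [tuple(random_row(s, width, rng) for s in seqs) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sp_score, reverse=True)
        elite = pop[: pop_size // 2]  # truncation selection keeps the better half
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    return max(pop, key=sp_score)

if __name__ == "__main__":
    best = ga_align(["GATTACA", "GATACA", "GTTACA"])
    print("\n".join(best), sp_score(best))
```

Each individual is a complete candidate alignment, so every point the search visits is valid: removing the gaps from any row recovers the input sequence. Because the better half of the population is carried over unchanged, the best score never decreases between generations, although nothing guarantees that the final alignment is optimal.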
