Multiple sequence alignment: algorithms and applications.

Elucidation of interrelationships among sequence, structure, function, and evolution (FESS relationships) of a family of genes or gene products is a central theme of modern molecular biology. Multiple sequence alignment has been proven to be a powerful tool for many fields of studies such as phylogenetic reconstruction, illumination of functionally important regions, and prediction of higher order structures of proteins and RNAs. However, it is far too trivial to automatically construct a multiple alignment from a set of related sequences. A variety of methods for solving this computationally difficult problem are reviewed. Several important applications of multiple alignment for elucidation of the FESS relationships are also discussed. For a long period, progressive methods have been the only practical means to solve a multiple alignment problem of appreciable size. This situation is now changing with the development of new techniques including several classes of iterative methods. Today's progress in multiple sequence alignment methods has been made by the multidisciplinary endeavors of mathematicians, computer scientists, and biologists in various fields including biophysicists in particular. The ideas are also originated from various backgrounds, pure algorithmics, statistics, thermodynamics, and others. The outcomes are now enjoyed by researchers in many fields of biological sciences. In the near future, generalized multiple alignment may play a central role in studies of FESS relationships. The organized mixture of knowledge from multiple fields will ferment to develop fruitful results which would be hard to obtain within each area. I hope this review provides a useful information resource for future development of theory and practice in this rapidly expanding area of bioinformatics.

[1]  M. Zuker Suboptimal sequence alignment in molecular biology. Alignment with error analysis. , 1991, Journal of molecular biology.

[2]  M. Sternberg,et al.  Towards an automatic method of predicting protein structure by homology: an evaluation of suboptimal sequence alignments. , 1992, Protein engineering.

[3]  Bruce W. Erickson,et al.  Optimal sequence alignment using affine gap costs , 1986 .

[4]  B. Rost,et al.  Redefining the goals of protein secondary structure prediction. , 1994, Journal of molecular biology.

[5]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[6]  A. K. Wong,et al.  A survey of multiple sequence comparison methods. , 1992, Bulletin of mathematical biology.

[7]  S Pascarella,et al.  A databank (3D-ali) collecting related protein sequences and structures. , 1996, Protein engineering.

[8]  M. Fredman,et al.  Algorithms for computing evolutionary similarity measures with length independent gap penalties , 1984 .

[9]  J. P. Dumas,et al.  Efficient algorithms for folding and comparing nucleic acid sequences , 1982, Nucleic Acids Res..

[10]  Chris Sander,et al.  The FSSP database: fold classification based on structure-structure alignment of proteins , 1996, Nucleic Acids Res..

[11]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[12]  P. Argos,et al.  Quantification of secondary structure prediction improvement using multiple alignments. , 1993, Protein engineering.

[13]  James W. Fickett,et al.  Fast optimal alignment , 1984, Nucleic Acids Res..

[14]  F. Corpet Multiple sequence alignment with hierarchical clustering. , 1988, Nucleic acids research.

[15]  G. Gonnet,et al.  Empirical and structural models for insertions and deletions in the divergent evolution of proteins. , 1993, Journal of molecular biology.

[16]  Esko Ukkonen,et al.  Algorithms for Approximate String Matching , 1985, Inf. Control..

[17]  S. Miyazawa A reliable sequence alignment method based on probabilities of residue correspondences. , 1995, Protein engineering.

[18]  Christophe G. Lambert,et al.  Comparative analysis of seven multiple protein sequence alignment servers: clues to enhance reliability of predictions , 1998, Bioinform..

[19]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[20]  Miguel A. Andrade-Navarro,et al.  Computational space reduction and parallelization of a new clustering approach for large groups of sequences , 1998, Bioinform..

[21]  David E. Goldberg,et al.  Genetic Algorithms in Search Optimization and Machine Learning , 1988 .

[22]  Alfred V. Aho,et al.  Data Structures and Algorithms , 1983 .

[23]  D A Morrison,et al.  Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of apicomplexa. , 1997, Molecular biology and evolution.

[24]  J Hein,et al.  A new method that simultaneously aligns and reconstructs ancestral sequences for any number of homologous sequences, when the phylogeny is given. , 1989, Molecular biology and evolution.

[25]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[26]  G D Schuler,et al.  A workbench for multiple alignment construction and analysis , 1991, Proteins.

[27]  John P. Overington,et al.  Alignment and searching for common protein folds using a data bank of structural templates. , 1993, Journal of molecular biology.

[28]  P. Argos,et al.  Determination of reliable regions in protein sequence alignments. , 1990, Protein engineering.

[29]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[30]  D. Lipman,et al.  Extracting protein alignment models from the sequence database. , 1997, Nucleic acids research.

[31]  M. Vingron,et al.  Quantifying the local reliability of a sequence alignment. , 1996, Protein engineering.

[32]  D. Lipman,et al.  The multiple sequence alignment problem in biology , 1988 .

[33]  H. M. Martinez,et al.  A multiple sequence alignment program , 1986, Nucleic Acids Res..

[34]  T. Blundell,et al.  Definition of general topological equivalence in protein structures. A procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. , 1990, Journal of molecular biology.

[35]  M. Waterman,et al.  Line geometries for sequence comparisons , 1984 .

[36]  John P. Overington,et al.  Derivation of rules for comparative protein modeling from a database of protein structure alignments , 1994, Protein science : a publication of the Protein Society.

[37]  T. Smith,et al.  Functional genomics--bioinformatics is ready for the challenge. , 1998, Trends in genetics : TIG.

[38]  W. Bains,et al.  MULTAN: a program to align multiple DNA sequences , 1986, Nucleic Acids Res..

[39]  W. A. Beyer,et al.  Some Biological Sequence Metrics , 1976 .

[40]  M S Waterman,et al.  Multiple sequence alignment by consensus. , 1986, Nucleic acids research.

[41]  P. Argos,et al.  An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. , 1995, Journal of molecular biology.

[42]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[43]  S Karlin,et al.  A symmetric-iterated multiple alignment of protein sequences. , 1998, Journal of molecular biology.

[44]  Peter H. Sellers,et al.  The Theory and Computation of Evolutionary Distances: Pattern Recognition , 1980, J. Algorithms.

[45]  J. Thompson,et al.  Using CLUSTAL for multiple sequence alignments. , 1996, Methods in enzymology.

[46]  S. Altschul Gap costs for multiple sequence alignment. , 1989, Journal of theoretical biology.

[47]  T G Marr,et al.  Alignment of molecular sequences seen as random path analysis. , 1995, Journal of theoretical biology.

[48]  J. Hein,et al.  A tree reconstruction method that is economical in the number of pairwise comparisons used. , 1989, Molecular biology and evolution.

[49]  Martin Vingron,et al.  Multiple Sequence Comparison and Consistency on Multipartite Graphs , 1995 .

[50]  O. Gotoh Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. , 1996, Journal of molecular biology.

[51]  E J Milner-White,et al.  Mix'n'Match: an improved multiple sequence alignment procedure for distantly related proteins using secondary structure predictions, designed to be independent of the choice of gap penalty and scoring matrix. , 1993, Protein engineering.

[52]  O. Gotoh,et al.  Substrate recognition sites in cytochrome P450 family 2 (CYP2) proteins inferred from comparative analyses of amino acid and coding nucleotide sequences. , 1992, The Journal of biological chemistry.

[53]  J A Lake,et al.  The order of sequence alignment can bias the selection of tree topology. , 1991, Molecular biology and evolution.

[54]  E. G. Shpaer,et al.  Sensitivity and selectivity in protein similarity searches: a comparison of Smith-Waterman in hardware to BLAST and FASTA. , 1996, Genomics.

[55]  Raffaele Giancarlo,et al.  Speeding up Dynamic Programming with Applications to Molecular Biology , 1989, Theor. Comput. Sci..

[56]  A. Godzik The structural alignment between two proteins: Is there a unique answer? , 1996, Protein science : a publication of the Protein Society.

[57]  R F Doolittle,et al.  Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. , 1996, Methods in enzymology.

[58]  S F Altschul,et al.  Weights for data related by a tree. , 1989, Journal of molecular biology.

[59]  Michael S. Waterman,et al.  Introduction to computational biology , 1995 .

[60]  M. Kimura,et al.  The neutral theory of molecular evolution. , 1983, Scientific American.

[61]  O. Gotoh Consistency of optimal sequence alignments. , 1990, Bulletin of Mathematical Biology.

[62]  P. K. Mehta,et al.  A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70% , 1995, Protein science : a publication of the Protein Society.

[63]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[64]  Kun-Mao Chao,et al.  Recent Developments in Linear-Space Alignment Methods: A Survey , 1994, J. Comput. Biol..

[65]  E S Lander,et al.  Recognition of related proteins by iterative template refinement (ITR) , 1994, Protein science : a publication of the Protein Society.

[66]  S Subbiah,et al.  A method for multiple sequence alignment with gaps. , 1989, Journal of molecular biology.

[67]  M. Sternberg,et al.  A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. , 1987, Journal of molecular biology.

[68]  B. Rost,et al.  Protein fold recognition by prediction-based threading. , 1997, Journal of molecular biology.

[69]  R. King,et al.  Identification and application of the concepts important for accurate and reliable protein secondary structure prediction , 1996, Protein science : a publication of the Protein Society.

[70]  T. P. Flores,et al.  Multiple protein structure alignment , 1994, Protein science : a publication of the Protein Society.

[71]  P. Sellers On the Theory and Computation of Evolutionary Distances , 1974 .

[72]  W. Taylor,et al.  Multiple sequence threading: an analysis of alignment quality and stability. , 1997, Journal of molecular biology.

[73]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[74]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[75]  O. Gotoh Alignment of three biological sequences with an efficient traceback procedure. , 1986, Journal of theoretical biology.

[76]  M. I. Kanehisa,et al.  Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries , 1982, Nucleic Acids Res..

[77]  D. Higgins,et al.  RAGA: RNA sequence alignment by genetic algorithm. , 1997, Nucleic acids research.

[78]  C. Sander,et al.  Protein structure comparison by alignment of distance matrices. , 1993, Journal of molecular biology.

[79]  Michael S. Waterman,et al.  General methods of sequence comparison , 1984 .

[80]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[81]  S Brunak,et al.  Multiple alignment using simulated annealing: branch point definition in human mRNA splicing. , 1992, Nucleic acids research.

[82]  T. L. Blundell,et al.  Knowledge-based prediction of protein structures and the design of novel molecules , 1987, Nature.

[83]  N. Saitou,et al.  Reconstruction of gene trees from sequence data. , 1996, Methods in enzymology.

[84]  J. Gibrat,et al.  Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs. , 1987, Journal of molecular biology.

[85]  P. Argos,et al.  Analysis of insertions/deletions in protein structures. , 1992, Journal of molecular biology.

[86]  Philip Taylor,et al.  A fast homology program for aligning biological sequences , 1984, Nucleic Acids Res..

[87]  P. Argos,et al.  Motif recognition and alignment for many sequences by comparison of dot-matrices. , 1991, Journal of molecular biology.

[88]  A A Salamov,et al.  Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. , 1995, Journal of molecular biology.

[89]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[90]  R. Lathrop The protein threading problem with sequence amino acid interaction preferences is NP-complete. , 1994, Protein engineering.

[91]  R Staden,et al.  An interactive graphics program for comparing and aligning nucleic acid and amino acid sequences. , 1982, Nucleic acids research.

[92]  J. Spouge Speeding up dynamic programming algorithms for finding optimal lattice paths , 1989 .

[93]  T. Smith,et al.  Alignment of protein sequences using secondary structure: a modified dynamic programming method. , 1990, Protein engineering.

[94]  M. A. McClure,et al.  Comparative analysis of multiple protein-sequence alignment methods. , 1994, Molecular biology and evolution.

[95]  O. Gotoh,et al.  Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes. , 1998, Molecular biology and evolution.

[96]  G. Barton,et al.  Multiple protein sequence alignment from tertiary structure comparison: Assignment of global and residue confidence levels , 1992, Proteins.

[97]  J. Felsenstein Phylogenies from molecular sequences: inference and reliability. , 1988, Annual review of genetics.

[98]  M. Bishop,et al.  Maximum likelihood alignment of DNA sequences. , 1986, Journal of molecular biology.

[99]  J. Adachi,et al.  MOLPHY version 2.3 : programs for molecular phylogenetics based on maximum likelihood , 1996 .

[100]  D. Higgins,et al.  SAGA: sequence alignment by genetic algorithm. , 1996, Nucleic acids research.

[101]  John P. Overington,et al.  HOMSTRAD: A database of protein structure alignments for homologous families , 1998, Protein science : a publication of the Protein Society.

[102]  E. Myers,et al.  Sequence comparison with concave weighting functions. , 1988, Bulletin of mathematical biology.

[103]  O. Gotoh,et al.  Optimal sequence alignment allowing for long gaps. , 1990, Bulletin of mathematical biology.

[104]  M. Nei,et al.  Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection , 1988, Nature.

[105]  P. Bork,et al.  Predicting functions from protein sequences—where are the bottlenecks? , 1998, Nature Genetics.

[106]  Hiroshi Imai,et al.  Enhanced A* Algorithms for Multiple Alignments: Optimal Alignments for Several Sequences and k-Opt Approximate Alignments for Large Cases , 1999, Theoretical Computer Science.

[107]  David C. Jones,et al.  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. , 1996, Journal of molecular biology.

[108]  David Sankoff,et al.  Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison , 1983 .

[109]  J. Thompson,et al.  The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. , 1997, Nucleic acids research.

[110]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[111]  T L Blundell,et al.  A variable gap penalty function and feature weights for protein 3-D structure comparisons. , 1992, Protein engineering.

[112]  Sandeep K. Gupta,et al.  Improving the Practical Space and Time Efficiency of the Shortest-Paths Approach to Sum-of-Pairs Multiple Sequence Alignment , 1995, J. Comput. Biol..

[113]  M. Nei,et al.  Phylogenetic analysis in molecular evolutionary genetics. , 1996, Annual review of genetics.

[114]  Hans Söderlund,et al.  Algorithms for the search of amino acid patterns in nucleic acid sequences , 1986, Nucleic Acids Res..

[115]  T L Blundell,et al.  CAMPASS: a database of structurally aligned protein superfamilies. , 1998, Structure.

[116]  Liisa Holm,et al.  COFFEE: an objective function for multiple sequence alignments , 1998, Bioinform..

[117]  M. Kanehisa,et al.  Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. , 1996, Protein engineering.

[118]  G. Gonnet,et al.  Exhaustive matching of the entire protein sequence database. , 1992, Science.

[119]  M Levitt,et al.  Alignment of the amino acid sequences of distantly related proteins using variable gap penalties. , 1986, Protein engineering.