Sequence alignment and penalty choice. Review of concepts, case studies and implications.

Alignment algorithms for comparing DNA or amino acid sequences are widely used tools in molecular biology. These algorithms depend on the setting of various parameters, most notably gap penalties, yet the effect of such parameters on the resulting alignments is still poorly understood. This paper begins by reviewing two recent advances in algorithms and probability that enable a new approach to this question. The first is a newly developed method for efficiently delineating all optimal alignments that arise under all choices of parameters. The second comprises insights into the statistical behavior of optimal alignment scores. Together these give a better understanding of how alignments depend on parameters in general. We propose novel criteria for detecting biologically good alignments and highlight some specific features of the interaction between similarity matrices and gap penalties. To illustrate our analysis, we present a detailed study of the comparison of two immunoglobulin sequences.
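The dependence of an optimal alignment on the gap penalty can be illustrated with a minimal sketch. It assumes a global (Needleman-Wunsch style) alignment with a linear gap penalty and a simple match/mismatch score; these scoring values, the function names, and the brute-force scan over penalties are illustrative assumptions, not the similarity matrices or the parametric method discussed in the paper.

```python
def global_alignment_score(a, b, match=1.0, mismatch=-1.0, gap=2.0):
    """Optimal global alignment score with a linear gap penalty.

    Assumed scoring (hypothetical, for illustration only): `match` for
    identical residues, `mismatch` otherwise, and `gap` subtracted for
    every residue aligned against a gap.
    """
    # Row 0: aligning a prefix of b against an empty prefix of a.
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] - gap,      # a[i-1] against a gap
                curr[j - 1] - gap,  # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]


def scan_gap_penalties(a, b, penalties):
    """Brute-force scan: optimal score for each gap penalty in `penalties`."""
    return {g: global_alignment_score(a, b, gap=g) for g in penalties}
```

For example, `scan_gap_penalties("AAA", "TTT", [0.25, 2.0])` gives a higher score at the small penalty, because the optimal alignment switches from three mismatched pairs to an all-gap alignment. A parametric method of the kind reviewed here would characterize such switches over the whole parameter space rather than sampling individual penalty values.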
