Approximate Statistics of Gapped Alignments

A heuristic approximation to the score distribution of gapped alignments in the logarithmic domain is presented. The method applies to comparisons between random, unrelated protein sequences, using standard score matrices and arbitrary gap penalties. It is shown that gapped alignment behavior is essentially governed by a single parameter, alpha, which depends on the penalty scheme and the sequence composition. The treatment also predicts the position of the transition point between logarithmic and linear behavior. The approximation is tested by simulation and shown to be accurate over a range of commonly used substitution matrices and gap penalties.
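The kind of simulation mentioned above can be sketched as follows. This is a minimal illustration and not the authors' procedure: it draws pairs of random, unrelated sequences and records the best gapped local alignment score for each pair, using the standard Smith-Waterman recursion with affine gap penalties. The function names `local_score` and `simulate_scores`, the uniform residue composition, the simple match/mismatch scores (standing in for a real substitution matrix), and the default gap penalties are all assumptions made for the example.

```python
import random

def local_score(a, b, match=5, mismatch=-4, gap_open=12, gap_extend=2):
    """Best local alignment score with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    The match/mismatch scores are a toy stand-in for a substitution matrix.
    """
    n = len(b)
    neg = float("-inf")
    h_prev = [0] * (n + 1)    # best score of an alignment ending at (i-1, j)
    f_prev = [neg] * (n + 1)  # same, but ending in a gap that consumes a[i-1]
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        f_cur = [neg] * (n + 1)
        e = neg               # ending in a gap that consumes b[j-1]
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best

def simulate_scores(n_pairs, length, alphabet="ACDEFGHIKLMNPQRSTVWY", seed=0):
    """Best local scores for n_pairs of random sequences.

    Residues are drawn uniformly from the alphabet (an assumption; real
    protein composition is not uniform).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        scores.append(local_score(a, b))
    return scores

if __name__ == "__main__":
    scores = simulate_scores(n_pairs=50, length=100)
    print("mean best score:", sum(scores) / len(scores))
    print("max best score:", max(scores))
```

With these toy scores the expected score of a random residue pair is negative, which is a precondition for the logarithmic regime. Whether a particular gap penalty setting actually falls in that regime would have to be checked empirically, for example by repeating the simulation at several sequence lengths and seeing how the mean best score grows.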
