The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment

The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP [BLAST for proteins], PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed up the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that, for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor.
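To make the roles of the two parameters concrete, the following minimal sketch evaluates the standard Gumbel tail approximation for local alignment scores, P(S ≥ y) ≈ 1 − exp(−k·m·n·e^(−λy)), where m and n are the sequence lengths. The function names and the numerical values in the usage example are illustrative assumptions, not taken from the paper; in particular, λ = 0.267 and k = 0.041 are the values commonly quoted for BLOSUM62 with gap costs 11/1 and should be checked against the BLAST release in use.

```python
import math


def gumbel_evalue(score, m, n, lam, k):
    """Expected number of chance local alignments with score >= `score`.

    Uses E = k * m * n * exp(-lam * score), with m and n the sequence lengths.
    """
    return k * m * n * math.exp(-lam * score)


def gumbel_pvalue(score, m, n, lam, k):
    """Gumbel tail probability P(S >= score) = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-gumbel_evalue(score, m, n, lam, k))


if __name__ == "__main__":
    # Illustrative (assumed) parameter values; see the lead-in.
    lam, k = 0.267, 0.041
    m = n = 140
    for score in (30, 40, 50):
        e = gumbel_evalue(score, m, n, lam, k)
        p = gumbel_pvalue(score, m, n, lam, k)
        print(f"score={score}  E={e:.3g}  P={p:.3g}")
```

Because k enters the E-value linearly while λ enters through the exponent, a 10% error in k changes E by 10%, whereas a relative error in λ is multiplied by λy in the exponent. This is consistent with the much tighter tolerance quoted for λ (0.8%) than for k (10%).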
