Accelerated convergence and robust asymptotic regression of the Gumbel scale parameter for gapped sequence alignment

Searches through biological databases provide the primary motivation for studying sequence alignment statistics. Other motivations include physical models of annealing processes or mathematical similarities to, e.g., first-passage percolation and interacting particle systems. Here, we investigate sequence alignment statistics, partly to explore two general mathematical methods. First, we model the global alignment of random sequences heuristically with Markov additive processes. In sequence alignment, the heuristic suggests a numerical acceleration scheme for simulating an important asymptotic parameter (the Gumbel scale parameter λ). The heuristic might apply to similar mathematical theories. Second, we extract the asymptotic parameter λ from simulation data with the statistical technique of robust regression. Robust regression is admirably suited to 'asymptotic regression' and deserves to be better known for it.
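As a rough illustration of what robust "asymptotic regression" can look like, the sketch below fits a Huber M-estimator by iteratively reweighted least squares and reads the asymptotic value off the intercept. It is not the procedure used in the paper. Everything specific here is an assumption made for the example: the finite-size model λ_n = λ + c/n, the synthetic data, the names `huber_irls`, `lam_true` and `c_true`, and the conventional Huber tuning constant 1.345.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, n_iter=100, tol=1e-12):
    """Huber M-estimate of linear regression coefficients via IRLS."""
    # Start from ordinary least squares.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ beta
        # Robust residual scale: median absolute deviation, normal-consistent.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        if scale == 0.0:
            break
        u = np.abs(r) / scale
        # Huber weights: 1 inside the threshold, delta/u outside it.
        w = delta / np.maximum(u, delta)
        sw = np.sqrt(w)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data (assumed model, for illustration only):
# lambda_n = lam_true + c_true / n + noise, with a few gross outliers.
rng = np.random.default_rng(0)
lam_true, c_true = 0.25, 1.0
n = np.linspace(50, 2000, 40)            # hypothetical sequence lengths
y = lam_true + c_true / n + rng.normal(0.0, 0.002, size=n.size)
y[[5, 20, 35]] += 0.1                    # contaminate three points

X = np.column_stack([np.ones_like(n), 1.0 / n])
lam_ols = np.linalg.lstsq(X, y, rcond=None)[0][0]   # intercept, plain OLS
lam_robust = huber_irls(X, y)[0]                    # intercept, Huber fit

print(f"OLS estimate of lambda:    {lam_ols:.5f}")
print(f"Robust estimate of lambda: {lam_robust:.5f}")
```

The intercept of the fit in 1/n plays the role of the asymptotic parameter. Down-weighting large residuals keeps a handful of aberrant simulation points from dragging that extrapolated value, which is the property that makes robust regression attractive for this kind of estimate.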
