A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
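As a rough illustration of how a constant λ simplifies significance estimation, the sketch below converts bit scores to P-values and E-values. It assumes the standard Gumbel survival function for Viterbi scores and a tail of the form exp(-λ(s - τ)) for Forward scores. The location parameters `mu` and `tau`, the database size `n_targets`, and all function names are hypothetical inputs chosen for illustration; the abstract fixes only λ = log 2 and says nothing about how the locations are obtained.

```python
import math

# Slope conjectured in the abstract for bit scores: lambda = log 2.
LAMBDA = math.log(2.0)


def viterbi_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) for a Gumbel distribution with location mu (assumed known)."""
    x = -lam * (score - mu)
    if x > 700.0:
        # exp(x) would overflow; the survival probability is 1 to machine precision.
        return 1.0
    # -expm1(-y) evaluates 1 - exp(-y) accurately when y is tiny.
    return -math.expm1(-math.exp(x))


def forward_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) for an exponential tail with location tau (assumed known).

    Only meaningful in the high-scoring tail; capped at 1 below tau.
    """
    if score <= tau:
        return 1.0
    return math.exp(-lam * (score - tau))


def evalue(pvalue, n_targets):
    """Expected number of hits at least this good among n_targets comparisons."""
    return n_targets * pvalue


if __name__ == "__main__":
    # Hypothetical numbers: a Forward score 10 bits above tau, 1000 targets.
    p = forward_pvalue(score=10.0, tau=0.0)
    print(p, evalue(p, n_targets=1000))
```

With λ = log 2, each additional bit of score halves the tail probability in the high-scoring regime, so under these assumptions only the location parameter would have to be determined per model.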
