Significance of Gapped Sequence Alignments

Measuring the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To approximate alignment score significance precisely, we draw random samples directly from a well-chosen importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^(-1314). Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
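
The general idea of importance sampling for a rare-event tail can be illustrated with a toy problem. The sketch below is not the alignment method described above. It estimates P(X >= t) for a standard normal X by drawing from a proposal distribution centered on the rare region and reweighting each draw by the likelihood ratio. The function names, the choice of a shifted normal proposal, and the sample size are all assumptions made for illustration.

```python
import math
import random


def tail_prob_importance(t, n_samples=100_000, seed=0):
    """Estimate P(X >= t) for X ~ N(0, 1) using draws from N(t, 1).

    Toy illustration only: the proposal N(t, 1) is an assumed choice,
    not the importance-sampling distribution used for alignments.
    Returns (estimate, standard_error).
    """
    rng = random.Random(seed)
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        y = rng.gauss(t, 1.0)  # draw from the proposal, centered on the tail
        if y >= t:
            # Likelihood ratio p(y) / q(y) for N(0, 1) versus N(t, 1):
            # exp(-y^2 / 2) / exp(-(y - t)^2 / 2) = exp(-t * y + t^2 / 2)
            w = math.exp(-t * y + 0.5 * t * t)
        else:
            w = 0.0  # draws outside the event contribute nothing
        total += w
        total_sq += w * w
    mean = total / n_samples
    var = max(total_sq / n_samples - mean * mean, 0.0)
    return mean, math.sqrt(var / n_samples)


def tail_prob_exact(t):
    """Closed-form P(X >= t) for X ~ N(0, 1), used only as a check."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


if __name__ == "__main__":
    estimate, stderr = tail_prob_importance(6.0)
    print(f"importance sampling: {estimate:.3e} +/- {stderr:.1e}")
    print(f"closed form:         {tail_prob_exact(6.0):.3e}")
```

Naive sampling from N(0, 1) would almost never produce a draw beyond t = 6 in 100,000 tries, whereas the reweighted draws yield an estimate with a usable standard error. The same principle is what makes p-values far too small for direct simulation accessible.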
