New finite-size correction for local alignment score distributions

Background

Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a “finite-size” correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score.

Findings

We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences.

Conclusions

The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site (http://blast.ncbi.nlm.nih.gov).
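
The contrast described above, between correcting with the mean length and correcting with the length distribution, can be illustrated with a small numerical sketch. It is not the formula used in BLAST+ or derived in the paper. It assumes the standard Karlin-Altschul form E = K · area · exp(−λS), and it models the alignment length L as a normal variable with hypothetical parameters `mean_len` and `sd_len`. The function names and all numeric values are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of Normal(mu, sd) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def area_from_mean(m, n, mean_len):
    """Effective search area using only the mean alignment length."""
    return max(m - mean_len, 0.0) * max(n - mean_len, 0.0)


def area_from_distribution(m, n, mean_len, sd_len, steps=2000):
    """E[(m - L)^+ (n - L)^+] for L ~ Normal(mean_len, sd_len).

    ASSUMPTION: the normal model for L is a stand-in chosen for
    illustration. Integrated with the trapezoid rule over +/- 8 sd;
    negative lengths are clamped to zero.
    """
    lo = mean_len - 8.0 * sd_len
    hi = mean_len + 8.0 * sd_len
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        length = max(x, 0.0)
        weight = 0.5 if i in (0, steps) else 1.0
        area = max(m - length, 0.0) * max(n - length, 0.0)
        total += weight * area * normal_pdf(x, mean_len, sd_len)
    return total * h


def evalue(score, area, lam, K):
    """Karlin-Altschul style E-value for a given effective area."""
    return K * area * math.exp(-lam * score)


# Illustrative values only (hypothetical, not taken from the paper).
lam, K = 0.267, 0.041
score, mean_len, sd_len = 60.0, 50.0, 10.0

# Long sequences: the two corrections nearly agree.
long_mean = area_from_mean(1000, 1000, mean_len)
long_dist = area_from_distribution(1000, 1000, mean_len, sd_len)

# Short query (m = 40 < mean_len): the mean-based area collapses to zero,
# which is where an ad hoc substitute length would otherwise be needed.
short_mean = area_from_mean(40, 1000, mean_len)
short_dist = area_from_distribution(40, 1000, mean_len, sd_len)

print(long_mean, long_dist)
print(evalue(score, short_mean, lam, K), evalue(score, short_dist, lam, K))
```

Under these assumptions the two areas differ by only about `sd_len**2` for long sequences, while for the short query the mean-based area is zero and the distribution-based area stays positive, so no substitute length is required.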