Derived distribution points heuristic for fast pairwise statistical significance estimation

Estimating the statistical significance of a pairwise sequence alignment is crucial in homology detection. A recent development in the field is the use of pairwise statistical significance as an alternative to database statistical significance. Although pairwise statistical significance has been shown to be potentially better than database statistical significance in terms of retrieval accuracy in homology detection, it is currently very time consuming, because it involves generating an empirical score distribution by aligning one sequence of the sequence pair with N random shuffles of the other sequence. A high value of N produces (statistically and potentially biologically) accurate estimates, but also consumes more time. A low value of N leads to inaccurate fitting of the score distribution, and hence to poor estimates of statistical significance. In this paper, we propose a simple heuristic, called the Derived Distribution Points (DDP) heuristic, which is designed to take into account the features of the pairwise statistical significance estimation procedure, and which has been shown to significantly improve the quality of pairwise statistical significance estimates (evaluated in terms of retrieval accuracy) even when using low values of N. Alternatively, it can be thought of as speeding up pairwise statistical significance estimation for high values of N, in that comparable performance is achieved while actually using a much lower number of random shuffles. Experiments indicate that a speed-up of up to a factor of 40 over current implementations can be achieved without loss in retrieval accuracy.
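The baseline procedure that the abstract describes (align one sequence against N random shuffles of the other, fit a score distribution, read off a P-value) can be sketched as follows. This is only a minimal illustration and not the authors' implementation, and it does not implement the DDP heuristic itself, whose details are not given in the abstract. The scoring scheme (match/mismatch scores with a linear gap penalty), the choice of a Gumbel (extreme value) distribution fitted by the method of moments, and all function names are assumptions made for the example.

```python
import math
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score (assumed toy scoring, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def shuffled_scores(seq1, seq2, n_shuffles, rng):
    """Scores of seq1 aligned against n_shuffles random shuffles of seq2."""
    letters = list(seq2)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        scores.append(local_alignment_score(seq1, "".join(letters)))
    return scores


def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit (an assumption; other fits are possible)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - 0.5772156649 * beta  # Euler-Mascheroni
    return mu, beta


def pairwise_p_value(seq1, seq2, n_shuffles=200, seed=0):
    """Return (observed score, P-value) for the pair under the fitted Gumbel."""
    rng = random.Random(seed)
    observed = local_alignment_score(seq1, seq2)
    mu, beta = fit_gumbel_moments(shuffled_scores(seq1, seq2, n_shuffles, rng))
    if beta == 0:  # degenerate case: all shuffled scores were identical
        return observed, (1.0 if observed <= mu else 0.0)
    z = (observed - mu) / beta
    # P(S >= observed) = 1 - exp(-exp(-z)); expm1 keeps precision for tiny P.
    p = -math.expm1(-math.exp(min(-z, 700.0)))
    return observed, p
```

The running time is dominated by the N alignments in `shuffled_scores`, which is why lowering N without degrading the fitted distribution translates directly into a speed-up.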
