MPIPairwiseStatSig: parallel pairwise statistical significance estimation of local sequence alignment

Sequence comparison is considered a cornerstone application in bioinformatics and forms the basis of many other applications. In particular, pairwise sequence alignment is a fundamental step in numerous applications based on sequence comparison, where the typical purpose of the alignment step is homology detection, i.e., identifying related sequences. Estimating the statistical significance of a pairwise sequence alignment is crucial in homology detection. A recent development in the field is the use of pairwise statistical significance as an alternative to database statistical significance. Although pairwise statistical significance has been shown to be potentially superior to database statistical significance for homology detection (evaluated in terms of retrieval accuracy), it is currently very time-consuming, because it involves generating an empirical score distribution by aligning one sequence of the sequence pair with N random shuffles of the other sequence. In this paper, we present a parallel algorithm for pairwise statistical significance estimation, called MPIPairwiseStatSig, implemented in C using MPI. Distributing the most compute-intensive portions of the pairwise statistical significance estimation procedure across multiple processors has been shown to result in near-linear speed-ups for the application.
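
The shuffle-and-align procedure described above can be illustrated with a minimal sketch in C. It is not the MPIPairwiseStatSig implementation: the function names (`sw_score`, `block_range`, `count_hits`, `estimate_pvalue`), the toy scoring scheme (+2 match, -1 mismatch, -2 per gap position), the block partitioning of the N shuffles, and the simple Monte Carlo estimate (hits + 1) / (N + 1) are all assumptions made for illustration, since the abstract does not specify how the empirical distribution is converted into a significance value. To stay runnable without an MPI installation, the sketch loops over simulated ranks serially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Tiny LCG so each simulated rank has its own reproducible random stream. */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Local alignment score (Smith-Waterman recurrence) with an assumed toy
 * scoring scheme: +2 match, -1 mismatch, -2 per gap position.
 * Returns -1 if memory allocation fails. */
int sw_score(const char *a, const char *b)
{
    size_t n = strlen(a), m = strlen(b);
    int *prev = calloc(m + 1, sizeof *prev);
    int *curr = calloc(m + 1, sizeof *curr);
    int best = 0;
    if (!prev || !curr) { free(prev); free(curr); return -1; }
    for (size_t i = 1; i <= n; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= m; j++) {
            int s = prev[j - 1] + (a[i - 1] == b[j - 1] ? 2 : -1);
            if (prev[j] - 2 > s) s = prev[j] - 2;         /* gap in b */
            if (curr[j - 1] - 2 > s) s = curr[j - 1] - 2; /* gap in a */
            if (s < 0) s = 0;                             /* local alignment floor */
            curr[j] = s;
            if (s > best) best = s;
        }
        int *tmp = prev; prev = curr; curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}

/* Half-open block [lo, hi) of the n shuffle indices assigned to `rank`. */
void block_range(long n, int nprocs, int rank, long *lo, long *hi)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    long base = n / nprocs, rem = n % nprocs;
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

/* Work of one rank: align a against `count` random shuffles of b and count
 * how many score at least `observed`. Returns -1 on allocation failure. */
long count_hits(const char *a, const char *b, int observed,
                long count, unsigned seed)
{
    size_t len = strlen(b);
    char *buf = malloc(len + 1);
    long hits = 0;
    if (!buf) return -1;
    memcpy(buf, b, len + 1);
    for (long k = 0; k < count; k++) {
        for (size_t i = len; i > 1; i--) {  /* Fisher-Yates shuffle */
            size_t j = next_rand(&seed) % i;
            char t = buf[i - 1]; buf[i - 1] = buf[j]; buf[j] = t;
        }
        int s = sw_score(a, buf);
        if (s < 0) { free(buf); return -1; }
        if (s >= observed) hits++;
    }
    free(buf);
    return hits;
}

/* Serial stand-in for the parallel run: each loop iteration does the work
 * one process would do; the running sum plays the role of a sum reduction. */
double estimate_pvalue(const char *a, const char *b, long n, int nprocs)
{
    int observed = sw_score(a, b);
    long total = 0;
    if (observed < 0) return -1.0;
    for (int rank = 0; rank < nprocs; rank++) {
        long lo, hi;
        block_range(n, nprocs, rank, &lo, &hi);
        long hits = count_hits(a, b, observed, hi - lo, 1u + (unsigned)rank);
        if (hits < 0) return -1.0;
        total += hits;
    }
    return (total + 1.0) / (n + 1.0);
}
```

For example, `estimate_pvalue(seq1, seq2, 1000, 8)` would split 1000 shuffles into eight blocks of 125. In an actual MPI program, each process would presumably obtain its own rank with `MPI_Comm_rank`, compute only its own block, and the per-process counts could be combined with `MPI_Reduce` using `MPI_SUM`. Because the alignments of different shuffles are independent of one another, this kind of partitioning needs very little communication, which is consistent with the near-linear speed-ups reported above.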
