Parallel pairwise statistical significance estimation of local sequence alignment using Message Passing Interface library

Homology detection is a fundamental step in sequence analysis. In recent years, pairwise statistical significance has emerged as a promising alternative to database statistical significance for homology detection. Although more accurate, it is currently very time-consuming because it involves generating hundreds to thousands of alignment scores to construct the empirical score distribution. This paper presents a parallel algorithm for pairwise statistical significance estimation, called MPIPairwiseStatSig, implemented in C using the MPI library. We further apply the parallelization technique to estimate non‐conservative pairwise statistical significance using standard, sequence‐specific, and position‐specific substitution matrices, which has previously been shown to give better sequence comparison accuracy than the original pairwise statistical significance. Distributing the most compute‐intensive portions of the estimation procedure across multiple processors yields near‐linear speed‐ups for the application. The MPIPairwiseStatSig program is available for free academic use at www.cs.iastate.edu/~ankitag/MPIPairwiseStatSig.html. Copyright © 2011 John Wiley & Sons, Ltd.
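
The work distribution described above can be illustrated with a minimal sketch in C. It is not the MPIPairwiseStatSig code, and it rests on several assumptions: the empirical scores are taken to come from aligning one sequence against shuffled copies of the other, the scoring is a simplified Smith-Waterman with match/mismatch values and a linear gap penalty rather than a substitution matrix with affine gaps, and all function names (`sw_score`, `block_range`, `shuffle_copy`, `rank_work`, `all_scores`) are hypothetical. To stay runnable without an MPI installation, the sketch loops over rank numbers in plain C; in an MPI program each process would presumably obtain its rank from `MPI_Comm_rank` and the per-rank score blocks would be collected with a call such as `MPI_Gatherv`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Smith-Waterman local alignment score with a linear gap penalty
   (assumption: a simplification of real protein scoring).
   Only two rolling rows of the dynamic-programming table are kept. */
int sw_score(const char *a, const char *b, int match, int mismatch, int gap)
{
    size_t n = strlen(a), m = strlen(b);
    int *prev = calloc(m + 1, sizeof *prev);
    int *curr = calloc(m + 1, sizeof *curr);
    int best = 0;
    assert(prev && curr);
    for (size_t i = 1; i <= n; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= m; j++) {
            int s = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            if (prev[j] - gap > s) s = prev[j] - gap;         /* gap in b */
            if (curr[j - 1] - gap > s) s = curr[j - 1] - gap; /* gap in a */
            if (s < 0) s = 0;                /* local alignment floor */
            curr[j] = s;
            if (s > best) best = s;
        }
        int *tmp = prev; prev = curr; curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}

/* Block partition of n tasks over nprocs ranks: the first n % nprocs
   ranks receive one extra task each. */
void block_range(int n, int nprocs, int rank, int *start, int *count)
{
    int base = n / nprocs, extra = n % nprocs;
    *count = base + (rank < extra ? 1 : 0);
    *start = rank * base + (rank < extra ? rank : extra);
}

/* Small linear congruential generator, used so that shuffle k is the
   same no matter which rank produces it. */
static unsigned next_rand(unsigned *state)
{
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fffu;
}

/* Fisher-Yates shuffle of src into dst (dst needs strlen(src)+1 bytes).
   The slight modulo bias is ignored in this sketch. */
void shuffle_copy(const char *src, char *dst, unsigned seed)
{
    size_t n = strlen(src);
    unsigned state = seed;
    memcpy(dst, src, n + 1);
    for (size_t i = n; i > 1; i--) {
        size_t j = next_rand(&state) % i;
        char t = dst[i - 1]; dst[i - 1] = dst[j]; dst[j] = t;
    }
}

/* Work of a single rank: score a against its block of shuffled copies
   of b, writing into scores[start .. start + count - 1]. */
void rank_work(const char *a, const char *b, int n_shuffles,
               int nprocs, int rank, int *scores)
{
    int start, count;
    char *buf = malloc(strlen(b) + 1);
    assert(buf);
    block_range(n_shuffles, nprocs, rank, &start, &count);
    for (int k = start; k < start + count; k++) {
        shuffle_copy(b, buf, (unsigned)k + 1u);  /* seed = shuffle index */
        scores[k] = sw_score(a, buf, 2, -1, 2);
    }
    free(buf);
}

/* Sequential stand-in for the parallel run: visits every rank in turn.
   Under MPI each rank would run rank_work once, concurrently. */
void all_scores(const char *a, const char *b, int n_shuffles,
                int nprocs, int *scores)
{
    for (int rank = 0; rank < nprocs; rank++)
        rank_work(a, b, n_shuffles, nprocs, rank, scores);
}
```

Because each shuffle is seeded by its own index rather than by the rank that computes it, `all_scores` fills the array with the same values for any number of ranks. That property makes a parallel run easy to check against a single-process run before the scores are used to fit the empirical distribution.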
