Improved gapped alignment in BLAST

Homology search is a key tool for understanding the role, structure, and biochemical function of genomic sequences. The most popular technique for rapid homology search is BLAST, which has been in widespread use within universities, research centers, and commercial enterprises since the early 1990s. In this paper, we propose a new step in the BLAST algorithm that reduces the computational cost of searching with negligible effect on accuracy. This new step, semigapped alignment, is a compromise between the efficiency of ungapped alignment and the accuracy of gapped alignment, allowing BLAST to filter sequences accurately at lower computational cost. In addition, we propose a heuristic, restricted insertion alignment, that avoids unlikely evolutionary paths in order to reduce the cost of gapped alignment, again with negligible effect on accuracy. Combined with an optimization of the local alignment recursion, our two techniques more than double the speed of the gapped alignment stages in BLAST. We conclude that our techniques are an important improvement to the BLAST algorithm. Source code for the alignment algorithms is available for download at http://www.bsg.rmit.edu.au/iga/.
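
For orientation, the local alignment recursion that gapped alignment evaluates is commonly written in Gotoh's affine-gap form of the Smith-Waterman recursion. The sketch below is a minimal illustration of that standard recursion only. It is not the semigapped or restricted insertion algorithm proposed here, nor the BLAST implementation. The function name, the match/mismatch scores, and the gap costs are assumptions chosen for illustration; a real protein search would use a substitution matrix.

```python
def local_align_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of a and b under affine gap costs.

    Illustrative parameters (assumptions, not taken from the paper):
    gap_open is the cost of the first gapped position, gap_extend the
    cost of each additional position in the same gap.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * cols  # best score ending at (i-1, j) with a gap in b
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * cols
        f_cur = [neg_inf] * cols
        e = neg_inf            # best score ending at (i, j) with a gap in a
        for j in range(1, cols):
            # Open a new gap from the best cell, or extend an existing one.
            e = max(h_cur[j - 1] - gap_open, e - gap_extend)
            f_cur[j] = max(h_prev[j] - gap_open, f_prev[j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a score never drops below zero.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

Every cell in this recursion considers three predecessors, which is what makes gapped alignment markedly more expensive than ungapped extension along a single diagonal, and is the cost the techniques in this paper aim to reduce.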
