A Deterministic Finite Automaton for Faster Protein Hit Detection in BLAST

BLAST is the most popular bioinformatics tool and is used to run millions of queries each day. However, evaluating such queries is slow, typically taking minutes on modern workstations. Continued improvement of BLAST's algorithms and optimizations is therefore essential to reduce search times in the face of exponentially increasing collection sizes. We present an optimization to the first stage of the BLAST algorithm, hit detection, designed specifically for protein search. It produces the same results as NCBI-BLAST but performs this stage in around 59% of the time on Intel-based platforms; we also present results for other popular architectures. Overall, this saves around 15% of the total time of a typical BLAST search. Our approach uses a deterministic finite automaton (DFA), inspired by the original scheme used in the 1990 BLAST algorithm. The techniques are optimized for modern hardware, making careful use of cache-conscious approaches to improve speed. Our optimized DFA approach has been integrated into a new version of BLAST that is freely available for download at http://www.fsa-blast.org/.
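
To illustrate the general idea of DFA-based hit detection, the following is a minimal sketch and not the implementation described here. It assumes a word length of 3, reports only exact word matches (real protein BLAST also reports high-scoring neighbouring words), and ignores the cache-conscious layout. The names `build_dfa` and `find_hits` are hypothetical. Each state stands for the last two letters read from the subject sequence; each transition stores the next state and the query positions of the word just completed.

```python
from itertools import product

W = 3  # word length; an assumption matching the common protein default


def build_dfa(query, alphabet):
    """Build a DFA whose states are the (W-1)-letter prefixes of a word."""
    states = {"".join(p): i
              for i, p in enumerate(product(alphabet, repeat=W - 1))}

    # Record every query offset at which each W-letter word occurs.
    word_positions = {}
    for i in range(len(query) - W + 1):
        word_positions.setdefault(query[i:i + W], []).append(i)

    next_state = [[0] * len(alphabet) for _ in states]
    hits = [[()] * len(alphabet) for _ in states]
    for prefix, s in states.items():
        for c, letter in enumerate(alphabet):
            word = prefix + letter
            # Reading `letter` shifts the window one position to the right.
            next_state[s][c] = states[word[1:]]
            # Exact matches only: a simplification of BLAST's neighbourhood.
            hits[s][c] = tuple(word_positions.get(word, ()))
    return states, next_state, hits


def find_hits(query, subject, alphabet):
    """Return (query_offset, subject_offset) pairs for matching words."""
    found = []
    if len(subject) < W:
        return found
    states, next_state, hits = build_dfa(query, alphabet)
    code = {letter: c for c, letter in enumerate(alphabet)}

    state = states[subject[:W - 1]]
    for j in range(W - 1, len(subject)):
        c = code[subject[j]]
        for q in hits[state][c]:
            found.append((q, j - W + 1))
        state = next_state[state][c]
    return found


if __name__ == "__main__":
    # Toy three-letter alphabet keeps the transition table small.
    print(find_hits("ACDAC", "DACDA", "ACD"))
```

The scan does one table lookup per subject letter regardless of how many words the query contains, which is why the memory layout of the transition and hit tables matters for cache behaviour.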
