Mercury BLASTN: Faster DNA Sequence Comparison using a Streaming Hardware Architecture

Motivation: Large-scale DNA sequence comparison, as implemented by BLAST and related algorithms, is one of the pillars of modern genomic analysis. One way to accelerate these computations is with a streaming architecture, in which processors are arranged in a pipeline that replicates the multistage structure of the algorithm. To achieve high performance, the processor hardware implementing the critical seed matching and ungapped extension stages of BLAST should be specialized to execute these stages as quickly as possible. However, accelerating these stages requires solving two key problems: first, the seed matching stage is not of a form that has traditionally been amenable to hardware acceleration; and second, the accelerated implementation of BLAST should retain sensitivity at least comparable to that of the original software.

Results: We describe Mercury BLASTN, an FPGA-based implementation of BLAST for DNA. Mercury BLASTN combines a Bloom filtering approach to seed matching with a modified ungapped extension algorithm to overcome barriers to placing the early stages of BLAST onto hardware. On a previous-generation FPGA hardware platform, Mercury BLASTN runs 5 to 11 times faster than NCBI BLASTN on current-generation general-purpose CPUs, with the prospect of a further eight-fold speedup on current-generation FPGAs. Moreover, its sensitivity to significant DNA sequence alignments is 99% of that observed with software NCBI BLASTN.

Availability: Academic users should contact the authors for information on acquiring a prototype of the Mercury BLASTN system.

Contact: jbuhler@cse.wustl.edu
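To make the idea of Bloom filtering for seed matching concrete, here is a minimal software sketch. It is not the Mercury BLASTN hardware design: the names `BloomFilter`, `wmers`, and `find_seed_matches`, the word length `w=11`, the filter size, and the use of salted BLAKE2b hashes are all assumptions chosen for illustration. The sketch loads the query's w-mers into a Bloom filter, streams the database past it, and confirms each filter hit with an exact table lookup, since a Bloom filter can report false positives but never false negatives.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over strings (hypothetical helper)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=i.to_bytes(2, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(
            self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item)
        )


def wmers(seq, w):
    """Yield every length-w substring of seq, left to right."""
    return (seq[i:i + w] for i in range(len(seq) - w + 1))


def find_seed_matches(query, database, w=11, num_bits=1 << 16, num_hashes=4):
    """Return (query_pos, database_pos) pairs of exact w-mer matches."""
    bloom = BloomFilter(num_bits, num_hashes)
    positions = {}  # exact table: w-mer -> list of query offsets
    for i, word in enumerate(wmers(query, w)):
        bloom.add(word)
        positions.setdefault(word, []).append(i)

    matches = []
    for j, word in enumerate(wmers(database, w)):
        if word in bloom:  # cheap prefilter; may admit false positives
            # Exact lookup discards any false positives from the filter.
            for i in positions.get(word, ()):
                matches.append((i, j))
    return matches
```

In this sketch most database w-mers are rejected by the bit-vector test alone, so the more expensive exact lookup runs only for the small fraction of positions that pass the filter.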
