Accelerator design for protein sequence HMM search

Profile Hidden Markov models (HMMs) are a powerful approach to describing biologically significant functional units, or motifs, in protein sequences. Entire databases of such models are regularly compared to large collections of proteins to recognize motifs in them. Exponentially increasing rates of genome sequencing have caused both protein and model databases to explode in size, placing an ever-increasing computational burden on users of these systems.Here, we describe an accelerated search system that exploits parallelism in a number of ways. First, the application is functionally decomposed into a pipeline, with distinct compute resources executing each pipeline stage. Second, the first pipeline stage is deployed on a systolic array, which yields significant fine-grained parallelism. Third, for some instantiations of the design, parallel copies of the first pipeline stage are used, further increasing the level of coarse-grained parallelism.A naïve parallelization of the first stage computation has serious repercussions for the sensitivity of the search. We present a pair of remedies to this dilemma and quantify the regions of interest within which each approach is most effective. Analytic performance models are used to assess the overall speedup that can be attained relative to a single-processor software solution. Performance improvements of 1 to 2 orders of magnitude are predicted.

[1]  Bertil Schmidt,et al.  Hyper customized processors for bio-sequence database scanning on FPGAs , 2005, FPGA '05.

[2]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[3]  T. Speed,et al.  Biological Sequence Analysis , 1998 .

[4]  Bertil Schmidt,et al.  High performance biosequence database scanning on reconfigurable platforms , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[5]  Roger D. Chamberlain,et al.  Streaming Data from Disk Store to Application , 2005 .

[6]  Martin C. Herbordt,et al.  Families of FPGA-based algorithms for approximate string matching , 2004 .

[7]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[8]  Dzung T. Hoang,et al.  Searching genetic databases on Splash 2 , 1993, [1993] Proceedings IEEE Workshop on FPGAs for Custom Computing Machines.

[9]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[10]  Patrick Crowley,et al.  Exploiting coarse-grained parallelism to accelerate protein motif finding with a network processor , 2005, 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05).

[11]  Sun-Yuan Kung,et al.  Hidden Markov models for character recognition , 1992, IEEE Trans. Image Process..

[12]  Qiong Zhang,et al.  An FPGA-based Search Engine for Unstructured Database , 2003 .

[13]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[14]  Akihiko Konagaya,et al.  High Speed Homology Search with FPGAs , 2001, Pacific Symposium on Biocomputing.

[15]  Pat Hanrahan,et al.  ClawHMMER: A Streaming HMMer-Search Implementatio , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[16]  Pat Hanrahan,et al.  ClawHMMER: A Streaming HMMer-Search Implementation , 2005, SC.

[17]  Martin C. Herbordt,et al.  Families of FPGA-based algorithms for approximate string matching , 2004, Proceedings. 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004..

[18]  Dirk Ourston,et al.  Applications of hidden Markov models to detecting multi-stage network attacks , 2003, 36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the.

[19]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[20]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.