Efficient data access for parallel BLAST

Searching biological sequence databases is one of the most routine tasks in computational biology, and it is significantly hampered by the exponential growth in sequence database sizes. Recent advances in parallelizing biological sequence search applications have enabled bioinformatics researchers to use high-performance computing platforms and, as a result, greatly reduce the execution time of their sequence database searches. However, existing parallel sequence search tools have focused mostly on parallelizing the sequence alignment engine. While the computation-intensive alignment tasks become cheaper with larger machines, the data-intensive initial preparation and result merging tasks become more expensive. Inefficient handling of input and output data can easily create performance bottlenecks, even on supercomputers, and it also causes considerable data management overhead. In this paper, we present a set of techniques for efficient and flexible data handling in parallel sequence search applications. We demonstrate our optimizations by improving mpiBLAST, an open-source parallel BLAST tool that is rapidly gaining popularity. These optimization techniques aim at enabling flexible database partitioning, reducing I/O by caching small auxiliary files and results, enabling parallel I/O on shared files, and employing scalable result processing protocols. As a result, we reduce mpiBLAST users' operational overhead by removing the requirement of pre-partitioning databases. Meanwhile, our experiments show that these techniques can bring an order-of-magnitude improvement to both the overall performance and scalability of mpiBLAST.
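
To make the idea of flexible partitioning without pre-partitioned database files more concrete, the following is a minimal sketch, not the mpiBLAST implementation. It assumes a hypothetical index of per-sequence byte lengths for one shared database file, and computes for each worker a contiguous range of sequences with a roughly equal share of bytes. The function name `virtual_partitions` and the tuple layout are illustrative assumptions; each worker could then read its own byte range from the shared file.

```python
from bisect import bisect_left
from itertools import accumulate


def virtual_partitions(seq_lengths, num_workers):
    """Split sequences into contiguous ranges of near-equal byte size.

    Hypothetical helper: returns one tuple per worker,
    (first_seq, stop_seq, byte_offset, byte_length), where stop_seq is
    exclusive. Ranges always break on sequence boundaries.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")

    # ends[i] is the byte offset just past sequence i in the shared file.
    ends = list(accumulate(seq_lengths))
    total = ends[-1] if ends else 0
    count = len(seq_lengths)

    parts = []
    start_seq = 0
    for w in range(1, num_workers + 1):
        if w == num_workers:
            stop_seq = count  # last worker takes whatever remains
        else:
            # Cut after the first sequence whose end reaches this worker's
            # cumulative byte target.
            target = total * w // num_workers
            stop_seq = bisect_left(ends, target) + 1
        stop_seq = min(max(stop_seq, start_seq), count)

        byte_start = ends[start_seq - 1] if start_seq > 0 else 0
        byte_end = ends[stop_seq - 1] if stop_seq > 0 else 0
        parts.append((start_seq, stop_seq, byte_start, byte_end - byte_start))
        start_seq = stop_seq
    return parts


if __name__ == "__main__":
    for part in virtual_partitions([10, 20, 30, 40], 2):
        print(part)
```

Because the ranges are computed at run time, the number of workers can change between runs without regenerating any database fragments. Some workers receive empty ranges when there are more workers than sequences.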
