ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

Genes in an organism's DNA (genome) encode information about proteins, the molecules that do most of a cell's work. A typical bacterial genome contains on the order of 5,000 genes, and mammalian genomes can contain tens of thousands. For each sequenced genome, the challenge is to identify the protein components (proteome) that are actively used under a given set of conditions. Fundamentally, sequence alignment is a sequence-matching problem focused on unlocking protein information embedded in the genetic code, making it possible to assemble a "tree of life" by comparing new sequences against all sequences from known organisms. However, the memory footprint of sequence data is growing more rapidly than per-node core memory. Despite years of research and development, high-performance sequence alignment applications either do not scale well, cannot accommodate very large databases in core, or require special hardware. We have developed a high-performance sequence alignment application, ScalaBLAST, which accommodates very large databases and scales linearly to as many as thousands of processors on both distributed-memory and shared-memory architectures, representing a substantial improvement in scaling and portability over the current state of the art in high-performance sequence alignment. ScalaBLAST relies on a collection of techniques (distributing the target database over available memory, multilevel parallelism to exploit concurrency, parallel I/O, and latency hiding through data prefetching) to achieve high performance and scalability. This demonstrated approach of database sharing combined with effective task scheduling should have broad-ranging applications to other informatics-driven sciences.
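
As a rough illustration of two of the techniques named above, distributing the target database over available memory and hiding fetch latency through prefetching, the following Python sketch may help. It is not ScalaBLAST's implementation: the function names (`partition`, `fetch`, `toy_score`, `best_hit`) are hypothetical, the "remote" read is simulated with an in-process list lookup, and the scoring function is a toy shared k-mer count rather than the BLAST statistic. Multilevel parallelism and parallel I/O are not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, n_owners):
    """Split the target database into n_owners contiguous shards,
    one per (hypothetical) process, so no single node holds it all."""
    size, extra = divmod(len(database), n_owners)
    shards, start = [], 0
    for rank in range(n_owners):
        end = start + size + (1 if rank < extra else 0)
        shards.append(database[start:end])
        start = end
    return shards


def fetch(shards, owner):
    """Stand-in for a remote read of another process's shard (assumption)."""
    return shards[owner]


def toy_score(query, target, k=3):
    """Toy similarity: number of distinct k-mers shared by the two sequences.
    This is NOT the BLAST scoring scheme; it only gives the loop work to do."""
    def kmers(seq):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers(query) & kmers(target))


def best_hit(query, shards):
    """Scan every shard, requesting shard i+1 while shard i is being scored."""
    best = (0, None)
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, shards, 0)
        for owner in range(len(shards)):
            current = pending.result()          # wait for the prefetched shard
            if owner + 1 < len(shards):
                pending = pool.submit(fetch, shards, owner + 1)  # prefetch next
            for target in current:
                score = toy_score(query, target)
                if score > best[0]:
                    best = (score, target)
    return best


database = ["MKTAYIAKQR", "GAVLIMKTAY", "PPPPPPPP", "QRQISFVKSH", "AAAAKTAYI"]
shards = partition(database, 2)
hit = best_hit("MKTAYIAK", shards)
```

In this sketch the fetch of the next shard overlaps with scoring of the current one, which is the general idea behind latency hiding; in a real distributed-memory setting the fetch would be a one-sided or message-based read from another node's memory.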
