Highly efficient mapping of the Smith-Waterman algorithm on CUDA-compatible GPUs

This paper describes a multi-threaded parallel design and implementation of the Smith-Waterman (SW) algorithm on graphic processing units (GPUs) with NVIDIA corporation's Compute Unified Device Architecture (CUDA). Central to this is a divide and conquer approach which divides the computation of a whole pairwise sequence alignment matrix into multiple sub-matrices (or parallelograms) each running efficiently on the available hardware resources of the GPU in hand, with temporary intermediate data stored in global memory. Moreover, we use thread warps and padding techniques in order to decrease the cost of thread synchronization, as well as loop unrolling in order to reduce the cost of conditional branches. While intermediate data is stored in global memory for large queries, the most inner loop in our implementation will only access shared memory and registers. As a result of these optimizations, our implementation of the SW algorithm achieves a throughput ranging between 9.09 GCUPS (Giga Cell Update per Second) and 12.71 GCUPS on a single-GPU version, and a throughput between 29.46 GCUPS and 43.05 GCUPS on a quad-GPU platform. Compared with the best GPU implementation of the SW algorithm reported to date, our implementation achieves up to 46 % improvement in speed. The source code of our implementation is available in the public domain for Bioinformaticians to benefit from its performance.

[1]  Michael Farrar,et al.  Sequence analysis Striped Smith – Waterman speeds database searches six times over other SIMD implementations , 2007 .

[2]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[3]  Michael S. Farrar Optimizing Smith-Waterman for the Cell Broadband Engine , 2008 .

[4]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[5]  Giorgio Valle,et al.  CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment , 2008, BMC Bioinformatics.

[6]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[7]  Yongchao Liu,et al.  CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units , 2009, BMC Research Notes.

[8]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[9]  Torbjørn Rognes,et al.  Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors , 2000, Bioinform..

[10]  Ying Liu,et al.  A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment , 2009, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[11]  Christophe Dessimoz,et al.  SWPS3 – fast multi-threaded vectorized Smith-Waterman for IBM Cell/B.E. and ×86/SSE2 , 2008, BMC Research Notes.

[12]  Pedro Trancoso,et al.  Initial Experiences Porting a Bioinformatics Application to a Graphics Processor , 2005, Panhellenic Conference on Informatics.