The general-purpose graphics processing unit (GPGPU) continues to make significant strides in high-end computing by delivering unprecedented performance at a commodity price. However, the many-core architecture of the GPGPU currently allows only data-parallel applications to extract the full potential of the hardware. Applications that require frequent synchronization during execution gain little performance from the GPGPU, mainly because it lacks explicit hardware or software support for inter-thread communication across the entire chip.
In this paper, we design, implement, and evaluate a highly efficient software barrier that synchronizes all the thread blocks of an offloaded kernel on the GPGPU without transferring execution control back to the host processor. We show that our custom software barrier achieves a three-fold performance improvement over the existing approach, i.e., synchronization via the host processor.
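To make the idea of a software barrier concrete, the following is a minimal sketch of one common pattern, a counting barrier built on a shared atomic counter. It is written in C++ with CPU threads standing in for thread blocks, so it is an analogue under stated assumptions and not the GPGPU implementation evaluated in this paper. The names `SpinBarrier` and `run_phases` are hypothetical and chosen only for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Counting barrier (illustrative): every arrival increments a shared counter,
// then spins until the counter reaches the goal for the current phase.
// The goal grows by num_workers each phase, so the counter is never reset.
class SpinBarrier {
public:
    explicit SpinBarrier(int num_workers) : num_workers_(num_workers) {}

    // phase is 1-based; all workers must pass the same value for a given phase.
    void wait(int phase) {
        const int goal = phase * num_workers_;
        counter_.fetch_add(1, std::memory_order_acq_rel);
        while (counter_.load(std::memory_order_acquire) < goal) {
            std::this_thread::yield();  // spin until everyone has arrived
        }
    }

private:
    const int num_workers_;
    std::atomic<int> counter_{0};
};

// Hypothetical driver: each worker writes its slot, waits at the barrier,
// then checks that its neighbour's write from the same round is visible.
// Returns true if every check succeeded.
bool run_phases(int num_workers, int num_rounds) {
    SpinBarrier barrier(num_workers);
    std::vector<int> slot(num_workers, 0);
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    for (int id = 0; id < num_workers; ++id) {
        workers.emplace_back([&, id] {
            int phase = 0;
            for (int r = 1; r <= num_rounds; ++r) {
                slot[id] = r;            // produce this round's value
                barrier.wait(++phase);   // all writes of round r are done
                if (slot[(id + 1) % num_workers] != r) ok = false;
                barrier.wait(++phase);   // all reads done before the next write
            }
        });
    }
    for (auto& t : workers) t.join();
    return ok.load();
}
```

Calling `run_phases(4, 50)` runs four workers through fifty rounds without ever returning control to a coordinating thread between rounds, which is the property the barrier described above provides for thread blocks.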
To illustrate this performance benefit, we parallelize a data-serial application, specifically an optimal sequence-search algorithm called Smith-Waterman (SWat), that requires frequent barrier synchronization across the many cores of the NVIDIA GeForce GTX 280 GPGPU. Our parallelization consists of a suite of optimization techniques: optimal data layout, coalesced memory accesses, and blocked data decomposition. When coupled with our custom software-barrier implementation, these techniques achieve nearly a nine-fold speed-up over the serial implementation of SWat. We also show that our solution delivers 25× faster on-chip execution than the naïve implementation.
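The need for frequent synchronization in Smith-Waterman can be seen in a small serial sketch that fills the scoring matrix one anti-diagonal at a time. This is an illustration only: it assumes a linear gap penalty and hypothetical scoring values (`match = 2`, `mismatch = -1`, `gap = -1`), and the function name `sw_score_wavefront` is not taken from the paper. It does not reproduce the data layout, memory coalescing, or blocked decomposition described above.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns the best local-alignment score of a and b, filling the matrix
// anti-diagonal by anti-diagonal (cells with i + j == d).
int sw_score_wavefront(const std::string& a, const std::string& b,
                       int match = 2, int mismatch = -1, int gap = -1) {
    const int m = static_cast<int>(a.size());
    const int n = static_cast<int>(b.size());
    // Row 0 and column 0 stay at zero, as local alignment requires.
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    int best = 0;
    for (int d = 2; d <= m + n; ++d) {
        const int i_lo = std::max(1, d - n);
        const int i_hi = std::min(m, d - 1);
        // Each cell on diagonal d reads only diagonals d-1 and d-2, so these
        // iterations are mutually independent. A parallel version would need
        // a barrier after this inner loop (and a reduction for `best`).
        for (int i = i_lo; i <= i_hi; ++i) {
            const int j = d - i;
            const int sub = (a[i - 1] == b[j - 1]) ? match : mismatch;
            const int h = std::max({0,
                                    H[i - 1][j - 1] + sub,  // align a[i-1], b[j-1]
                                    H[i - 1][j] + gap,      // gap in b
                                    H[i][j - 1] + gap});    // gap in a
            H[i][j] = h;
            best = std::max(best, h);
        }
    }
    return best;
}
```

For sequences of lengths m and n, the outer loop runs m + n - 1 times, so a parallel version synchronizes once per anti-diagonal. That count is why the cost of each barrier matters for this kind of workload.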