Fine-Grained Synchronizations and Dataflow Programming on GPUs

The last decade has witnessed the rapid emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth in the number of GPU cores, utilizing them efficiently has become a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); it therefore offers little support for enforcing thread ordering or fine-grained synchronization. This is an obstacle when migrating algorithms that exploit fine-grained parallelism, such as dataflow algorithms, to GPUs. In this paper, we propose a novel approach to fine-grained inter-thread synchronization in the shared memory of modern GPUs. We evaluate its performance and compare it with other fine-grained and medium-grained synchronization approaches. On average, our method achieves a 1.5x speedup over the warp-barrier based approach and a 4.0x speedup over the atomic spin-lock based approach. To further explore the feasibility of fine-grained dataflow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. For a basic sub-grid, our implementation achieves a 3.56x speedup over the atomic spin-lock implementation and a 1.15x speedup over the conventional data-parallel implementation. This suggests that the fine-grained, lock-based programming pattern could be an alternative for designing general-purpose GPU (GPGPU) applications.
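The cross-loop dependencies mentioned for Needleman-Wunsch can be illustrated with a small serial sketch. This is not the GPU implementation described above; it only shows the dependency pattern under assumed scoring parameters (`match`, `mismatch`, and `gap` are illustrative defaults, and the function name is hypothetical). Each cell depends on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal are mutually independent and could in principle be computed concurrently once the previous anti-diagonals are done.

```python
def nw_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score, filled by anti-diagonals.

    Scoring values are assumptions chosen for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Boundary cells: aligning a prefix against the empty string costs gaps.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap

    # Walk the interior by anti-diagonal d = i + j. Cells sharing the same d
    # do not depend on each other, only on diagonals d - 1 and d - 2.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        for i in range(i_lo, i_hi + 1):  # independent iterations
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # upper-left: match or mismatch
                score[i - 1][j] + gap,      # upper: gap in b
                score[i][j - 1] + gap,      # left: gap in a
            )
    return score[rows - 1][cols - 1]
```

In a fine-grained scheme, the inner loop would not need to wait for a whole anti-diagonal to finish; a cell could start as soon as its three neighbours are ready, which is the kind of per-cell ordering the abstract refers to.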
