Paper Information - Fine-Grained Synchronizations and Dataflow Programming on GPUs

The last decade has witnessed the rapid emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming model assumes a single instruction stream for multiple concurrent threads (SIMT); therefore, little support is offered for enforcing thread ordering and fine-grained synchronization. This becomes an obstacle when migrating algorithms that exploit fine-grained parallelism, such as dataflow algorithms, to GPUs. In this paper, we propose a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs. We demonstrate its performance and compare it with other fine-grained and medium-grained synchronization approaches. On average, our method achieves a 1.5x speedup over the warp-barrier-based approach and a 4.0x speedup over the atomic spin-lock-based approach. To further explore the possibility of realizing fine-grained dataflow algorithms on GPUs, we apply the proposed synchronization scheme to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies. For a basic sub-grid, our implementation achieves a 3.56x speedup over the atomic spin-lock implementation and a 1.15x speedup over the conventional data-parallel implementation. This implies that the fine-grained, lock-based programming pattern could be an alternative choice for designing general-purpose GPU (GPGPU) applications.
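To make the "2D wavefront" dependency pattern concrete, the following is a minimal sequential sketch of the Needleman-Wunsch score computation, filled one anti-diagonal at a time. It is not the paper's GPU implementation and uses no synchronization primitives. The function name `nw_score_wavefront` and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not taken from the paper.

```python
def nw_score_wavefront(a: str, b: str, match: int = 1,
                       mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score, computed anti-diagonal by anti-diagonal.

    Hypothetical helper for illustration; scoring values are assumed.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Boundary cells depend only on the gap penalty.
    for i in range(rows):
        score[i][0] = i * gap
    for j in range(cols):
        score[0][j] = j * gap

    # All cells with the same d = i + j form one wavefront. Each cell reads
    # only its left, upper, and upper-left neighbours, which lie on the two
    # previous wavefronts, so cells within one wavefront are independent.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        for i in range(i_lo, i_hi + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[rows - 1][cols - 1]
```

In this sketch the inner loop over `i` is the part that could in principle run concurrently, while the outer loop over `d` expresses the ordering constraint. A data-parallel version would typically place a barrier between wavefronts, whereas a fine-grained version would let each cell wait only on its own three neighbours.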

Henk Corporaal | Ang Li | Gert-Jan van den Braak | Akash Kumar
