Paper information - Simulation and architecture improvements of atomic operations on GPU scratchpad memory


GPUs are increasingly used as compute accelerators. With a large number of cores executing an even larger number of threads, significant speed-ups can be attained for parallel workloads. Applications that rely on atomic operations, such as histogram and Hough transform, suffer from serialization of threads when they update the same memory location. Previous work shows that reducing this serialization with software techniques can increase performance by an order of magnitude. We observe, however, that some serialization remains and still slows down these applications. Therefore, this paper proposes using a hash function in the addressing of both the banks and the locks of the scratchpad memory. To measure the effects of these changes, we first implement a detailed model of atomic operations on scratchpad memory in GPGPU-Sim and verify its correctness. Second, we test our proposed hardware changes. They result in speed-ups of up to 4.9× and 1.8× for histogram and Hough transform applications, respectively, over implementations that already use the aforementioned software techniques, at minimal hardware cost.
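The abstract does not say which hash function is used, so the following is only a minimal sketch of the general idea of hashed bank addressing. It assumes 32 banks and 4-byte words and uses a simple XOR of two address fields. The names `linear_bank`, `xor_hashed_bank`, and `conflict_degree` are hypothetical and are not taken from the paper or from GPGPU-Sim.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 scratchpad banks
WORD_BYTES = 4   # assumption: 4-byte words, one word per bank per cycle


def linear_bank(addr: int) -> int:
    """Conventional mapping: the low word-address bits select the bank."""
    return (addr // WORD_BYTES) % NUM_BANKS


def xor_hashed_bank(addr: int) -> int:
    """Illustrative XOR hash (not the paper's): fold the next-higher
    address field into the bank-select bits."""
    word = addr // WORD_BYTES
    return (word ^ (word // NUM_BANKS)) % NUM_BANKS


def conflict_degree(addrs, bank_fn) -> int:
    """Largest number of distinct addresses that land in one bank.

    Accesses to different addresses in the same bank are serialized, so this
    approximates the number of passes needed. Threads that update the very
    same address still serialize under any bank mapping.
    """
    counts = Counter(bank_fn(a) for a in set(addrs))
    return max(counts.values())


# 32 threads, each touching a word 32 words apart (a worst case for the
# conventional mapping).
strided = [t * NUM_BANKS * WORD_BYTES for t in range(32)]

print(conflict_degree(strided, linear_bank))      # 32
print(conflict_degree(strided, xor_hashed_bank))  # 1
```

In this sketch the strided pattern puts all 32 addresses into bank 0 under the conventional mapping, while the XOR mapping spreads them over 32 different banks. The same kind of index hashing could in principle be applied to the selection of a lock, but how the paper does so is not described in the abstract.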
