Impact of Warp Formation on GPU Performance

The GPU is now widely used for general-purpose parallel applications as well as graphics applications. In particular, programmers can easily create a large number of threads with the help of the APIs provided by GPU vendors. In GPU architecture, threads are grouped into warps that execute on the SIMD pipeline, which yields high performance. However, when general-purpose applications contain control-flow instructions, the computational resources of the GPU are not fully utilized, resulting in performance degradation. To improve GPU performance, several warp-formation techniques for handling the branch divergence caused by control-flow instructions have been proposed. In this work, we analyze GPU performance under different warp formations using realistic GPU hardware configurations. Our simulation results show that a warp formation that provides high hardware utilization does not guarantee high performance when hardware resources are insufficient. Therefore, the hardware configuration should be considered together with hardware utilization when warp formation is used to improve GPU performance.

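To make the branch-divergence problem concrete, the following CUDA sketch (an illustrative example, not code from the paper) launches a kernel in which even- and odd-numbered lanes of each warp take different control-flow paths. On a SIMD pipeline the two paths are executed one after the other with part of the warp masked off, so lane utilization drops by roughly half inside the branch.

    #include <cuda_runtime.h>

    // Illustrative kernel: even lanes of a warp take one path, odd lanes the
    // other. The SIMD pipeline serializes the two paths, masking off the
    // inactive lanes each time -- this is branch divergence.
    __global__ void divergent(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((threadIdx.x & 1) == 0) {
            out[i] = out[i] * 2.0f;   // executed by even lanes only
        } else {
            out[i] = out[i] + 1.0f;   // executed by odd lanes only
        }
    }

    int main() {
        const int n = 1 << 20;
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        divergent<<<(n + 255) / 256, 256>>>(d, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }

Rewriting the condition so that all threads of a warp branch the same way, for example by testing threadIdx.x / warpSize instead of threadIdx.x & 1, removes the serialization without changing the per-thread work; warp-formation techniques such as dynamic warp formation attack the same problem in hardware by regrouping threads that follow the same path into new warps.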