SIMD re-convergence at thread frontiers

Hardware and compiler techniques for mapping data-parallel programs with divergent control flow to SIMD architectures have recently enabled the emergence of new GPGPU programming models such as CUDA, OpenCL, and DirectX Compute. The impact of branch divergence can be quite different depending upon whether the program's control flow is structured or unstructured. In this paper, we show that unstructured control flow occurs frequently in applications and can lead to significant code expansion when executed using existing approaches for handling branch divergence. This paper proposes a new technique for automatically mapping arbitrary control flow onto SIMD processors that relies on a concept of a Thread Frontier, which is a bounded region of the program containing all threads that have branched away from the current warp. This technique is evaluated on a GPU emulator configured to model i) a commodity GPU (Intel Sandybridge), and ii) custom hardware support not realized in current GPU architectures. It is shown that this new technique performs identically to the best existing method for structured control flow, and re-converges at the earliest possible point when executing unstructured control flow. This leads to i) between 1.5 – 633.2% reductions in dynamic instruction counts for several real applications, ii) simplification of the compilation process, and iii) ability to efficiently add high level unstructured programming constructs (e.g., exceptions) to existing data-parallel languages.

[1]  Ahmed Sameh,et al.  The Illiac IV system , 1972 .

[2]  Andreas Moshovos,et al.  Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[3]  Adam Levinthal,et al.  Chap - a SIMD graphics processor , 1984, SIGGRAPH.

[4]  Michael T. Goodrich,et al.  A bridging model for parallel computation, communication, and I/O , 1996, CSUR.

[5]  Jeffrey S. Vetter,et al.  Quantifying NUMA and contention effects in multi-GPU systems , 2011, GPGPU-4.

[6]  Tor M. Aamodt,et al.  Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[7]  Aldo Badano,et al.  Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit. , 2009, Medical physics.

[8]  Erik H. D'Hollander,et al.  Using hammock graphs to structure programs , 2004, IEEE Transactions on Software Engineering.

[9]  Takao Hatazaki Tsubame-2 - a 2.4 PFLOPS peak performance system , 2011, 2011 Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference.

[10]  Kevin Skadron,et al.  Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.

[11]  Andreas Dietrich,et al.  OptiX: a general purpose ray tracing engine , 2010, SIGGRAPH 2010.

[12]  Tor M. Aamodt,et al.  Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[13]  Hoai Bac Le,et al.  GPU Implementation of Extended Gaussian Mixture Model for Background Subtraction , 2010, 2010 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF).

[14]  Cole Trapnell,et al.  Optimizing data intensive GPGPU computations for DNA sequence alignment , 2009, Parallel Comput..

[15]  Ugo Erra,et al.  Real-Time Adaptive GPU Multiagent Path Planning , 2012 .

[16]  David A Boas,et al.  Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units. , 2009, Optics express.

[17]  Sudhakar Yalamanchili,et al.  A characterization and analysis of PTX kernels , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[18]  Sudhakar Yalamanchili,et al.  Characterization and Transformation of Unstructured Control Flow in GPU Applications , 2011 .

[19]  David K. McAllister,et al.  OptiX: a general purpose ray tracing engine , 2010, ACM Trans. Graph..

[20]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[21]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).