Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

A modern GPU achieves high throughput by running many large thread groups. Within these groups, threads are organized into fixed-size SIMD batches that execute the same instruction on vectors of data in lockstep. This architecture suits applications with a high degree of data parallelism, but its performance degrades severely when divergence occurs. Many divergence optimizations have been proposed, and their effectiveness depends on the divergence information available for variables and branches. A previous analysis scheme treated pointers and function return values as divergent outright and targeted only OpenCL 1.x. In this article, we present a novel scheme that reports divergence information for pointer-intensive OpenCL programs. The approach is based on extended static single assignment (SSA) form, augmented with special functions and annotations drawn from memory SSA and gated SSA. The proposed scheme first constructs the extended SSA form, which is then used to build a divergence relation graph that captures all possible points-to relationships of the pointers together with their initialized divergence states. The divergence state of each pointer is determined by propagating divergence states through the graph. The scheme is further extended to interprocedural cases by handling function-related statements. We implemented the proposed scheme in the LLVM compiler, where it can be applied to OpenCL programs. We analyzed 10 programs with 24 kernels, totaling 1,306 instructions in the LLVM intermediate representation, with 885 variables, 108 branches, and 313 pointer-related statements. The proposed scheme detected 146 divergent pointers, compared with 200 for a scheme in which every pointer is treated as divergent and 155 for the current LLVM default scheme; the corresponding numbers of divergent variables were 458, 519, and 482, with 31, 34, and 32 divergent branches, respectively. These experimental results indicate that the proposed scheme is more precise than both a scheme in which a pointer is always divergent and the current LLVM default scheme.
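The core propagation step described above can be illustrated with a minimal sketch. The graph layout, node names, and seed set below are hypothetical, not taken from the paper: edges stand in for def-use and points-to relationships in the divergence relation graph, and the seed models a thread-id-derived value (e.g. `get_global_id(0)`), which is divergent by definition.

```python
from collections import deque

def propagate_divergence(edges, seeds):
    """Propagate the 'divergent' state along divergence relation graph edges.

    edges : dict mapping each node to the nodes whose divergence state it
            influences (standing in for def-use and points-to edges).
    seeds : nodes initially marked divergent (e.g. thread-id values).
    Returns the set of nodes marked divergent at the fixed point.
    """
    divergent = set(seeds)
    worklist = deque(seeds)
    while worklist:
        node = worklist.popleft()
        for succ in edges.get(node, ()):
            if succ not in divergent:
                divergent.add(succ)
                worklist.append(succ)
    return divergent

# Hypothetical kernel fragment:
#   tid = get_global_id(0); p = &a[tid]; v = *p; c = 5
edges = {
    "tid": ["p"],   # pointer computed from the divergent thread id
    "p":   ["v"],   # value loaded through a divergent pointer
    "c":   [],      # uniform constant, no divergent inputs
}
print(sorted(propagate_divergence(edges, {"tid"})))  # prints ['p', 'tid', 'v']
```

Because each node enters the worklist at most once, the propagation reaches a fixed point in time linear in the size of the graph; `c` stays uniform since no divergent state flows into it.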
