Divergence Analysis and Optimizations

The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step and may lose performance due to divergences caused by conditional branches. In the face of divergences, some PEs execute while others wait, and this alternation ends when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same values for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers manually improve their SIMD applications, and it guides the compiler in the optimization of SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene sequencing algorithm, chains of similarities between divergent program paths, and weaves these paths together as much as possible. Our implementation has been accepted into the Ocelot open-source CUDA compiler and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and the Nvidia SDK. Our divergence analysis has a 34% false-positive rate compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up to parallel quick sort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.
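To make the idea of divergence analysis concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a hypothetical toy IR in which each instruction is a pair of a destination variable and its operands, and it treats a variable as possibly divergent if it depends, directly or transitively, on a seed such as the thread identifier. The names `divergent_variables`, `tid`, and the example program are illustrative assumptions. The sketch tracks data dependences only; a realistic analysis would also have to account for control dependences.

```python
from typing import List, Set, Tuple

# Hypothetical toy IR: (destination, [operands]). Not the Ocelot/PTX format.
Instr = Tuple[str, List[str]]


def divergent_variables(program: List[Instr], seeds: Set[str]) -> Set[str]:
    """Return the variables that may hold different values on different PEs.

    A variable is marked divergent if any of its operands is divergent.
    The loop iterates until no new variable is marked (a fixed point).
    Only data dependences are modeled in this simplified sketch.
    """
    divergent = set(seeds)
    changed = True
    while changed:
        changed = False
        for dest, operands in program:
            if dest not in divergent and any(op in divergent for op in operands):
                divergent.add(dest)
                changed = True
    return divergent


# Illustrative kernel fragment: `tid` differs per PE, the rest is shared.
program: List[Instr] = [
    ("i", ["tid", "block_size"]),   # i depends on the thread id
    ("n", ["param_n"]),             # n comes from a kernel parameter
    ("limit", ["n", "const_2"]),    # limit depends only on uniform values
    ("p", ["i", "limit"]),          # predicate mixing divergent and uniform
    ("q", ["limit", "const_0"]),    # predicate on uniform values only
]

result = divergent_variables(program, seeds={"tid"})
uniform = {dest for dest, _ in program} - result
```

In this sketch a branch on `p` could diverge, whereas a branch on `q` could not, since every PE computes the same value for `q`.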
