Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops

Iteration Point Difference Analysis is a new static analysis framework that determines the memory coalescing characteristics of parallel loops targeted for GPU offloading, and establishes whether loop transformations intended to improve their memory access characteristics are both safe and profitable. The analysis propagates definitions through control flow, handles non-affine expressions, and can analyze expressions that reference conditionally defined values. Experimental results demonstrate the potential for large performance improvements. On an Nvidia P100, GPU kernel execution time across the Polybench suite improves by up to 25.5×, with an overall benchmark improvement of up to 3.2×. An opportunity detected in a SPEC ACCEL benchmark yields a kernel speedup of 86.5× and a benchmark improvement of 3.3×. This work also demonstrates how architecture-aware compilers improve code portability and reduce programmer effort.
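To make the notion of coalescing concrete, the following is a minimal sketch of the underlying idea: compare the array index touched by adjacent iterations of the parallel loop. It is not the paper's analysis, which is static and symbolic; this sketch merely samples the difference numerically. The names `classify`, `row_major`, `col_major`, and `inner_only`, the three category labels, and the sampling bounds are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical helper: index(i, j) is the flattened array index accessed when
# the parallel (thread-mapped) loop variable is i and a sequential inner loop
# variable is j.
IndexFn = Callable[[int, int], int]


def classify(index: IndexFn, n_outer: int = 8, n_inner: int = 8) -> str:
    """Classify an access by the index difference between adjacent
    parallel iterations (i and i + 1), sampled over a small range."""
    diffs = {
        index(i + 1, j) - index(i, j)
        for i in range(n_outer - 1)
        for j in range(n_inner)
    }
    if diffs == {0}:
        return "uniform"      # all adjacent threads read the same element
    if diffs == {1}:
        return "coalesced"    # adjacent threads touch adjacent elements
    return "uncoalesced"      # any other stride, or a varying stride


N = 1024  # assumed row length of a 2-D array stored in row-major order


def row_major(i: int, j: int) -> int:
    return i * N + j          # A[i][j]: adjacent threads are N elements apart


def col_major(i: int, j: int) -> int:
    return j * N + i          # A[j][i]: adjacent threads are 1 element apart


def inner_only(i: int, j: int) -> int:
    return j                  # B[j]: index does not depend on the thread


if __name__ == "__main__":
    for fn in (row_major, col_major, inner_only):
        print(fn.__name__, classify(fn))
```

Under these assumptions, an access such as `A[i][j]` with `i` as the parallel variable is the kind of pattern a transformation like loop interchange might improve, provided the transformation is shown to be safe.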
