GPGPU Memory Estimation and Optimization Targeting OpenCL Architecture

The enormous computational power available in modern graphics processing units (GPUs) has enabled their wide use for general-purpose applications. However, manual development of high-performance parallel code for GPUs is still very challenging. To fully exploit the capability of GPUs for general-purpose computing on heterogeneous processing platforms, we propose performance estimation and optimization methods targeting the OpenCL architecture. Our approach uses a polyhedral representation of the source program to optimize and allocate the global memory and fast memory of GPUs. By checking the memory access patterns of the program, we identify access instances that can be grouped together using graph coloring. We then estimate the memory performance of the program, with the aim of eliminating uncoalesced global memory accesses. Next, we apply data space transformation to alter irregular memory access patterns, improving off-chip memory bandwidth by taking advantage of vector data types. In addition, we detect reuse information and allocate data to distinct fast memory regions according to both the properties of the data accesses and the characteristics of the OpenCL memory model, in order to make the best use of the fast on-chip memory. Experimental results on an AMD/ATI HD5850 GPU for a set of commonly used benchmarks show that we achieve 2.1x to 6.7x speedup over the unoptimized versions, and that the presented global memory performance model estimates global memory performance relatively accurately.
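To make the idea of uncoalesced access and data space transformation concrete, the following is a minimal sketch under a toy model, not the performance model or the transformation proposed here. It assumes that a group of work-items is served in aligned memory segments and simply counts how many segments one access touches, before and after transposing the array layout. The segment size, group size, array shape, and all function names are hypothetical choices made for illustration.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count distinct aligned segments covered by a list of byte addresses.

    segment_bytes is an assumed value for this toy model only.
    """
    return len({addr // segment_bytes for addr in addresses})


def row_major_address(row, col, num_cols, elem_bytes=4):
    """Byte address of element (row, col) in a row-major 2D array."""
    return (row * num_cols + col) * elem_bytes


NUM_ROWS, NUM_COLS = 64, 64  # hypothetical array shape
GROUP = 16                   # hypothetical number of work-items served together

# Original pattern: work-item i reads A[i][0], a stride of NUM_COLS elements.
strided = [row_major_address(i, 0, NUM_COLS) for i in range(GROUP)]

# After transposing the data space, the same logical element sits at A_t[0][i],
# so neighbouring work-items read neighbouring addresses.
transformed = [row_major_address(0, i, NUM_ROWS) for i in range(GROUP)]

strided_segments = segments_touched(strided)
transformed_segments = segments_touched(transformed)
print(strided_segments, transformed_segments)
```

Under these assumptions the strided pattern touches 16 segments while the transposed layout touches 1, which is the kind of gap that a layout change aimed at coalescing tries to close.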
