Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

The introduction of general-purpose computation on GPUs (GPGPU) has changed the landscape of parallel computing. At the core of this shift are massively multithreaded, data-parallel architectures that deliver impressive speedups, offering low-cost supercomputing with attractive power budgets. Despite the numerous benefits provided by GPGPU, a number of barriers still delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration depends heavily on using the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for improving the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation on vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection on scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of the proposed methods with kernels from a wide range of benchmark suites, as illustrated by the sketch below. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on the respective platforms) by applying our proposed methodology.
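
As a concrete illustration of the data-transformation idea, the following is a minimal CUDA sketch (not the paper's implementation; the `Body` struct and the `scaleAoS`/`scaleSoA` kernels are hypothetical) of an array-of-structures (AoS) to structure-of-arrays (SoA) layout change. With AoS, consecutive threads access addresses a full struct apart; with SoA, they access consecutive floats, so a warp's loads can be serviced with fewer, wider memory transactions, which is the scalar-GPU analog of a vectorizable load.

```cuda
// Illustrative sketch only: same loop body over an AoS layout and an SoA layout.
#include <cuda_runtime.h>
#include <cstdio>

struct Body {            // AoS element: 16 bytes per body
    float x, y, z, m;
};

__global__ void scaleAoS(Body *bodies, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads in a warp read x fields 16 bytes apart -> strided accesses.
        bodies[i].x *= s;
        bodies[i].y *= s;
        bodies[i].z *= s;
    }
}

__global__ void scaleSoA(float *x, float *y, float *z, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Threads in a warp read consecutive floats -> coalesced accesses.
        x[i] *= s;
        y[i] *= s;
        z[i] *= s;
    }
}

int main() {
    const int n = 1 << 20;
    const float s = 0.5f;

    Body  *bodies;
    float *x, *y, *z;
    cudaMalloc(&bodies, n * sizeof(Body));
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMalloc(&z, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    scaleAoS<<<grid, block>>>(bodies, s, n);
    scaleSoA<<<grid, block>>>(x, y, z, s, n);
    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(bodies); cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```

On a vector-based architecture the same layout change lets the compiler or programmer issue packed vector loads over the contiguous per-field arrays; on a scalar-based architecture it improves coalescing, which is why the transformation benefits both platform classes discussed above.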
