VLAG: A very fast locality approximation model for GPU kernels with regular access patterns

Performance modeling plays an important role in hardware design and in optimizing application implementations. This paper presents a very low-overhead performance model, called VLAG, that approximates the data locality exploited by GPU kernels. VLAG takes source-code-level information as input and estimates locality at three granularities within a GPU kernel: per memory-access instruction, per data array, and per kernel. VLAG is applicable only to kernels with regular memory access patterns. VLAG was evaluated experimentally on an NVIDIA Maxwell GPU. For two different matrix multiplication (MM) kernels, the average errors were 7.68% and 6.29%, respectively. The measured slowdown of VLAG for MM was 1.4X, which is negligible compared with other approaches such as trace-driven simulation.
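To illustrate why regular access patterns lend themselves to fast, source-level locality estimation, the following minimal sketch counts the distinct cache lines touched by one warp when each thread reads an array element at a fixed stride. It is not VLAG's algorithm, which the abstract does not describe. The function name `lines_touched_by_warp`, the 128-byte line size, the 4-byte element size, and the matrix width `N` are all assumptions chosen for illustration.

```python
def lines_touched_by_warp(stride_elems, base_elem=0, warp_size=32,
                          elem_bytes=4, line_bytes=128):
    """Count distinct cache lines touched when thread t of a warp
    accesses element (base_elem + stride_elems * t).

    All parameters are illustrative assumptions, not values from the paper.
    """
    lines = {
        ((base_elem + stride_elems * t) * elem_bytes) // line_bytes
        for t in range(warp_size)
    }
    return len(lines)


N = 1024  # hypothetical matrix width, in elements

# Naive MM inner loop, threads in a warp differ only in `col`:
#   A[row * N + k] -> same element for every thread (stride 0)
#   B[k * N + col] -> consecutive elements (stride 1)
a_lines = lines_touched_by_warp(stride_elems=0)
b_lines = lines_touched_by_warp(stride_elems=1)

# If threads in a warp differed in `row` instead, A would be read
# with a stride of N elements between neighbouring threads.
strided_lines = lines_touched_by_warp(stride_elems=N)

print(a_lines, b_lines, strided_lines)
```

Because the index expressions are affine in the thread index, the number of lines touched follows directly from the stride without tracing any execution; this is the kind of property that a source-level model restricted to regular access patterns can rely on.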
