GPUs Cache Performance Estimation using Reuse Distance Analysis

GPU architects introduced on-chip memories to provide local storage close to the processing cores and reduce traffic to device global memory. Since then, modeling to predict cache performance has been an active area of research; however, the complexity of this highly parallel hardware makes it far from straightforward. In this paper, we propose a memory model to predict the performance of the entire GPU cache hierarchy (L1 and L2). Our model is based on reuse distance: we apply an analytical probabilistic treatment of the reuse distance distributions extracted from an application's memory trace to predict hit rates. The memory trace is collected with NVIDIA's SASSI instrumentation tool. We evaluate 20 kernels from the Polybench and Rodinia benchmark suites and compare the model against real hardware. The results show an average prediction accuracy of 86.7% across all kernels, with higher accuracy for L2 (95.26%) than for L1. Extracting the memory trace is on average 4.9x slower than running the kernels without instrumentation, an overhead much smaller than in other published results. Finally, the model is flexible: it takes the different cache parameters into account, so it can be used for design space exploration and sensitivity analysis.
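To illustrate the core idea behind reuse-distance-based hit-rate prediction, the sketch below computes LRU stack (reuse) distances over a toy address trace and derives a hit rate for a given cache capacity. This is a minimal, sequential, fully-associative LRU illustration only; the paper's actual model is probabilistic and accounts for GPU-specific cache parameters, which this sketch does not attempt to reproduce.

```python
from collections import OrderedDict

def reuse_distances(trace):
    """For each access, return its reuse distance: the number of distinct
    addresses touched since the previous access to the same address.
    Cold (first-time) accesses get a distance of infinity."""
    stack = OrderedDict()  # LRU stack; most recently used address is last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack.keys())
            # distinct addresses above `addr` on the LRU stack
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(float('inf'))
            stack[addr] = None
    return dists

def predicted_hit_rate(dists, cache_lines):
    """In a fully-associative LRU cache, an access hits iff its reuse
    distance is strictly less than the number of cache lines."""
    hits = sum(1 for d in dists if d < cache_lines)
    return hits / len(dists)
```

For the trace `a b a c b a`, the distances are `[inf, inf, 1, inf, 2, 2]`, so a 3-line cache is predicted to hit on half of the accesses. A full model would build a distribution (histogram) of these distances and evaluate hit probability per cache configuration rather than per trace replay.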