Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis

Reuse distance analysis (RDA) is a popular method for calculating locality profiles and modeling cache performance. The present article proposes a framework to apply the RDA algorithm to obtain reuse distance profiles in graphics processing unit (GPU) kernels. To study the implications of hardware-related parameters in RDA, two RDA algorithms were employed, including a high-level cache-independent RDA algorithm, called HLRDA, and a detailed RDA algorithm, called DRDA. DRDA models the effects of reservation fails in cache blocks and miss status holding registers to provide accurate cache-related performance metrics. In this case, the reuse profiles are cache-specific. In a selection of GPU kernels, DRDA obtained the L1 miss-rate breakdowns with an average error of 3.86% and outperformed the state-of-the-art RDA in terms of accuracy. In terms of performance, DRDA is 246,000× slower than the real GPU executions and 11× faster than GPGPU-Sim. HLRDA ignores the cache-related parameters and its obtained reuse profiles are general, which can be used to calculate miss rates in all cache sizes. Moreover, the average error incurred by HLRDA was 16.9%.

[1]  Hsien-Hsin S. Lee,et al.  GPUMech: GPU Performance Modeling Technique Based on Interval Analysis , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[2]  Mohamed Zahran,et al.  SACAT: Streaming-Aware Conflict-Avoiding Thrashing-Resistant GPGPU Cache Management Scheme , 2017, IEEE Transactions on Parallel and Distributed Systems.

[3]  Yang Yang,et al.  A Highly Parallel Reuse Distance Analysis Algorithm on GPUs , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[4]  Chen Ding,et al.  A Composable Model for Analyzing Locality of Multi-threaded Programs , 2009 .

[5]  Xinxin Mei,et al.  Dissecting GPU Memory Hierarchy Through Microbenchmarking , 2015, IEEE Transactions on Parallel and Distributed Systems.

[6]  Richard W. Vuduc,et al.  A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.

[7]  Donald Yeung,et al.  Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis , 2012, MSPC '12.

[8]  Hyesoon Kim,et al.  An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness , 2009, ISCA '09.

[9]  Derek L. Schuff,et al.  Multicore-aware reuse distance analysis , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[10]  Yang Zhang,et al.  Locality based warp scheduling in GPGPUs , 2018, Future Gener. Comput. Syst..

[11]  Xipeng Shen,et al.  Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? , 2010, CC.

[12]  Dongwei Wang,et al.  A reuse distance based performance analysis on GPU L1 data cache , 2016, 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC).

[13]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[14]  Henk Corporaal,et al.  A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[15]  David Eklov,et al.  StatStack: Efficient modeling of LRU caches , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[16]  Rachata Ausavarungnirun,et al.  Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[18]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[19]  Kristof Beyls,et al.  Reuse Distance as a Metric for Cache Behavior. , 2001 .

[20]  Tao Tang,et al.  Cache Miss Analysis for GPU Programs Based on Stack Distance Profile , 2011, 2011 31st International Conference on Distributed Computing Systems.

[21]  Kyu Yeun Kim,et al.  Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing , 2016, Microprocess. Microsystems.

[22]  Wentao Chang,et al.  Sampling-based program locality approximation , 2008, ISMM '08.

[23]  C. Cascaval,et al.  Calculating stack distances efficiently , 2003, MSP '02.

[24]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[25]  Donald Yeung,et al.  Identifying Power-Efficient Multicore Cache Hierarchies via Reuse Distance Analysis , 2016, TOCS.

[26]  David R. Kaeli,et al.  Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures , 2011, IEEE Transactions on Parallel and Distributed Systems.

[27]  Donald Yeung,et al.  Studying multicore processor scaling via reuse distance analysis , 2013, ISCA.

[28]  Jungwon Kim,et al.  A Performance Model for GPUs with Caches , 2015, IEEE Transactions on Parallel and Distributed Systems.

[29]  Franz Franchetti,et al.  Accelerating Architectural Simulation Via Statistical Techniques: A Survey , 2016, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[30]  Karsten Schwan,et al.  A framework for dynamically instrumenting GPU compute applications within GPU Ocelot , 2011, GPGPU-4.

[31]  Yu Wang,et al.  Optimizing Cache Bypassing and Warp Scheduling for GPUs , 2018, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[32]  Wen-mei W. Hwu,et al.  Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors , 2012, PPoPP '12.

[33]  Chao Li,et al.  A model-driven approach to warp/thread-block level GPU cache bypassing , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[34]  YeungDonald,et al.  Efficient Reuse Distance Analysis of Multicore Scaling for Loop-Based Parallel Programs , 2013 .

[35]  Ali Akoglu,et al.  Application-Specific Autonomic Cache Tuning for General Purpose GPUs , 2017, 2017 International Conference on Cloud and Autonomic Computing (ICCAC).

[36]  Donald Yeung,et al.  Coherent Profiles: Enabling Efficient Reuse Distance Analysis of Multicore Scaling for Loop-based Parallel Programs , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[37]  Milind Kulkarni,et al.  Accelerating multicore reuse distance analysis with sampling and parallelization , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[38]  Amir Rajabzadeh,et al.  VLAG: A very fast locality approximation model for GPU kernels with regular access patterns , 2017, 2017 7th International Conference on Computer and Knowledge Engineering (ICCKE).