Paper Information - Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis

Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis

Reuse distance analysis (RDA) is a popular method for calculating locality profiles and modeling cache performance. This article proposes a framework that applies RDA to obtain reuse distance profiles of graphics processing unit (GPU) kernels. To study the implications of hardware-related parameters in RDA, two algorithms are employed: a high-level, cache-independent algorithm called HLRDA and a detailed algorithm called DRDA. DRDA models the effects of reservation fails in cache blocks and miss status holding registers (MSHRs) to provide accurate cache-related performance metrics; its reuse profiles are therefore cache-specific. Across a selection of GPU kernels, DRDA obtained the L1 miss-rate breakdowns with an average error of 3.86% and outperformed the state-of-the-art RDA in accuracy. In terms of speed, DRDA is 246,000× slower than real GPU execution and 11× faster than GPGPU-Sim. HLRDA ignores the cache-related parameters, so its reuse profiles are general and can be used to calculate miss rates for any cache size; its average error was 16.9%.
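To make the idea of a cache-independent reuse profile concrete, the following is a minimal sketch of the classic sequential stack-distance formulation of RDA. It is not the HLRDA or DRDA algorithm from the article: it assumes a single access stream, a fully associative LRU cache, and no GPU-specific effects such as reservation fails or MSHRs. The function names (`reuse_distances`, `reuse_profile`, `miss_rate`) and the toy trace are hypothetical and chosen only for illustration.

```python
from collections import Counter

INF = float("inf")  # marks a cold (first-time) access


def reuse_distances(trace):
    """Return the reuse distance of every access in `trace`.

    The reuse distance is the number of distinct cache lines touched
    between two consecutive accesses to the same line. A simple list is
    used as the LRU stack, so this runs in O(N * M) time; it is meant
    to be readable, not fast.
    """
    stack = []       # least recently used first, most recently used last
    distances = []
    for line in trace:
        if line in stack:
            idx = stack.index(line)
            # Lines above `line` in the stack were touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(INF)
        stack.append(line)
    return distances


def reuse_profile(trace):
    """Histogram mapping reuse distance -> number of accesses."""
    return Counter(reuse_distances(trace))


def miss_rate(profile, cache_lines):
    """Miss rate of a fully associative LRU cache holding `cache_lines` lines.

    An access misses when its reuse distance is at least the cache
    capacity (cold accesses always miss).
    """
    total = sum(profile.values())
    misses = sum(count for dist, count in profile.items() if dist >= cache_lines)
    return misses / total


if __name__ == "__main__":
    # Hypothetical trace of cache-line identifiers.
    trace = ["A", "B", "C", "A", "B", "D", "A"]
    profile = reuse_profile(trace)
    for capacity in (2, 3, 4):
        print(capacity, miss_rate(profile, capacity))
```

Because the profile is built once and then queried for several capacities, the sketch illustrates why a cache-independent profile can be reused across cache sizes; a detailed model in the spirit of DRDA would instead have to fix the cache configuration before analyzing the trace.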

Amir Rajabzadeh | Mohsen Kiani
