Fast, accurate, and scalable memory modeling of GPGPUs using reuse profiles

In this paper, we introduce PPT-GPU-Mem, an accurate and scalable memory modeling framework for General Purpose Graphics Processing Units (GPGPUs): the Performance Prediction Toolkit for GPU cache memories. PPT-GPU-Mem predicts the performance of different GPUs' cache hierarchies (L1 and L2) based on reuse profiles. We extract a memory trace for each GPU kernel once in its lifetime using the recently released binary instrumentation tool NVBIT. The memory trace extraction is architecture-independent and can be done on any available NVIDIA GPU. PPT-GPU-Mem can then model the caches of any NVIDIA GPU given their parameters and the extracted memory trace. We model the Volta Tesla V100 and the Turing TITAN RTX, and validate our framework using kernels from the Polybench and Rodinia benchmark suites as well as two deep learning applications from the Tango DNN benchmark suite. We provide two models, MBRDP (Multiple Block Reuse Distance Profile) and OBRDP (One Block Reuse Distance Profile), with varying assumptions, accuracy, and speed. Compared to real hardware, our accuracy ranges from 92% to 99% across the different cache levels while maintaining scalability in producing the results. Finally, we illustrate that PPT-GPU-Mem can be used for design space exploration and for predicting the cache performance of future GPUs.
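To make the underlying idea concrete, the sketch below is a minimal, illustrative implementation of classic reuse (stack) distance analysis over a memory trace, the building block that reuse-profile-based cache models such as PPT-GPU-Mem rest on. It is not the authors' code: the function names and the fully associative LRU assumption are ours, chosen for clarity; the real framework models set-associative GPU caches and GPU-specific trace interleaving.

```python
from collections import Counter

def reuse_distance_profile(trace):
    """Map each access in a trace of cache-line addresses to its reuse
    distance: the number of *distinct* lines touched since the previous
    access to the same line (inf marks a first-time, i.e. cold, access)."""
    stack = []            # LRU stack of lines, most recently used at the end
    profile = Counter()
    for line in trace:
        if line in stack:
            # Depth from the top of the LRU stack = reuse distance.
            depth = len(stack) - 1 - stack.index(line)  # O(n), for clarity
            profile[depth] += 1
            stack.remove(line)
        else:
            profile[float("inf")] += 1
        stack.append(line)
    return profile

def lru_hit_rate(profile, capacity_lines):
    """Under fully associative LRU, an access hits iff its reuse
    distance is smaller than the cache capacity in lines."""
    total = sum(profile.values())
    hits = sum(n for d, n in profile.items()
               if d != float("inf") and d < capacity_lines)
    return hits / total if total else 0.0
```

Because the profile is computed once per trace, the hit rate of any cache size can then be read off without re-simulating, which is what makes reuse-profile-based models cheap to evaluate across many cache configurations.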

[1]  John Kim,et al.  Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[2]  Gopinath Chennupati,et al.  An analytical memory hierarchy model for performance prediction , 2017, 2017 Winter Simulation Conference (WSC).

[3]  Emmett Kilgariff,et al.  Fermi GF100 GPU Architecture , 2011, IEEE Micro.

[4]  Aristides Efthymiou,et al.  Synthetic Trace-Driven Simulation of Cache Memory , 2007, 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07).

[5]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[6]  Gopinath Chennupati,et al.  Scalable Performance Prediction of Codes with Memory Hierarchy and Pipelines , 2019, SIGSIM-PADS.

[7]  Sriram Krishnamoorthy,et al.  Cache miss characterization and data locality optimization for imperfectly nested loops on shared memory multiprocessors , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[8]  Amir Rajabzadeh,et al.  Efficient Cache Performance Modeling in GPUs Using Reuse Distance Analysis , 2018, ACM Trans. Archit. Code Optim..

[9]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[10]  Shuaiwen Song,et al.  Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.

[11]  Tao Tang,et al.  Cache Miss Analysis for GPU Programs Based on Stack Distance Profile , 2011, 2011 31st International Conference on Distributed Computing Systems.

[12]  C. Cascaval,et al.  Calculating stack distances efficiently , 2003, MSP '02.

[14]  Donald Yeung,et al.  Studying multicore processor scaling via reuse distance analysis , 2013, ISCA.

[15]  Gopinath Chennupati,et al.  Low Overhead Instruction Latency Characterization for NVIDIA GPGPUs , 2019, 2019 IEEE High Performance Extreme Computing Conference (HPEC).

[16]  Mahmut T. Kandemir,et al.  Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[18]  Mark Horowitz,et al.  An analytical cache model , 1989, TOCS.

[19]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[20]  Gopinath Chennupati,et al.  PPT-GPU: Scalable GPU Performance Modeling , 2019, IEEE Computer Architecture Letters.

[22]  Henk Corporaal,et al.  A detailed GPU cache model based on reuse distance theory , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[23]  Lieven Eeckhout,et al.  Performance analysis through synthetic trace generation , 2000, 2000 IEEE International Symposium on Performance Analysis of Systems and Software. ISPASS (Cat. No.00EX422).

[24]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).

[25]  Satyajayant Misra,et al.  A Scalable Analytical Memory Model for CPU Performance Prediction , 2017, PMBS@SC.

[26]  Siddhartha Chatterjee,et al.  Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.

[27]  Donald Yeung,et al.  Guiding Locality Optimizations for Graph Computations via Reuse Distance Analysis , 2017, IEEE Computer Architecture Letters.

[28]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[29]  Dongwei Wang,et al.  A reuse distance based performance analysis on GPU L1 data cache , 2016, 2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC).

[30]  Gopinath Chennupati,et al.  Verified instruction-level energy consumption measurement for NVIDIA GPUs , 2020, CF.

[31]  Sudhakar Yalamanchili,et al.  Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  T. G. Venkatesh,et al.  Analytical Miss Rate Calculation of L2 Cache from the RD Profile of L1 Cache , 2018, IEEE Transactions on Computers.

[33]  Maged M. Michael,et al.  Accuracy and speed-up of parallel trace-driven architectural simulation , 1997, Proceedings 11th International Parallel Processing Symposium.

[34]  Donald Yeung,et al.  Optimizing locality in graph computations using reuse distance profiles , 2017, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC).

[35]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[36]  David W. Nellans,et al.  Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[37]  Chen Ding,et al.  Miss rate prediction across all program inputs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.

[38]  Yun Liang,et al.  An efficient compiler framework for cache bypassing on GPUs , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[39]  Steve Carr,et al.  Reuse-distance-based miss-rate prediction on a per instruction basis , 2004, MSP '04.

[40]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[41]  Erich Strohmaier,et al.  Quantifying Locality In The Memory Access Patterns of HPC Applications , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[42]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[43]  Hyeran Jeon,et al.  Detailed Characterization of Deep Neural Networks on GPUs and FPGAs , 2019, GPGPU@ASPLOS.

[44]  Oreste Villa,et al.  NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs , 2019, MICRO.

[45]  Gopinath Chennupati,et al.  GPUs Cache Performance Estimation using Reuse Distance Analysis , 2019, 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC).

[46]  Krishna M. Kavi,et al.  Gleipnir: a memory profiling and tracing tool , 2013, CARN.