Statistical pattern based modeling of GPU memory access streams

Recent studies have shown that modern GPU performance is often limited by the memory system. Optimizing the memory hierarchy requires GPU designers to draw design insights from the cache and memory behavior of end-user applications. Unfortunately, end-user workloads are often hard to obtain because the software or data is confidential or proprietary. Furthermore, early design space exploration of cache and memory systems is often inefficient, limited either by the slow speed of detailed simulation or by the narrow scope of state-of-the-art analytical cache models. To enable efficient GPU memory system exploration, we present a novel methodology and framework that statistically models the locality of the GPU memory access stream. The proposed G-MAP (GPU Memory Access Proxy) framework models the regularity in code-localized memory access patterns of GPGPU applications and the parallelism in the GPU's execution model to create miniaturized memory proxies. We evaluate G-MAP using 18 GPGPU benchmarks and show that G-MAP proxies can replicate the cache/memory performance of the original applications with over 90% accuracy across more than 5000 different L1/L2 cache, prefetcher and memory configurations.
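
To make the idea of code-localized pattern modeling concrete, the following is a minimal illustrative sketch, not the G-MAP implementation. It assumes a trace of (program counter, address) pairs, records a stride histogram per static memory instruction, and then regenerates a synthetic address stream by sampling from those histograms. The function names `profile_strides` and `generate_proxy` and the trace format are hypothetical, and the sketch ignores the GPU's thread-level parallelism, which the actual framework models.

```python
from collections import Counter, defaultdict
import random


def profile_strides(trace):
    """Build a per-PC stride histogram from (pc, address) pairs.

    Returns the first address seen for each PC and, for each PC,
    a Counter mapping stride -> number of occurrences.
    """
    last_addr = {}
    first_addr = {}
    strides = defaultdict(Counter)
    for pc, addr in trace:
        if pc in last_addr:
            strides[pc][addr - last_addr[pc]] += 1
        else:
            first_addr[pc] = addr
        last_addr[pc] = addr
    return first_addr, strides


def generate_proxy(first_addr, strides, pc_sequence, seed=0):
    """Emit a synthetic (pc, address) stream for the given PC order.

    Each PC starts at its profiled first address; later accesses add a
    stride sampled from that PC's histogram (weighted by frequency).
    """
    rng = random.Random(seed)
    current = {}
    proxy = []
    for pc in pc_sequence:
        if pc not in current:
            current[pc] = first_addr[pc]
        else:
            hist = strides[pc]
            if hist:  # a PC seen only once has no stride to sample
                values = list(hist.keys())
                weights = list(hist.values())
                current[pc] += rng.choices(values, weights=weights)[0]
        proxy.append((pc, current[pc]))
    return proxy
```

For a perfectly regular trace, where each instruction always advances by one fixed stride, the regenerated stream matches the original exactly. Irregular instructions are reproduced only in distribution, which is the sense in which such a proxy is statistical.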
