Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling

In computer architecture, significant innovation frequently comes from industry. However, the simulation tools used by industry are often not released for open use, and even when they are, the exact details of industrial designs are not disclosed. As a result, research in the architecture space must ensure that assumptions about contemporary processor design remain true.

To help bridge the gap between opaque industrial innovation and public research, we introduce three mechanisms that make it much easier for GPU simulators to keep up with industry. First, we introduce a new GPU simulator frontend that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA. Second, we extensively update GPGPU-Sim's performance model to increase its level of detail, configurability, and accuracy. Finally, surrounding the new frontend and flexible performance model is an infrastructure that enables quick, detailed validation. A comprehensive set of microbenchmarks and automated correlation plotting ease the modeling process.

We use these three new mechanisms to build Accel-Sim, a detailed simulation framework that decreases cycle error by 79 percentage points over a wide range of 80 workloads, consisting of 1,945 kernel instances. We further demonstrate that Accel-Sim is able to simulate benchmark suites that no other open-source simulator can. In particular, we use Accel-Sim to simulate an additional 60 workloads, comprising 11,440 kernel instances, from the machine learning benchmark suite Deepbench. Deepbench makes use of closed-source, hand-tuned kernels with no virtual ISA implementation. Using a rigorous counter-by-counter analysis, we validate Accel-Sim against contemporary GPUs.

Finally, to highlight the effects of falling behind industry, this paper presents two case studies that demonstrate how incorrect baseline assumptions can hide new areas of opportunity and lead to potentially incorrect design decisions.
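
As an illustration of what a reduction in cycle error measured in percentage points means, the sketch below compares two hypothetical simulator configurations against hardware cycle counts. It assumes the error metric is a mean absolute percentage error over kernels, which the abstract does not specify. The function name `mean_abs_pct_error` and all cycle counts are invented for the example and are not taken from the paper.

```python
from statistics import mean


def mean_abs_pct_error(sim_cycles, hw_cycles):
    """Mean absolute percentage error of simulated vs. measured cycles.

    Assumption: one entry per kernel instance, in the same order in both lists.
    """
    if len(sim_cycles) != len(hw_cycles) or not hw_cycles:
        raise ValueError("need two non-empty lists of equal length")
    return 100.0 * mean(abs(s - h) / h for s, h in zip(sim_cycles, hw_cycles))


# Hypothetical per-kernel cycle counts (not data from the paper).
hw_cycles = [1000, 2000, 4000]        # measured on hardware
baseline_sim = [1500, 1000, 6000]     # older, less accurate model
updated_sim = [1100, 1800, 4400]      # updated model

baseline_error = mean_abs_pct_error(baseline_sim, hw_cycles)
updated_error = mean_abs_pct_error(updated_sim, hw_cycles)

# A drop expressed in percentage points is the plain difference of the two
# error percentages, not their ratio.
reduction_points = baseline_error - updated_error

print(f"baseline error: {baseline_error:.1f}%")
print(f"updated error:  {updated_error:.1f}%")
print(f"reduction:      {reduction_points:.1f} percentage points")
```

The distinction matters when reading the headline number: a decrease stated in percentage points describes the absolute gap between two error rates, whereas a relative reduction would divide that gap by the baseline error.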
