Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator

The demands of high-performance computing (HPC) and machine learning (ML) workloads have resulted in the rapid architectural evolution of GPUs over the last decade. The growing memory footprint and diversity of data types in these workloads have required GPUs to embrace micro-architectural heterogeneity and increased memory system sophistication to scale performance. Effective simulation of new architectural features early in the design cycle enables quick exploration of design trade-offs across this increasingly diverse set of workloads. This work provides a retrospective on the design and development of NVArchSim (NVAS), an architectural simulator used within NVIDIA to design and evaluate features that are difficult to appraise using other methodologies due to workload type, size, complexity, or lack of modeling flexibility. We argue that overly precise and/or overly slow architectural models hamper an architect’s ability to evaluate new features within a reasonable time frame, hurting productivity. Because of its speed, NVAS is being used to trace and evaluate hundreds of HPC and state-of-the-art ML workloads on single-GPU or multi-GPU systems. By adding component fidelity only when necessary to improve system-level modeling accuracy, NVAS delivers simulation speed orders of magnitude higher than most publicly available GPU simulators while retaining high levels of accuracy and simulation flexibility. Building trustworthy high-level simulation platforms is a difficult exercise in balance and compromise; we share our experiences to help and encourage those in academia who take on the challenge of building GPU simulation platforms.
