MGSim + MGMark: A Framework for Multi-GPU System Research

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy these performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with built-in support for multi-threaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs by $5.5\%$ on average when compared against actual GPU hardware. With 4 CPU cores, we also achieve average speedups of $3.5\times$ in functional emulation and $2.5\times$ in architectural simulation, while delivering the same accuracy as the serial simulation. We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system~(U-MGPU) and a discrete multi-GPU system~(D-MGPU), both of which utilize a unified memory space and cross-GPU memory access. Our evaluation of the design implications of this case study suggests that D-MGPU is an attractive programming model for future multi-GPU systems.
