Paper information - MGSim + MGMark: A Framework for Multi-GPU System Research

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy these performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with built-in support for multi-threaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs by $5.5\%$ on average from actual GPU hardware. We also achieve average speedups of $3.5\times$ in functional emulation and $2.5\times$ in architectural simulation with 4 CPU cores, while delivering the same accuracy as the serial simulation. We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system (U-MGPU) and a discrete multi-GPU system (D-MGPU) that both utilize a unified memory space and cross-GPU memory access. We evaluate the design implications from our case study, suggesting that D-MGPU is an attractive programming model for future multi-GPU systems.
