Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th generation of HPC-grade GPUs, the Ampere 100 (A100), claiming significant performance improvements over previous generations, particularly for AI workloads, and introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI benchmarks, and can we expect it to deliver the application improvements we have grown used to with previous GPU generations? In this paper, we benchmark the A100 GPU and compare it to four previous generations of GPUs, with a particular focus on empirically quantifying our derived performance expectations. We find that the A100 delivers a smaller performance increase than previous generations on the well-known Rodinia benchmark suite. We show that some of these performance anomalies can be remedied through clever use of the new data-movement features, which we microbenchmark to demonstrate where (and, more importantly, how) they should be used.
