Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures

We study contention for memory bandwidth between computation and communication in supercomputers built from multicore CPUs. The problem arises when communication and computation are overlapped and both operations compete for the same memory bandwidth. This contention is most visible at the limits of scalability, when communication and computation take similar amounts of time, and it must therefore be taken into account to reach maximum scalability in memory-bandwidth-bound applications. Typical examples of codes affected by memory bandwidth contention are sparse matrix-vector computations, graph algorithms, and many machine learning workloads: they demand both high memory bandwidth and substantial inter-node communication while performing relatively few arithmetic operations. The problem is even more relevant in truly heterogeneous computations where CPUs and accelerators are used in concert. There it can lead to mispredictions of expected performance and consequently to suboptimal load balancing between CPU and accelerator, which in turn can leave powerful accelerators idle and cause a large loss of performance. We propose a simple benchmark to quantify the performance lost to memory bandwidth contention. Building on it, we derive a theoretical model of the impact of this phenomenon on parallel memory-bound applications. We validate the model on scientific computations, discuss the practical relevance of the problem, and suggest possible techniques to remedy it.
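The kind of benchmark described above can be sketched as follows; this is a minimal illustration of the measurement idea, not the authors' actual benchmark, and all names, array sizes, and parameters are assumptions. A STREAM-like triad measures sustainable memory bandwidth twice: once alone, and once while a background thread emulates communication traffic by streaming large memory copies over the same bus. The ratio of the two bandwidths quantifies the contention penalty.

```python
import threading
import time
import numpy as np

N = 8_000_000   # ~64 MB per array: chosen (assumed) to exceed last-level cache
REPS = 10

def triad_gbps(a, b, c, tmp):
    """Time a STREAM-like triad a = b + 3*c; return achieved GB/s."""
    t0 = time.perf_counter()
    for _ in range(REPS):
        np.multiply(c, 3.0, out=tmp)  # read c, write tmp
        np.add(b, tmp, out=a)         # read b and tmp, write a
    dt = time.perf_counter() - t0
    # 5 array traversals of 8-byte doubles per iteration (see comments above)
    return 5 * 8 * N * REPS / dt / 1e9

def comm_emulator(stop):
    """Stand-in for communication traffic: keep the memory bus busy with copies."""
    src = np.random.rand(N)
    dst = np.empty(N)
    while not stop.is_set():
        np.copyto(dst, src)

a, b, c, tmp = (np.random.rand(N) for _ in range(4))
baseline = triad_gbps(a, b, c, tmp)

stop = threading.Event()
t = threading.Thread(target=comm_emulator, args=(stop,))
t.start()   # NumPy releases the GIL on large-array ops, so the copy runs concurrently
contended = triad_gbps(a, b, c, tmp)
stop.set()
t.join()

print(f"baseline  bandwidth: {baseline:.1f} GB/s")
print(f"contended bandwidth: {contended:.1f} GB/s "
      f"(slowdown {baseline / contended:.2f}x)")
```

On a machine whose memory bus is already near saturation under the triad alone, the contended run typically reports noticeably lower bandwidth; the measured slowdown is exactly the quantity a contention-aware performance model has to account for when communication and computation are overlapped.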
