How to Manage High-Bandwidth Memory Automatically

This paper develops an algorithmic foundation for automated management of the multilevel-memory systems common to new supercomputers. In particular, the High-Bandwidth Memory (HBM) of these systems has latency similar to that of DRAM and a smaller capacity, but much higher bandwidth. Because of these atypical characteristics, systems equipped with HBM do not fit classic memory-hierarchy models. Caches are generally managed automatically by the hardware; in contrast, programmers of some current HBM-equipped supercomputers can choose to manage HBM explicitly. This process is problem-specific and resource-intensive. Vendors offer this option because there is no consensus on how to manage HBM automatically so as to guarantee good performance, or on whether this is even possible. In this paper, we give theoretical support for automatic HBM management by developing simple algorithms that can automatically control HBM and deliver good performance on multicore systems. HBM management is starkly different from traditional caching, both in optimization objectives and in algorithm development. Since DRAM and HBM have similar latencies, minimizing HBM misses (provably) turns out not to be the right memory-management objective. Instead, we focus directly on minimizing makespan. In addition, while cache-management algorithms must focus on which pages to keep in cache, HBM management requires answering two questions: (1) which pages to keep in HBM and (2) how to use the limited bandwidth from HBM to DRAM. It turns out that the natural approach of using LRU for the first question and FCFS (First-Come-First-Serve) for the second question is provably bad. Instead, we provide a priority-based approach that is simple, efficiently implementable, and $O(1)$-competitive for makespan when all multicore threads are independent.
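The two questions above can be made concrete with a small simulation. The sketch below is a toy model, not the paper's algorithm or its formal model: it assumes unit-time steps, a DRAM-to-HBM channel that moves `channel_width` pages per step, LRU replacement in HBM, and private pages per thread (independent threads). The `"priority"` policy here is a hypothetical static ordering by thread index, chosen only to illustrate channel arbitration by priority; the function name `simulate` and all parameters are illustrative.

```python
from collections import OrderedDict


def simulate(traces, hbm_pages, policy="priority", channel_width=1):
    """Return the makespan (in time steps) of a toy HBM model.

    traces        -- one list of page ids per thread (pages are private)
    hbm_pages     -- HBM capacity in pages, managed with LRU replacement
    policy        -- "fcfs": oldest outstanding miss is fetched first
                     "priority": lowest thread index is fetched first
                     (an assumed static priority, for illustration only)
    channel_width -- pages moved from DRAM to HBM per time step
    """
    if channel_width < 1 or hbm_pages < channel_width:
        raise ValueError("need 1 <= channel_width <= hbm_pages")
    if policy not in ("fcfs", "priority"):
        raise ValueError("unknown policy: " + policy)

    hbm = OrderedDict()        # page -> None, least recently used first
    pos = [0] * len(traces)    # index of each thread's next request
    waiting = {}               # thread id -> step at which its miss was queued
    step = 0

    while any(pos[t] < len(traces[t]) for t in range(len(traces))):
        step += 1

        # Phase 1: every thread whose requested page is in HBM is served.
        for t, trace in enumerate(traces):
            if pos[t] == len(trace):
                continue
            page = (t, trace[pos[t]])
            if page in hbm:
                hbm.move_to_end(page)      # mark as most recently used
                pos[t] += 1
                waiting.pop(t, None)
            else:
                waiting.setdefault(t, step)

        # Phase 2: the channel fetches pages for the chosen waiting threads.
        if policy == "fcfs":
            order = sorted(waiting, key=lambda t: (waiting[t], t))
        else:
            order = sorted(waiting)
        for t in order[:channel_width]:
            if len(hbm) >= hbm_pages:
                hbm.popitem(last=False)    # evict the least recently used page
            hbm[(t, traces[t][pos[t]])] = None

    return step


if __name__ == "__main__":
    traces = [[1, 2, 1, 2], [7, 8, 7, 8], [3, 3, 3, 3]]
    for policy in ("fcfs", "priority"):
        print(policy, simulate(traces, hbm_pages=2, policy=policy))
```

The quantity returned is makespan, the objective named above, rather than a miss count. Swapping the `policy` argument changes only how the channel is arbitrated, which makes it easy to compare the two choices on a given set of traces under these simplified assumptions.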
