Estimating the Performance Impact of the MCDRAM on KNL Using Dual-Socket Ivy Bridge Nodes on Cray XC 30

NERSC is preparing for its next petascale system, named Cori, a Cray XC system based on the Intel KNL MIC architecture. Each Cori node will have 72 cores (288 threads), 512 bit vector units, and a low capacity (16GB) and high bandwidth (~5x DDR4) on-package memory (MCDRAM or HBM). To help applications get ready for Cori, NERSC has developed optimization strategies that focus on the MPI+OpenMP program model, vectorization, and the HBM. While the optimization on MPI+OpenMP and vectorization can be carried out on today’s multi-core architectures, optimization of the HBM is difficult to perform where the HBM is unavailable. In this paper, we will present our HBM performance analysis on the VASP code, a widely used materials science code, using Intel's development tools, Memkind and AutoHBW, and a dual-socket Ivy Bridge processor node on Edison, a Cray XC30, as a proxy to the HBM on KNL. Keywords-HBM; MCDRAM; KNL; VASP; memory bandwidth; Memkind; AutoHBW; performance