Optimizing for KNL Usage Modes When Data Doesn't Fit in MCDRAM

Technologies such as Multi-Channel DRAM (MCDRAM) and High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend raises the question of how applications should manage data transfers between levels of the memory hierarchy. This paper evaluates different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycore processors, using a sorting kernel and a sorting-based streaming benchmark. We develop a performance model for the benchmark and use experimental evidence to demonstrate the correctness of the model. The model projects near-optimal numbers of copy threads for memory-bandwidth-bound computations. On KNL, we demonstrate up to a 1.9X speedup for sort when the problem does not fit in MCDRAM, compared with an OpenMP GNU sort that does not use MCDRAM.
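To make the setting concrete, the following is a minimal sketch of one generic way to sort data that is larger than a fast memory: sort chunks that each fit in the fast level, then merge the sorted runs. This is a standard chunk-and-merge pattern, not necessarily the algorithm evaluated in the paper. The function name `chunked_sort` and the parameter `fast_capacity` (a stand-in for how many items fit in MCDRAM) are hypothetical, and the sketch is single-threaded and does not model copy threads or real MCDRAM allocation.

```python
import heapq


def chunked_sort(data, fast_capacity):
    """Sort `data` by sorting chunks of at most `fast_capacity` items, then merging.

    `fast_capacity` is a hypothetical stand-in for the number of items that
    fit in the fast memory level; no real MCDRAM allocation happens here.
    """
    if fast_capacity < 1:
        raise ValueError("fast_capacity must be at least 1")

    # Phase 1: each slice stands in for a block that would be copied into
    # fast memory, sorted there, and written back as a sorted run.
    runs = [
        sorted(data[start:start + fast_capacity])
        for start in range(0, len(data), fast_capacity)
    ]

    # Phase 2: k-way merge of the sorted runs; each run is read once in order.
    return list(heapq.merge(*runs))
```

For example, `chunked_sort([5, 3, 9, 1, 7, 2, 8], 3)` sorts three runs of at most three items each and merges them into one sorted list. In a real implementation the chunk size, and the number of threads devoted to copying versus sorting, would be the tuning parameters that a performance model of the kind described above is meant to guide.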
