ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems

New memory technologies, such as non-volatile memory and stacked memory, have reshaped the memory hierarchies of modern and emerging computer architectures. It has become common to see memories of different types integrated into the same system, an arrangement known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small, fast component and a large, slow component. This encourages a new style of data processing and confronts developers with a new problem: given two memory types, how should applications be redesigned to benefit from this arrangement, and how should data be placed efficiently across them? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware. To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluating ProfDP with a number of parallel benchmarks running on a state-of-the-art emulator and on a real machine with heterogeneous memory, we show that it can guide near-optimal data placement that maximizes performance with minimal programming effort.
