Exploiting Data Similarity to Reduce Memory Footprints

Memory size has long limited large-scale applications on high-performance computing (HPC) systems. Because compute nodes frequently lack swap space, physical memory often limits problem sizes. Increasing core counts per chip and power density constraints, which limit the number of DIMMs per node, have exacerbated this problem. Further, DRAM constitutes a significant portion of overall HPC system cost. Therefore, instead of adding more DRAM to the nodes, mechanisms that manage memory usage more efficiently, preferably transparently, could increase effective DRAM capacity and thus the benefit of multicore nodes for HPC systems. MPI application processes often exhibit significant data similarity: identical data regions occupy multiple physical locations across the individual rank processes within a multicore node and thus offer potential savings in memory capacity. These regions, which reside primarily in the heap, are dynamic, which makes them difficult to manage statically. Our novel memory allocation library, SBLLmalloc, automatically identifies identical memory blocks and merges them into a single copy. Our implementation is transparent to the application and does not require any kernel modifications. Overall, we demonstrate that SBLLmalloc reduces the memory footprint of a range of MPI applications by 32.03% on average and by up to 60.87%. Further, SBLLmalloc supports IRS problem sizes over 21.36% larger than standard memory management techniques allow, thus significantly increasing effective system size. Similarly, SBLLmalloc requires 43.75% fewer nodes than standard memory management techniques to solve an AMG problem.
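The merging idea described above can be illustrated with a small simulation. The following is a minimal sketch, not the SBLLmalloc implementation: it assumes a fixed 4096-byte block size, models each rank's heap as a list of byte strings, and uses a hash lookup followed by a byte-for-byte comparison to find identical blocks. The names `PAGE_SIZE`, `merge_identical_pages`, and `footprint_reduction` are hypothetical and chosen only for this example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed block granularity; not taken from the paper


def merge_identical_pages(rank_pages):
    """Merge identical blocks across ranks into a single stored copy.

    rank_pages: one list per rank, each holding that rank's blocks as bytes.
    Returns (mapping, store), where mapping[rank][i] is the index in `store`
    of the single copy backing block i of that rank.
    """
    store = []      # unique block contents (the merged copies)
    by_digest = {}  # digest -> indices in `store` with that digest
    mapping = []
    for pages in rank_pages:
        row = []
        for page in pages:
            digest = hashlib.sha256(page).digest()
            for idx in by_digest.get(digest, []):
                # Confirm equality byte for byte before sharing a copy.
                if store[idx] == page:
                    row.append(idx)
                    break
            else:
                # No identical block seen so far: keep a new copy.
                store.append(page)
                by_digest.setdefault(digest, []).append(len(store) - 1)
                row.append(len(store) - 1)
        mapping.append(row)
    return mapping, store


def footprint_reduction(rank_pages, store):
    """Fraction of blocks saved by keeping only one copy of identical blocks."""
    total = sum(len(pages) for pages in rank_pages)
    return 1.0 - len(store) / total
```

For instance, with four ranks that each hold two blocks common to all ranks and one rank-specific block, the sketch keeps 6 of the 12 blocks. A real allocator would also have to handle writes to a merged block (for example, by giving the writer a private copy again), which this simulation does not model.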
