NUMA-aware memory manager with dominant-thread-based copying GC

We propose a novel online method for identifying the preferred NUMA nodes of objects, with negligible overhead at both garbage collection time and object allocation time. As the number of CPUs (or NUMA nodes) in a system grows, it is critical for the memory manager of an object-oriented language runtime to exploit the low latency of local memory for high performance. To locate the CPU of the thread that frequently accesses an object, prior research uses runtime information about memory accesses sampled by the hardware. However, the overhead of this approach is high for a garbage collector. Our approach instead uses information about which thread can exclusively access an object, which we call the Dominant Thread (DoT). The dominant thread of an object is the thread that most often accesses the object, and it is identified without memory access samples. Our NUMA-aware GC performs DoT-based object copying, which copies each live object to the CPU where its dominant thread was last dispatched before the GC. The dominant thread information is obtained from the thread stacks and from objects that are locked or reserved by threads, and is propagated through the object reference graph. We demonstrate that our approach can improve the performance of benchmark programs such as SPECpower_ssj2008, SPECjbb2005, and SPECjvm2008. We prototyped a NUMA-aware memory manager on a modified version of the IBM Java VM and tested it on a cc-NUMA POWER6 machine with eight NUMA nodes. Our NUMA-aware GC achieved performance improvements of up to 14.3%, and 2.0% on average, over a JVM that used only the NUMA-aware allocator. The total improvement from using both the NUMA-aware allocator and the NUMA-aware GC is up to 53.1%, and 10.8% on average.
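The propagation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary-based heap model, and the breadth-first, first-claim-wins rule for objects reachable from more than one thread are all assumptions made for illustration. Seeds come from objects locked or reserved by a thread and from each thread's stack roots; the dominant thread is then pushed along references, and each object's copy target is the node where its dominant thread last ran.

```python
from collections import deque

def assign_dominant_threads(refs, thread_roots, lock_owner):
    """Return {object: dominant thread} for every reachable object.

    refs:         {object: [referenced objects]}   (hypothetical heap model)
    thread_roots: {thread: [objects on its stack]}
    lock_owner:   {object: thread that locked or reserved it}
    """
    # Seed with locked/reserved objects, then with stack roots.
    dot = dict(lock_owner)
    for thread, roots in thread_roots.items():
        for obj in roots:
            dot.setdefault(obj, thread)

    # Propagate through the reference graph. Assumption: the first
    # thread to reach an object claims it (breadth-first order).
    queue = deque(dot)
    while queue:
        obj = queue.popleft()
        for child in refs.get(obj, ()):
            if child not in dot:
                dot[child] = dot[obj]
                queue.append(child)
    return dot

def copy_targets(dot, last_node):
    """Map each live object to the NUMA node where its dominant
    thread was last dispatched before the collection."""
    return {obj: last_node[thread] for obj, thread in dot.items()}

# Tiny example heap: T1 reaches a -> b, T2 reaches x -> {c, y},
# and T2 holds a reserved lock on "lock", which references z.
refs = {"a": ["b"], "b": ["c"], "x": ["c", "y"], "lock": ["z"]}
thread_roots = {"T1": ["a"], "T2": ["x"]}
lock_owner = {"lock": "T2"}
last_node = {"T1": 0, "T2": 3}

dot = assign_dominant_threads(refs, thread_roots, lock_owner)
targets = copy_targets(dot, last_node)
```

In this example, `c` is reachable from both threads; under the assumed tie-breaking rule it is claimed by `T2`, which reaches it in fewer hops, so it would be copied to node 3. A real collector would fold this bookkeeping into its copying traversal rather than run it as a separate pass.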
