Scaling up parallel GC work-stealing in many-core environments

Parallel copying garbage collection (GC) is widely used in de facto standard Java virtual machines such as OpenJDK and OpenJ9. OpenJDK uses work-stealing for copying objects in the Parallel GC and Garbage-First (G1) GC policies to balance the copying work among GC threads. When a thread has no task in its own queue, it acts as a thief and tries to steal a task from another thread's queue. When a thief succeeds in stealing a task, it processes the task and enqueues the task's children into its own queue, which is in turn accessible to other thieves.

Unfortunately, the overhead of the work-stealing framework becomes non-negligible when the number of GC threads is increased to minimize GC pause time. Because the number of tasks processed per thread decreases, thieves try to steal from others frequently but with a low success rate. When a thief repeatedly fails to steal, it must wait in a spin loop in the termination protocol of the work-stealing framework. Frequent spinning results in high CPU utilization, which is not acceptable in a large-scale data center where strict power management is required. This paper proposes two approaches, steal-best-of-many selection and spin-less termination, to reduce the overhead of the work-stealing framework. Steal-best-of-many selection reduces steal failures by changing the number of queues considered as steal candidates in accordance with the number of GC threads. Spin-less termination changes the procedure of the copying GC so that part of the object copying is moved into the spin loop. This reduces both the portion of the GC pause time spent on object copying and the CPU utilization of the spin loop. We developed a prototype on OpenJDK 8 and evaluated it using the SPECjbb2015 and SPECjvm2008 benchmarks. The critical-jOPS performance of SPECjbb2015 improved by up to 18%, and the scores of the SPECjvm2008 benchmarks improved by 1-5%.
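
The victim selection described above can be illustrated with a minimal Java sketch. It is not the prototype's code: the class `BestOfManySketch`, the methods `candidateCount`, `selectVictim`, and `trySteal`, and in particular the rule that maps the number of GC threads to the number of sampled queues are all assumptions made for illustration. The sketch also assumes that "best" means the sampled queue currently holding the most tasks, and it uses plain `ArrayDeque` queues, so it is a single-threaded illustration rather than a concurrent implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;

public class BestOfManySketch {
    // Hypothetical scaling rule (an assumption, not taken from the paper):
    // sample at least 2 queues and one more for every 16 GC threads.
    static int candidateCount(int gcThreads) {
        return Math.max(2, Math.min(gcThreads - 1, 2 + gcThreads / 16));
    }

    // Samples k queues other than the thief's own and returns the index of
    // the fullest one, or -1 if every sampled queue was empty.
    static int selectVictim(List<Deque<Integer>> queues, int self, int k, Random rnd) {
        int n = queues.size();
        if (n < 2) {
            return -1; // no other queue to steal from
        }
        int best = -1;
        int bestSize = 0;
        for (int i = 0; i < k; i++) {
            int v = rnd.nextInt(n - 1);
            if (v >= self) {
                v++; // skip the thief's own queue
            }
            int size = queues.get(v).size();
            if (size > bestSize) {
                best = v;
                bestSize = size;
            }
        }
        return best;
    }

    // One steal attempt: owners would push and pop at the tail,
    // so the thief takes from the head. Returns null on failure.
    static Integer trySteal(List<Deque<Integer>> queues, int self, Random rnd) {
        int victim = selectVictim(queues, self, candidateCount(queues.size()), rnd);
        return victim < 0 ? null : queues.get(victim).pollFirst();
    }

    public static void main(String[] args) {
        List<Deque<Integer>> queues = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            queues.add(new ArrayDeque<>());
        }
        queues.get(1).addLast(10);
        queues.get(1).addLast(11);
        System.out.println("stolen task: " + trySteal(queues, 0, new Random()));
    }
}
```

Sampling more candidates as the thread count grows raises the chance that at least one sampled queue is non-empty, which is the intuition behind reducing steal failures; the cost is reading more queue sizes per attempt.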
