Trash Talk: Accelerating Garbage Collection on Integrated GPUs is Worthless

Systems that integrate heterogeneous processors with unified memory let those processors cooperate with minimal development complexity. They place accelerators such as GPUs on the same die as the CPU cores to accommodate applications with varying levels of parallelism. Such integration is becoming very common in modern chip architectures, and it places a burden (or opportunity) on application and system programmers to exploit the full potential of these chips. In this paper we evaluate whether running garbage collection on integrated GPU systems yields any performance benefit, and we discuss how difficult it would be for the programmer to realize such gains. The proliferation of garbage-collected languages on platforms ranging from handheld mobile devices to data centers makes garbage collection an interesting target to examine on such systems, and the results can offer valuable lessons for other applications. We build a framework that allows us to offload garbage collection tasks to the integrated GPU from within the JVM. We identify the dominant phases of garbage collection and study the viability of offloading them to the integrated GPU. We find that the current state of these systems does not provide an advantage for accelerating this task: performance gains are limited, partly because an integrated GPU has little memory-bandwidth advantage over the CPU, and partly because of costly atomic operations.
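To make the remark about atomic operations concrete, the following is a minimal sketch of a mark phase over a toy object graph. It is not the framework described in the abstract, which does not say which phases were offloaded; marking is assumed here only because it is a common parallel GC phase. The class `MarkSketch`, the `edges` array encoding of the heap, and the `mark` method are hypothetical names chosen for illustration. The sketch runs on a single thread, but it claims each object with a compare-and-set, as a parallel marker would need to do so that two workers never process the same object.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Hypothetical toy heap: object i references the objects listed in edges[i]. */
public class MarkSketch {

    /** Marks every object reachable from roots and returns how many were marked. */
    static int mark(int[][] edges, int[] roots, AtomicIntegerArray markBits) {
        Deque<Integer> worklist = new ArrayDeque<>();
        int marked = 0;
        for (int root : roots) {
            // Claim the root: only the caller that flips 0 -> 1 gets to scan it.
            if (markBits.compareAndSet(root, 0, 1)) {
                worklist.push(root);
                marked++;
            }
        }
        while (!worklist.isEmpty()) {
            int obj = worklist.pop();
            for (int child : edges[obj]) {
                // One atomic operation per reference visited; in a parallel
                // marker this is what prevents duplicate work on shared objects.
                if (markBits.compareAndSet(child, 0, 1)) {
                    worklist.push(child);
                    marked++;
                }
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        // Objects 0, 1, 2 form a cycle reachable from the root; 3 and 4 are garbage.
        int[][] edges = { {1, 2}, {2}, {0}, {4}, {} };
        AtomicIntegerArray markBits = new AtomicIntegerArray(edges.length);
        int marked = mark(edges, new int[] {0}, markBits);
        System.out.println(marked + " of " + edges.length + " objects marked");
    }
}
```

Every reference followed costs one atomic read-modify-write plus a dependent memory access, so a traversal of this shape is dominated by memory traffic and synchronization rather than arithmetic, which is consistent with the two limiting factors the abstract names.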
