A new approach to parallelising tracing algorithms

Tracing algorithms visit the reachable nodes of a graph and are central to activities such as garbage collection and marshalling. Traditional sequential algorithms use a worklist, repeatedly replacing a node with its unvisited children. Previous work on parallel tracing is processor-oriented, associating one worklist with each processor: worklist insertion and removal require no locking, and load balancing requires only occasional locking. However, since multiple worklists may contain the same node, significant locking is necessary to avoid concurrent visits by competing processors. This paper presents a memory-oriented solution: memory is partitioned into segments, and each segment has its own worklist containing only nodes in that segment. At any given time, at most one processor owns a given worklist. By arranging separate single-reader, single-writer forwarding queues to pass nodes from processor i to processor j, we can process objects in an order that gives lock-free mainline code and improved locality of reference. This refactoring is analogous to the way in which a compiler changes an iteration space to eliminate data dependencies. It is clear that our solution can be more effective on NUMA systems, and that it is even necessary when processor-local memory cannot be addressed from other processors. Somewhat surprisingly, it also often gives significantly better speed-up on modern multi-core architectures, because using caches to hide memory latency loses much of its effectiveness when there is significant cross-processor memory contention or when locking is necessary.
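The memory-oriented scheme described above can be sketched as a small single-threaded simulation. This is an illustration under stated assumptions, not the paper's implementation: the names `SEGMENT_SIZE`, `segment_of`, `owner_of` and `trace` are hypothetical, addresses are plain integers, segment ownership is fixed (the abstract only requires at most one owner at a time), and the processors are stepped round-robin rather than run in parallel.

```python
from collections import deque

SEGMENT_SIZE = 4  # hypothetical: number of addresses per memory segment


def segment_of(addr):
    """Map an address to the segment that contains it."""
    return addr // SEGMENT_SIZE


def owner_of(seg, num_procs):
    """Assumed static ownership: segment s belongs to processor s % num_procs."""
    return seg % num_procs


def trace(graph, roots, num_procs, num_segments):
    """Simulate memory-oriented tracing; return (marked, visit_log).

    graph maps an address to the list of addresses it points to.
    visit_log records (processor, address) for every visit.
    """
    # One worklist per segment, holding only nodes of that segment.
    worklists = [deque() for _ in range(num_segments)]
    # forward[i][j] is written only by processor i and read only by j.
    forward = [[deque() for _ in range(num_procs)] for _ in range(num_procs)]
    # Stands in for per-segment mark bits: an entry is only ever
    # touched by the processor that owns the node's segment.
    marked = set()
    visit_log = []

    for root in roots:
        worklists[segment_of(root)].append(root)

    def step(p):
        """Run processor p once; return True if it did any work."""
        did_work = False
        # Move nodes forwarded by other processors into local worklists.
        for sender in range(num_procs):
            inbox = forward[sender][p]
            while inbox:
                addr = inbox.popleft()
                worklists[segment_of(addr)].append(addr)
                did_work = True
        # Drain the worklists of the segments this processor owns.
        for seg in range(num_segments):
            if owner_of(seg, num_procs) != p:
                continue
            worklist = worklists[seg]
            while worklist:
                addr = worklist.pop()
                did_work = True
                if addr in marked:
                    continue  # already visited by this same owner
                marked.add(addr)
                visit_log.append((p, addr))
                for child in graph.get(addr, ()):
                    child_seg = segment_of(child)
                    target = owner_of(child_seg, num_procs)
                    if target == p:
                        worklists[child_seg].append(child)
                    else:
                        forward[p][target].append(child)
        return did_work

    # Stop after a full round in which no processor found any work.
    progress = True
    while progress:
        progress = False
        for p in range(num_procs):
            if step(p):
                progress = True
    return marked, visit_log
```

Calling `trace(graph, roots, num_procs, num_segments)` with a small adjacency dictionary returns the set of reachable addresses together with a log showing which simulated processor visited each one. Because a node is only ever marked by the owner of its segment, the mainline loop needs no lock around the mark test. A real parallel version would still need a termination protocol and properly synchronised single-reader, single-writer queues, both of which this sketch sidesteps by running sequentially.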
