Memory management for many-core processors with software configurable locality policies

As processors evolve towards higher core counts, architects will develop more sophisticated memory systems to satisfy the cores' increasing demand for memory bandwidth. Early many-core processor designs suggest that future memory systems will likely include multiple memory controllers and distributed cache coherence protocols. Many-core processors that expose memory locality policies to the software system provide opportunities for automatic tuning that can achieve significant performance benefits. Managed languages, however, typically provide only a simple heap abstraction. This paper presents techniques that bridge the gap between the simple heap abstraction of modern languages and the complicated memory systems of future processors. We present a NUMA-aware approach to garbage collection that balances the competing concerns of data locality and heap utilization to improve performance. We combine a lightweight approach for measuring an application's memory behavior with an online, adaptive algorithm that tunes the cache configuration to that behavior. We have implemented our garbage collector and cache tuning algorithm and present results on a 64-core TILEPro64 processor.
