Prefetch injection based on hardware monitoring and object metadata

Cache miss stalls hurt performance because of the large gap between memory and processor speeds. For example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium® 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between the elements of such a structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching linked data structures that closely integrates the hardware performance monitor (HPM), the garbage collector's global view of heap and object layout, the type-level metadata inherent in type-safe programs, and JIT compiler analysis. The garbage collector uses the HPM's data cache miss information to identify cache-miss-intensive traversal paths through linked data structures, and then discovers regular distances (deltas) between the objects linked along these paths. The JIT compiler then injects prefetch instructions that use these deltas to materialize prefetch addresses. We have implemented MS Delta in a fully dynamic profile-guided optimization system: the StarJIT dynamic compiler [1] and the ORP Java virtual machine [9]. We demonstrate a 28-29% reduction in stall cycles attributable to the high-latency cache misses targeted by MS Delta and a speedup of 11-14% on the cache-miss-intensive SPEC JBB2000 benchmark.
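
The delta-discovery idea can be illustrated with a minimal sketch. This is not the MS Delta implementation: the function names, the (parent address, child address) sample format, and the 50% dominance threshold are all assumptions made for illustration. The sketch takes address pairs for objects linked along a miss-intensive path, looks for one delta that dominates, and uses it to compute a prefetch address from a base object without first loading the link field.

```python
from collections import Counter


def dominant_delta(pairs, min_share=0.5):
    """Return the most frequent (child - parent) address delta, or None.

    pairs: iterable of (parent_address, child_address) samples (hypothetical format).
    min_share: assumed threshold; the delta must cover at least this
               fraction of the samples to count as "regular".
    """
    pairs = list(pairs)
    if not pairs:
        return None
    histogram = Counter(child - parent for parent, child in pairs)
    delta, count = histogram.most_common(1)[0]
    return delta if count / len(pairs) >= min_share else None


def prefetch_address(base, delta):
    """Materialize a prefetch address directly from the base object's address."""
    return base + delta


# Three of the four sampled pairs are 0x40 bytes apart, so 0x40 dominates.
samples = [(0x1000, 0x1040), (0x2000, 0x2040), (0x3000, 0x3040), (0x4000, 0x4100)]
delta = dominant_delta(samples)
```

If no delta reaches the threshold, `dominant_delta` returns `None`, and in this sketch no prefetch would be injected for that path.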

[1] Christopher Hughes, et al. Speculative precomputation: long-range prefetching of delinquent loads, 2001, ISCA 2001.

[2] James R. Larus, et al. Cache-conscious structure layout, 1999, PLDI '99.

[3] James R. Larus, et al. Using generational garbage collection to implement cache-conscious data placement, 1998, ISMM '98.

[4] Harish Patil, et al. Profile-guided post-link stride prefetching, 2002, ICS '02.

[5] Robert Fenichel, et al. A LISP garbage-collector for virtual-memory computer systems, 1969, CACM.

[6] J. Eliot B. Moss, et al. Cycles to recycle: garbage collection to the IA-64, 2000, ISMM '00.

[7] Kathryn S. McKinley, et al. Data flow analysis for software prefetching linked data structures in Java, 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[8] Rajiv Arora, et al. Java server performance: A case study of building efficient, scalable JVMs, 2000, IBM Syst. J.

[9] Mauricio J. Serrano, et al. The StarJIT compiler: a dynamic compiler for managed runtime environments, 2003.

[10] Brian T. Lewis, et al. The Open Runtime Platform: a flexible high-performance managed runtime environment, 2005, Concurr. Pract. Exp.

[11] Amer Diwan, et al. Connectivity-based garbage collection, 2003, OOPSLA '03.

[12] Paul R. Wilson, et al. Object Type Directed Garbage Collection To Improve Locality, 1992, IWMM.

[13] Richard D. Greenblatt, et al. A LISP machine, 1974, CAW '80.

[14] Brad Calder, et al. Quantifying load stream behavior, 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[15] David J. Lilja, et al. Data prefetch mechanisms, 2000, CSUR.

[16] Rakesh Krishnaiyer, et al. Value-Profile Guided Stride Prefetching for Irregular Code, 2002, CC.

[17] J. White, et al. Address/memory management for a gigantic LISP environment or, GC considered harmful, 1987, LIPO.

[18] Andrew W. Appel, et al. Creating and preserving locality of Java applications at allocation and garbage collection times, 2002, OOPSLA '02.

[19] Mauricio J. Serrano, et al. Characterizing the memory behavior of Java workloads: a structured view and opportunities for optimizations, 2001, SIGMETRICS '01.

[20] Rafael Dueire Lins, et al. Garbage collection: algorithms for automatic dynamic memory management, 1996.

[21] Chandra Krintz, et al. Cache-conscious data placement, 1998, ASPLOS VIII.

[22] Wei-Chung Hsu, et al. The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System, 2003, MICRO.

[23] Huiyang Zhou, et al. Detecting global stride locality in value streams, 2003, ISCA '03.

[24] Trishul M. Chilimbi. Efficient representations and abstractions for quantifying and exploiting data reference locality, 2001, PLDI '01.

[25] Paul R. Wilson, et al. Effective “static-graph” reorganization to improve locality in garbage-collected systems, 1991, PLDI '91.

[26] Martin Hirzel, et al. Dynamic hot data stream prefetching for general-purpose programs, 2002, PLDI '02.

[27] Toshio Nakatani, et al. Stride prefetching by dynamically inspecting objects, 2003, PLDI '03.

[28] Brian T. Lewis, et al. Improving 64-bit Java IPF performance by compressing heap references, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.

[29] J. Eliot B. Moss, et al. Sapphire: copying garbage collection without stopping the world, 2003, Concurr. Comput. Pract. Exp.

[30] Gurindar S. Sohi, et al. Effective jump-pointer prefetching for linked data structures, 1999, ISCA.

[31] Urs Hölzle, et al. A Study of the Allocation Behavior of the SPECjvm98 Java Benchmark, 1999, ECOOP.

[32] Donald Yeung, et al. Physical experimentation with prefetching helper threads on Intel's hyper-threaded processors, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.

[33] Qiang Wu, et al. Exposing memory access regularities using object-relative memory profiling, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.