Location-aware cache management for many-core processors with deep cache hierarchy

As cache hierarchies become deeper and the number of cores on a chip grows, cache management becomes increasingly important for both performance and energy. However, current hardware cache management policies do not always adapt well to an application's behavior: for example, caches may be polluted by data structures whose locality they cannot capture, and producer-consumer communication incurs multiple round trips of coherence messages per cache line transferred. We propose load and store instructions that carry hints indicating into which cache(s) the accessed data should be placed. These instructions allow software to convey locality information to the hardware at minimal hardware cost and without affecting correctness. In simulations of a 64-core system with private L1 and L2 caches, our instructions provide a 1.07× speedup and a 1.24× improvement in energy efficiency on average. With a large shared L3 cache added, the benefits increase, reaching a 1.33× energy reduction on average.
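For illustration only, the sketch below shows how software might express such placement hints through compiler intrinsics. The intrinsic names (`load_hinted`, `store_hinted`), the `cache_hint_t` levels, and their fallback to plain accesses are assumptions made for this example, not the paper's actual ISA extension; the key property mirrored here is that hints are advisory and never affect correctness.

```c
#include <stddef.h>

/* Hypothetical placement-hint levels; a real ISA extension would encode
 * these in spare bits of the load/store instruction. */
typedef enum {
    HINT_L1,      /* keep in the core's private L1 */
    HINT_L2,      /* bypass L1, keep in the private L2 */
    HINT_LLC,     /* bypass private caches, keep in the shared last-level cache */
    HINT_BYPASS   /* stream from memory, do not cache at all */
} cache_hint_t;

/* Hypothetical intrinsics standing in for hinted load/store instructions.
 * Here they fall back to ordinary accesses, so program correctness never
 * depends on the hint being honored. */
static inline double load_hinted(const double *p, cache_hint_t hint) {
    (void)hint;
    return *p;
}

static inline void store_hinted(double *p, double v, cache_hint_t hint) {
    (void)hint;
    *p = v;
}

/* Example: a streaming reduction over a large array whose elements are
 * touched exactly once, so caching them would only pollute the hierarchy. */
double sum_streaming(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += load_hinted(&a[i], HINT_BYPASS);  /* no-reuse data: skip the caches */
    return s;
}
```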
