RADAR: Runtime-assisted dead region management for last-level caches

Last-level caches (LLCs) bridge the processor/memory speed gap and reduce the energy consumed per access. Unfortunately, LLCs are poorly utilized because a large fraction of their blocks are dead, meaning they will not be referenced again before eviction. We propose RADAR, a hybrid static/dynamic dead-block management technique that accurately predicts and evicts dead blocks in LLCs. RADAR performs dead-block prediction and eviction at the granularity of address regions, which many of today's task-parallel programming models support. The runtime system combines static control-flow information about future region accesses with past region access patterns to predict dead regions accurately. It then instructs the cache to demote and eventually evict blocks belonging to such dead regions. This paper considers three RADAR schemes for predicting dead regions: a scheme that uses control-flow information provided by the programming model (Look-ahead), a history-based scheme (Look-back), and a combined scheme (Look-ahead and Look-back). Our evaluation shows that, on average, all RADAR schemes outperform state-of-the-art hardware dead-block prediction techniques, and that the combined scheme always performs best.
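The three schemes can be illustrated with a toy runtime-side predictor. This is only a minimal sketch under stated assumptions, not the algorithm evaluated in the paper: the class `DeadRegionPredictor`, its `submit`/`finish` methods, and the "reused after going idle" history rule are all hypothetical names and simplifications. Look-ahead is modeled as "no already-created task still declares this region", and Look-back as "this region has never been accessed again after it previously went idle".

```python
from collections import defaultdict


class DeadRegionPredictor:
    """Toy region-granularity dead predictor (hypothetical, for illustration)."""

    def __init__(self, scheme="combined"):
        assert scheme in ("look-ahead", "look-back", "combined")
        self.scheme = scheme
        self.pending = defaultdict(int)  # region -> tasks created but not finished
        self.went_idle = set()           # regions that once had no pending task
        self.reused_after_idle = set()   # regions accessed again after going idle

    def submit(self, regions):
        """A task declaring accesses to `regions` has been created."""
        for r in regions:
            if r in self.went_idle:
                # History: the region was used again after looking dead.
                self.reused_after_idle.add(r)
            self.pending[r] += 1

    def finish(self, regions):
        """A task has finished; return the regions predicted dead."""
        dead = []
        for r in regions:
            self.pending[r] -= 1
            no_future_use = self.pending[r] == 0        # Look-ahead signal
            if no_future_use:
                self.went_idle.add(r)
            history_says_dead = r not in self.reused_after_idle  # Look-back signal
            if self.scheme == "look-ahead":
                is_dead = no_future_use
            elif self.scheme == "look-back":
                is_dead = history_says_dead
            else:
                is_dead = no_future_use and history_says_dead
            if is_dead:
                dead.append(r)  # a real runtime would ask the LLC to demote these
        return dead
```

In this sketch the combined scheme declares a region dead only when both signals agree, so a region that was wrongly considered dead once is kept in the cache the next time no pending task is visible. Whether the paper combines the two signals in this way is an assumption.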
