Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications

Modern data center applications exhibit deep software stacks, resulting in large instruction footprints that frequently cause instruction cache misses, degrading performance, cost, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior and, furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced cache line evictions that are not handled by existing replacement policies. Existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application's complex program behavior. To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects "cache line eviction" instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve a speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) over prior work due to a mean 19% (up to 28.6%) I-cache miss reduction.
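The profile-then-inject idea can be illustrated with a toy sketch. It is not Ripple's actual analysis, which the abstract does not detail. Everything here is an assumption made for illustration: the trace format of (basic block, cache line) pairs, the function name `pick_injection_sites`, the `dead_gap` reuse-distance threshold, and the `min_confidence` cutoff that keeps injection sparing. The sketch marks a basic block as a candidate eviction site for a cache line when, in the profile, that line is usually not reused soon after the block runs.

```python
from collections import Counter, defaultdict

def pick_injection_sites(trace, dead_gap=3, min_confidence=0.75):
    """Toy profile analysis (hypothetical, not Ripple's algorithm).

    trace: list of (basic_block, cache_line) pairs in execution order.
    Returns {basic_block: [cache_line, ...]} naming the lines for which an
    eviction hint could be injected into that block.
    """
    # next_use[i] = index of the next access to the same line, or None.
    next_use = [None] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        _, line = trace[i]
        next_use[i] = last_seen.get(line)
        last_seen[line] = i

    executions = Counter()            # how often each block ran
    dead_after = defaultdict(Counter) # block -> line -> "not reused soon" count
    for i, (block, line) in enumerate(trace):
        executions[block] += 1
        nxt = next_use[i]
        # Assumption: a line is "dead" if it is never reused in the profile,
        # or only reused more than dead_gap accesses later.
        if nxt is None or nxt - i > dead_gap:
            dead_after[block][line] += 1

    sites = {}
    for block, lines in dead_after.items():
        # Inject sparingly: only where the block predicts deadness reliably.
        chosen = sorted(l for l, n in lines.items()
                        if n / executions[block] >= min_confidence)
        if chosen:
            sites[block] = chosen
    return sites

# Hypothetical profile: blocks A and B loop, then a long cold path runs.
example_trace = [("A", 1), ("B", 2), ("A", 1), ("B", 2),
                 ("C", 3), ("D", 4), ("E", 5), ("F", 6), ("A", 1)]
example_sites = pick_injection_sites(example_trace)
```

In this sketch the confidence cutoff plays the role of "sparingly": a block that is followed by quick reuse of the line on most executions is left alone, so a hint is not emitted where it would often evict a line that is still needed.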
