Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer

Modern processors support instruction fetch with the instruction cache (I-cache) and branch target buffer (BTB). Due to timing and area constraints, the I-cache and BTB must efficiently make use of their limited capacities. Blocks in the I-cache or entries in the BTB that have low potential for reuse should be replaced by more useful blocks/entries. This work explores predictive replacement policies based on reuse prediction that can be applied to both the I-cache and BTB. Using a large suite of recently released industrial traces, we show that predictive replacement policies can reduce misses in the I-cache and BTB. We introduce Global History Reuse Prediction (GHRP), a replacement technique that uses the history of past instruction addresses and their reuse behaviors to predict dead blocks in the I-cache and dead entries in the BTB. This paper describes the effectiveness of GHRP as a dead block replacement and bypass optimization for both the I-cache and BTB. For a 64KB set-associative I-cache with a 64B block size, GHRP lowers the I-cache misses per 1000 instructions (MPKI) by an average of 18% over the least-recently-used (LRU) policy on a set of 662 industrial workloads, performing significantly better than Static Re-reference Interval Prediction (SRRIP) and Sampling Dead Block Prediction (SDBP). For a 4K-entry BTB, GHRP lowers MPKI by an average of 30% over LRU, 23% over SRRIP, and 29% over SDBP.

[1]  Yale N. Patt,et al.  A comprehensive instruction fetch mechanism for a processor supporting speculative execution , 1992, MICRO 25.

[2]  James R. Goodman,et al.  Instruction Cache Replacement Policies and Organizations , 1985, IEEE Transactions on Computers.

[3]  Scott A. Mahlke,et al.  EFetch: Optimizing instruction fetch for event-driven web applications , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[4]  Babak Falsafi,et al.  Using dead blocks as a virtual victim cache , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[5]  Ulrich Mayer,et al.  Two level bulk preload branch prediction , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[6]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, ISCA 1997.

[7]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[8]  Thomas A. Ziaja,et al.  Sparc T4: A Dynamically Threaded Server-on-a-Chip , 2012, IEEE Micro.

[9]  Henry G. Dietz,et al.  Improving cache performance by selective cache bypass , 1989, [1989] Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences. Volume 1: Architecture Track.

[10]  Mahmut T. Kandemir,et al.  Leakage energy management in cache hierarchies , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[11]  Babak Falsafi,et al.  Dead-block prediction & dead-block correlating prefetchers , 2001, ISCA 2001.

[12]  Daniel A. Jiménez,et al.  The impact of delay on the design of branch predictors , 2000, MICRO 33.

[13]  Alan Jay Smith,et al.  Branch Prediction Strategies and Branch Target Buffer Design , 1995, Computer.

[14]  S. McFarling Combining Branch Predictors , 1993 .

[15]  Michael F. P. O'Boyle,et al.  IATAC: a smart predictor to turn-off L2 cache lines , 2005, TACO.

[16]  Kevin Skadron,et al.  Merging path and gshare indexing in perceptron branch prediction , 2005, TACO.

[17]  Babak Falsafi,et al.  Proactive instruction fetch , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18]  B. Fagin,et al.  Partial resolution in branch target buffers , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[19]  Arnold L. Rosenberg,et al.  Using the compiler to improve cache replacement decisions , 2002, Proceedings.International Conference on Parallel Architectures and Compilation Techniques.

[20]  Roland N. Ibbett,et al.  An Analysis of Instruction-Fetching Strategies in Pipelined Computers , 1980, IEEE Transactions on Computers.

[21]  Samira Manabi Khan,et al.  Sampling Dead Block Prediction for Last-Level Caches , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[22]  Edward S. Davidson,et al.  Reducing conflicts in direct-mapped caches with a temporality-based design , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[23]  Wen-mei W. Hwu,et al.  Run-Time Cache Bypassing , 1999, IEEE Trans. Computers.

[24]  Chyi-Chang Miao,et al.  Compiler managed micro-cache bypassing for high performance EPIC processors , 2002, MICRO.

[25]  Margaret Martonosi,et al.  Speculative Updates of Local and Global Branch History: A Quantitative Analysis , 2000, J. Instr. Level Parallelism.

[26]  Hideki Ando,et al.  A Cost-Effective Branch Target Buffer with a Two-Level Table Organization , 1999 .

[27]  Ioana Burcea,et al.  Phantom-BTB: a virtualized branch target buffer design , 2009, ASPLOS.

[28]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[29]  Margaret Martonosi,et al.  Timekeeping in the memory system: predicting and optimizing memory behavior , 2002, ISCA.

[30]  David A. Wood,et al.  Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[31]  Daniel A. Jiménez,et al.  Fast Path-Based Neural Branch Prediction , 2003, MICRO.

[32]  James E. Smith,et al.  A study of branch prediction strategies , 1981, ISCA '98.

[33]  Barry S. Fagin,et al.  Partial resolution in branch target buffers , 1995, MICRO.

[34]  Brad Burgess Samsung exynos M1 processor , 2016, 2016 IEEE Hot Chips 28 Symposium (HCS).

[35]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[36]  Per Stenström,et al.  Enhancing Last-Level Cache Performance by Block Bypassing and Early Miss Determination , 2006, Asia-Pacific Computer Systems Architecture Conference.

[37]  Mateo Valero,et al.  Fetching instruction streams , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[38]  Per Stenström,et al.  A novel approach to cache block reuse predictions , 2003, 2003 International Conference on Parallel Processing, 2003. Proceedings..

[39]  Kathryn S. McKinley,et al.  Cooperative caching with keep-me and evict-me , 2005, 9th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT'05).

[40]  Dirk Grunwald,et al.  Reducing branch costs via branch alignment , 1994, ASPLOS VI.

[41]  Yan Solihin,et al.  Counter-Based Cache Replacement and Bypassing Algorithms , 2008, IEEE Transactions on Computers.

[42]  Josep Torrellas,et al.  Optimizing instruction cache performance for operating system intensive workloads , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[43]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[44]  Babak Falsafi,et al.  SHIFT: Shared history instruction fetch for lean-core server processors , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[45]  Michael D. Smith,et al.  Procedure placement using temporal-ordering information , 1999, TOPL.

[46]  Carole-Jean Wu,et al.  SHiP: Signature-based Hit Predictor for high performance caching , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[47]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, Proceedings of the 28th Annual International Symposium on Microarchitecture.

[48]  Brad Calder,et al.  Efficient procedure mapping using cache line coloring , 1997, PLDI '97.

[49]  Jaehyuk Huh,et al.  Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[50]  Gary S. Tyson,et al.  Utilizing reuse information in data cache management , 1998, ICS '98.

[51]  Thomas F. Wenisch,et al.  RDIP: Return-address-stack Directed Instruction Prefetching , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[52]  Babak Falsafi,et al.  Confluence: Unified instruction supply for scale-out servers , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[53]  Thomas F. Wenisch,et al.  Memory coherence activity prediction in commercial workloads , 2004, WMPI '04.

[54]  Onur Mutlu,et al.  A Case for MLP-Aware Cache Replacement , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[55]  Chandra Krintz,et al.  Cache-conscious data placement , 1998, ASPLOS VIII.

[56]  W. W. Hwu,et al.  Achieving high instruction cache performance with an optimizing compiler , 1989, ISCA '89.

[57]  Daniel A. Jiménez,et al.  Dynamic branch prediction with perceptrons , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[58]  Cheng-Chieh Huang,et al.  Boomerang: A Metadata-Free Architecture for Control Flow Delivery , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[59]  Scott McFarling,et al.  Program optimization for instruction caches , 1989, ASPLOS III.

[60]  James R. Goodman,et al.  The declining effectiveness of dynamic caching for general- purpose microprocessors , 1995 .

[61]  Mateo Valero,et al.  A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality , 1995, International Conference on Supercomputing.

[62]  Chris H. Perleberg,et al.  Branch Target Buffer Design and Optimization , 1993, IEEE Trans. Computers.

[63]  Thomas F. Wenisch,et al.  Temporal instruction fetch streaming , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[64]  Onur Mutlu,et al.  Exploiting compressed block size as an indicator of future reuse , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[65]  Gary S. Tyson,et al.  Active Management of Data Caches by Exploiting Reuse Information , 1999, IEEE Trans. Computers.

[66]  Babak Falsafi,et al.  Selective, accurate, and timely self-invalidation using last-touch prediction , 2000, ISCA '00.