Paper information - Exploring Predictive Replacement Policies for Instruction Cache and Branch Target Buffer

Modern processors support instruction fetch with the instruction cache (I-cache) and branch target buffer (BTB). Due to timing and area constraints, the I-cache and BTB must efficiently make use of their limited capacities. Blocks in the I-cache or entries in the BTB that have low potential for reuse should be replaced by more useful blocks/entries. This work explores predictive replacement policies based on reuse prediction that can be applied to both the I-cache and BTB. Using a large suite of recently released industrial traces, we show that predictive replacement policies can reduce misses in the I-cache and BTB. We introduce Global History Reuse Prediction (GHRP), a replacement technique that uses the history of past instruction addresses and their reuse behaviors to predict dead blocks in the I-cache and dead entries in the BTB. This paper describes the effectiveness of GHRP as a dead block replacement and bypass optimization for both the I-cache and BTB. For a 64KB set-associative I-cache with a 64B block size, GHRP lowers the I-cache misses per 1000 instructions (MPKI) by an average of 18% over the least-recently-used (LRU) policy on a set of 662 industrial workloads, performing significantly better than Static Re-reference Interval Prediction (SRRIP) and Sampling Dead Block Prediction (SDBP). For a 4K-entry BTB, GHRP lowers MPKI by an average of 30% over LRU, 23% over SRRIP, and 29% over SDBP.
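The idea of history-based reuse prediction described in the abstract can be illustrated with a small sketch. This is not the paper's GHRP design. It is a simplified model under stated assumptions: a single table of saturating counters, a hypothetical XOR hash of a global address history with the block address, LRU as the fallback when no block is predicted dead, and no bypass. All names (`ReusePredictor`, `PredictiveSet`, `mpki`) and parameter values are illustrative and do not come from the paper.

```python
def mpki(misses, instructions):
    """Misses per 1000 instructions, the metric quoted in the abstract."""
    return 1000.0 * misses / instructions


class ReusePredictor:
    """Saturating counters indexed by a hash of global history and address.

    Assumption: one table and a simple XOR hash; sizes are arbitrary.
    """

    def __init__(self, table_bits=12, counter_max=3, threshold=2, history_bits=16):
        self.mask = (1 << table_bits) - 1
        self.counters = [0] * (1 << table_bits)
        self.counter_max = counter_max
        self.threshold = threshold
        self.history = 0
        self.history_mask = (1 << history_bits) - 1

    def signature(self, addr):
        # Hypothetical hash: combine the address with recent address history.
        return (addr ^ self.history) & self.mask

    def update_history(self, addr):
        # Shift a few low address bits into the global history register.
        self.history = ((self.history << 4) ^ (addr & 0xF)) & self.history_mask

    def predict_dead(self, sig):
        return self.counters[sig] >= self.threshold

    def train(self, sig, dead):
        # Count up when a block was evicted without reuse, down when it was reused.
        if dead:
            self.counters[sig] = min(self.counter_max, self.counters[sig] + 1)
        else:
            self.counters[sig] = max(0, self.counters[sig] - 1)


class PredictiveSet:
    """One cache (or BTB) set: evict a predicted-dead block, else the LRU block."""

    def __init__(self, ways, predictor):
        self.ways = ways
        self.predictor = predictor
        self.blocks = []  # (tag, signature) pairs; index 0 is LRU, last is MRU

    def access(self, tag):
        p = self.predictor
        sig = p.signature(tag)
        p.update_history(tag)
        for i, (t, old_sig) in enumerate(self.blocks):
            if t == tag:
                p.train(old_sig, dead=False)   # the block was reused
                self.blocks.pop(i)
                self.blocks.append((tag, sig))  # move to MRU with new signature
                return True
        if len(self.blocks) == self.ways:
            # Prefer the first block predicted dead; otherwise fall back to LRU.
            victim = next(
                (i for i, (_, s) in enumerate(self.blocks) if p.predict_dead(s)), 0
            )
            _, victim_sig = self.blocks.pop(victim)
            p.train(victim_sig, dead=True)      # evicted without further reuse
        self.blocks.append((tag, sig))
        return False
```

A driver would call `access` for each fetched block address (or each branch address, for a BTB), count the `False` results as misses, and pass that count to `mpki` together with the number of instructions simulated.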
