A Cost-Effective Entangling Prefetcher for Instructions

Prefetching instructions into the instruction cache is a fundamental technique for designing high-performance computers. There are three key properties to consider when designing an efficient and effective prefetcher: timeliness, coverage, and accuracy. Timeliness is essential: bringing instructions in too early increases the risk that they are evicted from the cache before their use, and requesting them too late can lead to the instructions arriving after they are demanded. Coverage is important to reduce the number of instruction cache misses, and accuracy is important to ensure that the prefetcher does not pollute the cache or interact negatively with other hardware mechanisms. This paper presents the Entangling Prefetcher for Instructions, which entangles instructions to maximize timeliness. The prefetcher works by finding which instruction should trigger the prefetch for a subsequent instruction, accounting for the latency of each cache miss. The prefetcher is carefully adjusted to account for both coverage and accuracy. Our evaluation shows that with 40KB of storage, Entangling can increase performance by up to 23%, outperforming state-of-the-art prefetchers.
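
The core idea of choosing a trigger that accounts for miss latency can be illustrated with a minimal sketch. This is not the paper's hardware design: the class `EntanglingSketch`, its method names, the history size, and the cycle-based timestamps are all hypothetical, chosen only to show how a missed cache line might be paired with an earlier line that was accessed at least one miss latency before the demand.

```python
from collections import deque


class EntanglingSketch:
    """Toy model (hypothetical): pair each missed line with an earlier 'source' line."""

    def __init__(self, history_size=16):
        # Recent accesses as (line_address, access_cycle), oldest first.
        self.history = deque(maxlen=history_size)
        # Source line -> set of destination lines to prefetch when the source is accessed.
        self.entangled = {}

    def on_access(self, line, cycle):
        """Record an access and return the lines this access should prefetch."""
        self.history.append((line, cycle))
        return sorted(self.entangled.get(line, ()))

    def on_miss_resolved(self, line, demand_cycle, latency):
        """Entangle `line` with the most recent line accessed early enough
        that a prefetch issued then would have arrived by `demand_cycle`."""
        deadline = demand_cycle - latency
        for src, cycle in reversed(self.history):
            if cycle <= deadline and src != line:
                self.entangled.setdefault(src, set()).add(line)
                return src
        return None  # no access in the history is old enough


p = EntanglingSketch()
p.on_access(0xA, 0)
p.on_access(0xB, 50)
p.on_access(0xC, 90)
p.on_access(0xD, 100)                    # demanded at cycle 100, misses
src = p.on_miss_resolved(0xD, 100, 40)   # a prefetch must start by cycle 60
print(hex(src))                          # 0xb: latest access at or before cycle 60
print(p.on_access(0xB, 200))             # [13]: accessing 0xB now triggers 0xD
```

In this sketch, line 0xC is the closest predecessor of the miss, but triggering there would leave only 10 cycles to cover a 40-cycle latency, so the earlier line 0xB is selected instead. Picking the most recent line that still meets the deadline is one way to avoid prefetching earlier than necessary, which relates to the eviction risk described above.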
