Reducing Load Latency with Cache Level Prediction

High load latency, which results from deep cache hierarchies and relatively slow main memory, is an important limiter of single-thread performance. Data prefetching helps reduce this latency by moving data up the hierarchy before load instructions request it. However, data prefetching has been shown to be imperfect in many situations. We propose cache-level prediction to complement prefetchers. Our method predicts which level of the memory hierarchy a load will access, allowing the memory access to start earlier and thereby saving many cycles. The predictor provides high prediction accuracy at the cost of just one cycle of added latency on L1 misses. Experimental results show a speedup of 7.8% on generic, graph, and HPC applications over a baseline with aggressive prefetchers.
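
To make the idea of predicting a load's hierarchy level concrete, the following is a minimal software sketch, not the predictor proposed in the paper. It assumes a hypothetical table indexed by the load's program counter that simply remembers which level served that load last time. The names `Level`, `LastLevelPredictor`, and `accuracy`, the table size, and the last-value policy are all illustrative assumptions.

```python
from enum import IntEnum


class Level(IntEnum):
    """Memory hierarchy levels a load can be served from (assumed 3-level cache)."""
    L1 = 0
    L2 = 1
    L3 = 2
    MEM = 3


class LastLevelPredictor:
    """Toy predictor: remembers the level that last served each load PC.

    This last-value policy is an assumption for illustration only.
    """

    def __init__(self, entries=1024):
        self.entries = entries
        # Default guess for an unseen load: it hits in L1.
        self.table = [Level.L1] * entries

    def _index(self, pc):
        # Drop the low bits of the PC, then fold into the table (hypothetical hash).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, actual):
        # Train with the level that actually served the load.
        self.table[self._index(pc)] = actual


def accuracy(predictor, trace):
    """Fraction of loads in `trace` (pairs of pc, actual level) predicted correctly."""
    correct = 0
    for pc, actual in trace:
        if predictor.predict(pc) == actual:
            correct += 1
        predictor.update(pc, actual)
    return correct / len(trace)


# Synthetic trace: one load that always goes to memory, one that always hits L1.
trace = [(0x400, Level.MEM)] * 4 + [(0x404, Level.L1)] * 4
result = accuracy(LastLevelPredictor(), trace)
print(result)
```

On this synthetic trace only the first access of the memory-bound load is mispredicted. In hardware, a correct prediction of a lower level would let the access to that level begin without waiting for the lookups above it to miss, which is the source of the cycle savings described in the abstract.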
