Evaluation of Hardware Data Prefetchers on Server Processors

Data prefetching, i.e., predicting an application’s future memory accesses and fetching those that are not in the on-chip caches, is a well-known and widely used approach to hiding the long latency of memory accesses. The value of data prefetching is evident to both industry and academia: almost every high-performance processor today incorporates a few data prefetchers to capture the various access patterns of applications, and the research literature contains a myriad of data prefetching proposals, each improving the efficiency of prefetching in a specific way. In this survey, we evaluate the effectiveness of data prefetching in the context of server applications and shed light on its design trade-offs. To do so, we choose a target architecture based on a contemporary server processor and stack various state-of-the-art data prefetchers on top of it. We analyze the prefetchers in terms of their ability to predict memory accesses and enhance overall system performance, as well as the overheads they impose. Finally, by comparing the state-of-the-art prefetchers with impractical ideal prefetchers, we motivate further work on improving data prefetching techniques.
