RnR: A Software-Assisted Record-and-Replay Hardware Prefetcher

Applications with irregular memory access patterns do not benefit from the memory hierarchy as much as applications with good locality do. A relatively high miss ratio and long memory access latency can cause the processor to stall and degrade system performance. Prefetching can help hide the miss penalty by predicting which memory addresses will be accessed in the near future and issuing memory requests ahead of time. However, software prefetchers add instruction overhead, whereas hardware prefetchers cannot predict irregular memory access sequences with high accuracy. Fortunately, in many important irregular applications (e.g., iterative solvers, graph algorithms, and sparse matrix-vector multiplication), memory access sequences repeat over multiple iterations or program phases. When the patterns are long, a conventional spatial-temporal prefetcher cannot achieve high prefetching accuracy, but programmers can identify these repeating patterns. In this work, we propose a software-assisted hardware prefetcher that targets repeating irregular memory access patterns in data structures that cannot benefit from conventional hardware prefetchers. The key idea is to provide a programming interface that records the cache miss sequence on the first appearance of a memory access pattern and prefetches by replaying that sequence on subsequent repeats. The proposed Record-and-Replay (RnR) prefetcher provides a lightweight software interface so that programmers can specify in the application code: 1) which data structures have irregular memory accesses, 2) when to start recording, and 3) when to start replaying (prefetching). We evaluate three irregular workloads with different inputs. For the evaluated workloads and inputs, the proposed RnR prefetcher achieves on average a 2.16× speedup for graph applications and a 2.91× speedup for an iterative solver with a sparse matrix-vector multiplication kernel. By leveraging the programmers' knowledge, the proposed RnR prefetcher achieves over 95% prefetching accuracy and miss coverage.
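The record-and-replay idea can be illustrated with a small software model. The sketch below is not the paper's hardware design or its programming interface: the names `ToyRnR`, `start_record`, `start_replay`, `load`, and `misses_per_iteration`, the fully associative LRU cache, the prefetch `window`, and the way replay is paced (advancing when a demand access matches the next recorded address) are all assumptions made for illustration. It only shows the three things the programmer specifies: a data-structure region, when recording starts, and when replay starts.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by cache-line address."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Demand access: return True on a hit; fill the line on a miss."""
        hit = line in self.lines
        self.fill(line)
        return hit

    def fill(self, line):
        """Insert or refresh a line, evicting the least recently used one."""
        self.lines[line] = True
        self.lines.move_to_end(line)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


class ToyRnR:
    """Hypothetical record-and-replay model (an assumption, not the paper's API)."""

    def __init__(self, cache, region, window=2):
        self.cache = cache
        self.region = region      # 1) lines of the irregular data structure
        self.window = window      # how many recorded lines to run ahead
        self.mode = "idle"
        self.recorded = []        # miss sequence captured while recording
        self.pos = 0              # next recorded line expected from the program
        self.issued = 0           # number of recorded lines already prefetched

    def start_record(self):       # 2) programmer marks the first appearance
        self.mode, self.recorded = "record", []

    def start_replay(self):       # 3) programmer marks a repeat of the pattern
        self.mode, self.pos, self.issued = "replay", 0, 0
        for _ in range(self.window):
            self._prefetch_next()

    def _prefetch_next(self):
        if self.issued < len(self.recorded):
            self.cache.fill(self.recorded[self.issued])
            self.issued += 1

    def load(self, line):
        """Demand load; returns True on a cache hit."""
        hit = self.cache.access(line)
        if line in self.region:
            if self.mode == "record" and not hit:
                self.recorded.append(line)
            elif (self.mode == "replay" and self.pos < len(self.recorded)
                  and line == self.recorded[self.pos]):
                self.pos += 1
                self._prefetch_next()   # keep the prefetch window full
        return hit


def misses_per_iteration(trace, iterations, use_rnr):
    """Run the same irregular trace repeatedly and count misses per iteration."""
    rnr = ToyRnR(LRUCache(capacity=4), region=range(0, 16), window=2)
    result = []
    for i in range(iterations):
        if use_rnr:
            if i == 0:
                rnr.start_record()
            else:
                rnr.start_replay()
        result.append(sum(not rnr.load(line) for line in trace))
    return result


# An irregular but repeating sequence of line addresses (made-up data).
TRACE = [7, 2, 9, 4, 1, 8, 3, 6, 0, 5]
```

In this toy setup the trace touches ten distinct lines while the cache holds four, so without prefetching every access misses in every iteration. Calling `misses_per_iteration(TRACE, 3, use_rnr=True)` records the ten misses of the first iteration and then replays them, so the later iterations hit in the cache. Real workloads, cache organizations, and replay timing are far more involved than this model suggests.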
