Graph Prefetching Using Data Structure Knowledge

Searches on large graphs are heavily memory-latency bound, as a result of many high-latency DRAM accesses. Because the access patterns involved are highly irregular, caches and prefetchers, both hardware and software, perform poorly on graph workloads, and the CPU stalls for the majority of the time. However, in many cases the data access pattern is well defined and predictable in advance, with many falling into a small set of simple patterns. Although existing implicit prefetchers cannot bring significant benefit, a prefetcher armed with knowledge of the data structures and access patterns could accurately anticipate applications' traversals and bring in the appropriate data. This paper presents the design of an explicitly configured prefetcher that improves performance for breadth-first searches and sequential iteration on the efficient and commonly used compressed sparse row graph format. By snooping L1 cache accesses from the core and reacting to data returned from its own prefetches, the prefetcher can schedule timely loads of data before the application needs it. For a range of applications and graph sizes, our prefetcher achieves average speedups of 2.3x, and up to 3.3x, with little impact on memory bandwidth requirements.
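To make the access pattern concrete, the following is a minimal software sketch of a breadth-first search over a compressed sparse row graph. It is not the paper's prefetcher or its code; the names `bfs_csr`, `row_offsets`, `col_indices`, and `work_queue` are illustrative assumptions. The comments mark the chain of dependent loads (work queue, then vertex offsets, then edge list, then per-vertex state) that a prefetcher with knowledge of the data structure could in principle follow ahead of the core.

```python
from collections import deque


def bfs_csr(row_offsets, col_indices, source):
    """Return the BFS depth of every vertex from `source` (-1 if unreachable).

    The graph is in compressed sparse row form (names are hypothetical):
    the out-edges of vertex v are col_indices[row_offsets[v]:row_offsets[v + 1]].
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    work_queue = deque([source])

    while work_queue:
        v = work_queue.popleft()            # sequential read of the work queue
        start = row_offsets[v]              # indirect read, indexed by queue data
        end = row_offsets[v + 1]
        for e in range(start, end):
            n = col_indices[e]              # sequential read within the edge list
            if depth[n] == -1:              # indirect read, indexed by edge data
                depth[n] = depth[v] + 1
                work_queue.append(n)
    return depth


# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3; vertex 4 is isolated.
row_offsets = [0, 2, 3, 4, 4, 4]
col_indices = [1, 2, 3, 3]
depths = bfs_csr(row_offsets, col_indices, 0)
```

Each load address in the loop body is either sequential or computed from the value returned by the previous load, which is why the pattern looks irregular to an address-based prefetcher yet is predictable given the layout of the arrays.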
