Optimizing Cache Performance for Graph Analytics

Modern hardware systems are heavily underutilized when running large-scale graph applications. While many in-memory graph frameworks have made substantial progress in optimizing these applications, we show that it is still possible to achieve up to $4\times$ speedups over the fastest frameworks by greatly improving cache utilization. Previous systems have applied out-of-core processing techniques from the memory/disk boundary to the cache/DRAM boundary. However, we find that blindly applying such techniques is ineffective because the performance gap between DRAM and cache is much smaller than the gap between disk and memory. We present two techniques that take advantage of the cache with minimal or no instruction overhead. The first, frequency-based clustering, groups together frequently accessed vertices to improve the utilization of each cache line with no runtime overhead. The second, CSR segmenting, partitions the graph to restrict all random accesses to the cache, makes all DRAM accesses sequential, and merges partition results using a very low-overhead cache-aware merge. Both techniques can be easily implemented on top of optimized graph frameworks. Combined, our techniques give speedups of up to $4\times$ for PageRank, Label Propagation, and Collaborative Filtering, and $2\times$ for Betweenness Centrality over the best published results.
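The two techniques can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes vertex degree is a reasonable proxy for access frequency, uses an above-average-degree threshold to pick the "hot" vertices, and models a segment as a contiguous range of source vertex IDs small enough to fit in cache. The names `frequency_cluster`, `build_segments`, `segmented_pull_sum`, and the `segment_size` parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def frequency_cluster(degrees):
    """Return new_id[old_id], placing above-average-degree vertices first.

    Assumption: degree approximates how often a vertex's data is read.
    The relative order inside the hot and cold groups is preserved.
    """
    avg = sum(degrees) / len(degrees)
    hot = [v for v, d in enumerate(degrees) if d > avg]
    cold = [v for v, d in enumerate(degrees) if d <= avg]
    new_id = [0] * len(degrees)
    for new, old in enumerate(hot + cold):
        new_id[old] = new
    return new_id


def build_segments(edges, num_vertices, segment_size):
    """Group (src, dst) edges by the segment that contains src.

    Each segment maps dst -> list of sources, and all of its sources lie
    in one contiguous ID range of at most segment_size vertices.
    """
    num_segments = (num_vertices + segment_size - 1) // segment_size
    segments = [defaultdict(list) for _ in range(num_segments)]
    for src, dst in edges:
        segments[src // segment_size][dst].append(src)
    return segments


def segmented_pull_sum(segments, values, num_vertices):
    """Sum source values into each destination, one segment at a time.

    While a segment is processed, reads of `values` touch only that
    segment's source range; partial sums are then merged into `total`.
    """
    total = [0.0] * num_vertices
    for seg in segments:
        partial = {dst: sum(values[s] for s in srcs) for dst, srcs in seg.items()}
        for dst, x in partial.items():
            total[dst] += x  # merge step
    return total


if __name__ == "__main__":
    edges = [(0, 1), (2, 1), (3, 1), (1, 0), (3, 2)]
    values = [1.0, 2.0, 3.0, 4.0]
    segs = build_segments(edges, num_vertices=4, segment_size=2)
    print(segmented_pull_sum(segs, values, num_vertices=4))
    print(frequency_cluster([1, 5, 1, 4, 1]))
```

In a real framework the segments would be stored as separate CSR arrays and `segment_size` would be derived from the last-level cache size; the dictionaries here only keep the sketch short.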
