Making pull-based graph processing performant

Graph processing engines that follow either the push-based or the pull-based pattern conceptually consist of a two-level nested loop structure. Parallelizing and vectorizing these loops is critical for high overall performance and memory bandwidth utilization. Outer loop parallelization is simple for both engine types but suffers from high load imbalance. This work focuses on inner loop parallelization for pull engines, which, when performed naively, leads to a significant increase in conflicting memory writes that must be synchronized. Our first contribution is a scheduler-aware interface for parallel loops that allows us to optimize for the common case in which each thread executes several consecutive iterations. This eliminates most write traffic and avoids all synchronization, leading to speedups of up to 50X. Our second contribution is the Vector-Sparse format, which addresses the obstacles to vectorization that stem from the commonly used Compressed-Sparse data structure. Our new format eliminates unaligned memory accesses and bounds checks within vector operations, two common problems when processing low-degree vertices. Vectorization with Vector-Sparse leads to speedups of up to 2.5X. Our contributions are embodied in Grazelle, a hybrid graph processing framework. On a server equipped with four Intel Xeon E7-4850 v3 processors, Grazelle outperforms Ligra, Polymer, GraphMat, and X-Stream by up to 15.2X, 4.6X, 4.7X, and 66.8X, respectively.
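The idea behind the first contribution can be illustrated with a minimal sequential sketch. It is not Grazelle's implementation: the function name `pull_sum_chunked`, the flat `dst`/`src` edge arrays, the `merge_buffer` list, and the choice of a simple sum as the aggregation are all assumptions made for illustration. Each chunk stands in for the run of consecutive inner-loop iterations that a scheduler hands to one thread. Because edges are assumed sorted by destination, a chunk can accumulate in a local variable and write a destination's result once, and only the last destination of a chunk (which may continue into the next chunk) is deferred to a small buffer that is merged afterward.

```python
from typing import List, Tuple


def pull_sum_chunked(dst: List[int], src: List[int], values: List[float],
                     num_vertices: int, chunk_size: int) -> List[float]:
    """Sum values[src[e]] into out[dst[e]] over all edges e, chunk by chunk.

    Assumes edges are sorted by destination. All names are hypothetical.
    """
    out = [0.0] * num_vertices
    # One (vertex, partial sum) entry per chunk; hypothetical merge buffer.
    merge_buffer: List[Tuple[int, float]] = []

    # Each chunk models the consecutive iterations given to one thread.
    for start in range(0, len(dst), chunk_size):
        end = min(start + chunk_size, len(dst))
        current, acc = dst[start], 0.0
        for e in range(start, end):
            if dst[e] != current:
                # The destination changed inside the chunk, so no later chunk
                # touches `current`: one plain write, no atomic update needed.
                out[current] = acc
                current, acc = dst[e], 0.0
            acc += values[src[e]]
        # The last destination may continue in the next chunk, so defer it.
        merge_buffer.append((current, acc))

    # Short merge pass: fold the per-chunk partial sums into the result.
    for vertex, partial in merge_buffer:
        out[vertex] += partial
    return out


if __name__ == "__main__":
    dst = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
    src = [1, 2, 3, 0, 2, 0, 1, 3, 3, 0]
    values = [1.0, 2.0, 3.0, 4.0]
    print(pull_sum_chunked(dst, src, values, num_vertices=4, chunk_size=4))
    # prints [9.0, 4.0, 11.0, 1.0]
```

In this sketch every destination vertex receives at most one direct write, and the number of deferred updates equals the number of chunks rather than the number of edges, which is the sense in which chunk awareness reduces write traffic. A parallel version would run the chunks concurrently and perform the merge pass after they finish.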
