Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems

The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph algorithms also entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory access ratio. Moreover, most real-world graphs have a highly heterogeneous node degree distribution, so partitioning these graphs for parallel processing while simultaneously achieving access locality and load balancing is difficult. This work starts from the hypothesis that hybrid platforms (e.g., GPU-accelerated systems) have the potential both to cope with the heterogeneous structure of real graphs and to offer a cost-effective platform for high-performance graph processing. This work assesses this hypothesis and presents an extensive exploration of the opportunity to harness hybrid systems to process large-scale graphs efficiently. In particular, (i) we present a performance model that estimates the achievable performance on hybrid platforms; (ii) informed by the performance model, we design and develop TOTEM, a processing engine that provides a convenient environment to implement graph algorithms on hybrid platforms; (iii) we show that further performance gains can be extracted using partitioning strategies that aim to produce partitions, each of which matches the strengths of the processing element it is allocated to; and finally, (iv) we demonstrate the performance advantages of the hybrid system through a comprehensive evaluation that uses real and synthetic workloads (as large as 16 billion edges), multiple graph algorithms that stress the system in various ways, and a variety of hardware configurations.
