On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest

Graph processing has gained renewed attention. The increasingly large scale and wealth of connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable information from large-scale graphs. Hybrid systems that host processing units optimized for both fast sequential processing and bulk processing (e.g., GPU-accelerated systems) have the potential to cope with the heterogeneous structure of real graphs and enable high-performance graph processing. Reaching this point, however, poses multiple challenges. The heterogeneity of the processing elements (e.g., GPUs implement a different parallel processing model than CPUs and have much less memory) and the inherent irregularity of graph workloads require careful graph partitioning and load assignment. In particular, the workload generated by a partitioning scheme should match the strength of the processing element the partition is allocated to. This work explores the feasibility and quantifies the performance gains of low-cost partitioning schemes that achieve such a match. We propose to partition the workload between the two types of processing elements based on vertex connectivity. We show that such partitioning schemes offer a simple yet efficient way to boost the overall performance of the hybrid system. Our evaluation illustrates that processing a 4-billion-edge graph on a system with one CPU socket and one GPU, while offloading as little as 25% of the edges to the GPU, achieves a 2x performance improvement over state-of-the-art implementations running on a dual-socket symmetric system. Moreover, for the same graph, a hybrid system with two CPU sockets and two GPUs is capable of 1.13 billion breadth-first search traversed edges per second, a performance rate that is competitive with the latest entries in the Graph500 list, yet at a much lower price point.
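The idea of splitting a graph between a CPU and a GPU by vertex connectivity can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `partition_by_degree`, the adjacency-list input, and the `low_degree_to_gpu` flag are all hypothetical. The abstract does not state which end of the degree distribution goes to which processor, so the direction is left as a parameter. The sketch sorts vertices by degree and assigns them to the GPU until a target fraction of the edges (for example 25%) has been offloaded.

```python
from typing import Dict, List, Set, Tuple


def partition_by_degree(
    adj: Dict[int, List[int]],
    gpu_edge_fraction: float = 0.25,
    low_degree_to_gpu: bool = True,  # assumption: direction is not given in the abstract
) -> Tuple[Set[int], Set[int]]:
    """Split vertices into (cpu, gpu) sets using vertex degree only.

    Vertices are visited in degree order and moved to the GPU partition
    until adding the next vertex would exceed the edge budget.
    """
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    # Each adjacency-list entry counts as one edge owned by its source vertex.
    total_edges = sum(degree.values())
    budget = gpu_edge_fraction * total_edges

    # Ties are broken by vertex id so the result is deterministic.
    order = sorted(adj, key=lambda v: (degree[v], v), reverse=not low_degree_to_gpu)

    gpu: Set[int] = set()
    used = 0
    for v in order:
        if used + degree[v] > budget:
            break
        gpu.add(v)
        used += degree[v]

    cpu = set(adj) - gpu
    return cpu, gpu


if __name__ == "__main__":
    # Small hub-and-spoke example: vertex 0 is the high-degree hub.
    example = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0, 5], 5: [4]}
    cpu_part, gpu_part = partition_by_degree(example, gpu_edge_fraction=0.25)
    print("CPU:", sorted(cpu_part))
    print("GPU:", sorted(gpu_part))
```

Because the scheme needs only a degree count and a sort, its cost is small compared with general-purpose graph partitioners, which is the sense in which such a partitioning can be called low-cost.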
