A Distributed Multi-GPU System for Fast Graph Processing

We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively. Lux also uses a novel dynamic load balancing strategy that is cheap and achieves good load balance across GPUs. In addition, we present a performance model that quantitatively predicts the execution times and automatically selects the runtime configurations for Lux applications. Experiments show that Lux achieves up to 20× speedup over state-of-the-art shared memory systems and up to two orders of magnitude speedup over distributed systems.

PVLDB Reference Format: Zhihao Jia, Yongkee Kwon, Galen Shipman, Pat McCormick, Mattan Erez, and Alex Aiken. A Distributed Multi-GPU System for Fast Graph Processing. PVLDB, 11(3): 297-310, 2017. DOI: 10.14778/3157794.3157799
