A Lightweight Communication Runtime for Distributed Graph Analytics

Distributed-memory multi-core clusters enable in-memory processing of very large graphs with billions of nodes and edges. Recent distributed graph analytics systems have been built on top of MPI. However, communication in graph applications is highly irregular: each host exchanges different amounts of non-contiguous data with the other hosts. MPI does not support this communication pattern well, and it has limited ability to integrate communication with serialization, deserialization, and graph computation tasks. In this paper, we describe LCI, a lightweight communication runtime that supports a large number of threads on each host and avoids the semantic mismatches between the requirements of graph computations and the communication library in MPI. The implementation of LCI is informed by lessons learnt from two baseline MPI-based implementations. We have successfully integrated LCI with two state-of-the-art graph analytics systems, Gemini and Abelian. Compared to MPI-based solutions, LCI reduces latency by up to 3.5x in microbenchmarks and improves the end-to-end performance of distributed graph algorithms by up to 2x.
