Accelerating Distributed GNN Training by Codes

Emerging graph neural networks (GNNs) have recently attracted much attention and have been used extensively in many real-world applications thanks to their powerful ability to express unstructured data. Real-world graph datasets can be very large, containing up to billions of nodes and tens of billions of edges, so training a GNN on them usually requires a distributed system. As a result, data communication between machines becomes the bottleneck of GNN computation. Our profiling results show that fetching attributes from remote machines during the sampling phase occupies more than 75% of the training time. To address this issue, this article proposes the Coded Neighbor Sampling (CNS) framework, which introduces a coding technique to reduce the communication overhead of GNN training. In the proposed CNS framework, the coding technique is coupled with the GNN sampling method to exploit the data excess among different machines caused by the unstructured nature of graph data. An analytical performance model is built for the proposed CNS framework; its results are corroborated by simulation and validate the benefit of CNS over both the conventional GNN training method and the conventional coding technique. Performance metrics of the proposed CNS framework, including communication overhead, runtime, and throughput, are evaluated on a distributed GNN training simulation system implemented on the MPI4py platform. The results show that, on average, the proposed CNS framework saves communication overhead by 40.6%, 35.5%, and 16.5%, reduces runtime by 12.1%, 17.0%, and 10.0%, and improves throughput by 16.2%, 24.4%, and 11.2% when training GNN models on Cora, PubMed, and Large Taobao, respectively.
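The coded-multicast idea that CNS builds on can be illustrated with a short mpi4py sketch. This is a minimal toy example under assumed conditions, not the authors' CNS implementation: a hypothetical "server" rank holds the feature table, and two worker ranks each request a feature vector that the other already caches locally, so a single XOR-coded broadcast replaces two unicast transfers. The node names, cache layout, and feature dimension are assumptions made purely for illustration.

# coded_fetch_sketch.py -- toy illustration of coded multicast for remote feature fetching.
# Assumed setup (NOT the paper's CNS code): rank 0 stores all features; rank 1 needs
# node B's features but caches node A's; rank 2 needs A's but caches B's.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
DIM = 8  # feature dimension (assumed)

if rank == 0:
    # Feature store on the "server" machine (toy values).
    feat = {"A": np.arange(DIM, dtype=np.float32),
            "B": np.arange(DIM, 2 * DIM, dtype=np.float32)}
    # Coded packet: byte-wise XOR of the two requested feature vectors.
    coded = np.bitwise_xor(feat["A"].view(np.uint8), feat["B"].view(np.uint8))
else:
    coded = np.empty(DIM * 4, dtype=np.uint8)  # 4 bytes per float32 element

# One broadcast of the coded packet serves both requesters at once.
comm.Bcast(coded, root=0)

if rank == 1:
    cached_A = np.arange(DIM, dtype=np.float32)           # locally cached copy of A
    recovered_B = np.bitwise_xor(coded, cached_A.view(np.uint8)).view(np.float32)
    print("rank 1 decoded B:", recovered_B)
elif rank == 2:
    cached_B = np.arange(DIM, 2 * DIM, dtype=np.float32)  # locally cached copy of B
    recovered_A = np.bitwise_xor(coded, cached_B.view(np.uint8)).view(np.float32)
    print("rank 2 decoded A:", recovered_A)

Run with, for example, mpirun -n 3 python coded_fetch_sketch.py; each worker decodes its missing feature vector by XORing the coded packet with the copy it already caches, which is the kind of cross-machine data excess the CNS framework exploits.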
