GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism

Communication is a key bottleneck for distributed graph neural network (GNN) training. This paper proposes GNNPipe, a new approach that scales distributed full-graph training of deep GNNs. GNNPipe is the first to use layer-level model parallelism for GNN training: it partitions the GNN layers among GPUs, and each device performs the computation for a disjoint subset of consecutive layers on the whole graph. Compared to graph parallelism, in which each GPU handles a graph partition, GNNPipe reduces the communication volume by a factor of the number of GNN layers. GNNPipe overcomes the unique challenges of pipelined layer-level model parallelism on the whole graph by partitioning the graph into dependent chunks, allowing the use of historical vertex embeddings, and applying specific training techniques to ensure convergence. We also propose a hybrid approach that combines GNNPipe with graph parallelism to handle large graphs, achieve better compute resource utilization, and ensure model convergence. We build a general GNN training system that supports all three parallelism settings. Extensive experiments show that our method reduces per-epoch training time by up to 2.45x (on average 1.58x) and reduces communication volume and overhead by up to 22.89x and 27.21x (on average 8.69x and 11.60x), respectively, while achieving model accuracy and convergence speed comparable to graph parallelism.
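To make the core idea concrete, the sketch below simulates layer-level model parallelism with chunked pipelining and historical embeddings on a toy dense-adjacency GCN. It is a minimal illustration under stated assumptions, not GNNPipe's actual implementation: the stage count, chunk partitioning, and the `hist` cache layout are invented for this example, and on real hardware each pipeline stage would own one GPU with different chunks flowing through the stages concurrently.

```python
# A minimal sketch (assumptions: toy dense adjacency, CPU-only, sequential
# chunk loop standing in for an overlapped pipeline).
import torch
import torch.nn as nn

torch.manual_seed(0)
N, F, L = 12, 8, 4            # vertices, feature dim, GNN layers
NUM_STAGES, NUM_CHUNKS = 2, 3 # pipeline stages (GPUs), graph chunks

# Toy graph: symmetric random adjacency with self-loops, row-normalized.
A = (torch.rand(N, N) < 0.3).float()
A = ((A + A.t()) > 0).float() + torch.eye(N)
A = A / A.sum(dim=1, keepdim=True)

X = torch.randn(N, F)                                   # input vertex features
layers = nn.ModuleList([nn.Linear(F, F) for _ in range(L)])

# Assign disjoint subsets of consecutive layers to pipeline stages.
per_stage = L // NUM_STAGES
stage_layers = [layers[s * per_stage:(s + 1) * per_stage]
                for s in range(NUM_STAGES)]

# Historical embeddings: one cached activation table per layer boundary.
# Entries for vertices outside the current chunk are reused as-is
# (stale values from earlier chunks/epochs; zeros before first use).
hist = [X.clone()] + [torch.zeros(N, F) for _ in range(L)]
chunks = torch.arange(N).chunk(NUM_CHUNKS)

def run_stage(s, chunk, h_in):
    """Run the consecutive layers owned by stage `s` on one vertex chunk."""
    h, base = h_in, s * per_stage
    for i, layer in enumerate(stage_layers[s]):
        # Aggregate over all neighbors: fresh embeddings for the chunk,
        # historical embeddings for everything else.
        full = hist[base + i].clone()
        full[chunk] = h
        h = torch.relu(layer(A[chunk] @ full))
        hist[base + i + 1][chunk] = h.detach()          # refresh the cache
    return h

# One "epoch": chunks flow through the stages in order. A real pipeline
# would overlap chunk c at stage s+1 with chunk c+1 at stage s.
for chunk in chunks:
    h = X[chunk]
    for s in range(NUM_STAGES):
        h = run_stage(s, chunk, h)
    print(f"chunk {chunk.tolist()}: output shape {tuple(h.shape)}")
```

The intuition for the communication savings is visible here: a stage only hands one chunk's activations to the next stage at its layer boundary, instead of exchanging boundary-vertex embeddings at every layer as graph parallelism does, so cross-device traffic no longer grows with the number of GNN layers.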
