NeutronStar: Distributed GNN Training with Hybrid Dependency Management

GNN training must resolve vertex dependencies: each vertex's representation update depends on the representations of its neighbors. Existing distributed GNN systems adopt either a dependencies-cached approach or a dependencies-communicated approach. Through extensive experiments and analysis, we find that which approach performs best is determined by a combination of factors, including the input graph, the model configuration, and the underlying computing cluster environment. Supporting all GNN training workloads with a single approach therefore often yields suboptimal performance. We study the relevant factors for each GNN training job before execution and choose the best-fit approach accordingly. We propose a hybrid dependency-handling approach that adaptively combines the merits of the two approaches at runtime. Based on this hybrid approach, we further develop a distributed GNN training system, NeutronStar, which delivers high-performance GNN training automatically. NeutronStar is further empowered by effective optimizations in CPU-GPU computation and data processing. Our experimental results on a 16-node Aliyun cluster demonstrate that NeutronStar achieves 1.81X-14.25X speedups over existing GNN systems, including DistDGL and ROC.
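
To make the dependency trade-off concrete, below is a minimal Python sketch of the two strategies the abstract contrasts. This is not NeutronStar's actual implementation: the function names, the dict-based graph layout, and the memory-budget heuristic in choose_strategy are all hypothetical, and the real decision in the paper additionally weighs the graph input, model configuration, and cluster environment.

```python
# Hypothetical sketch (not NeutronStar's code) of the two dependency-handling
# strategies. Vertex embeddings are NumPy arrays; the graph is a dict mapping
# each local vertex to its in-neighbors.
import numpy as np

def aggregate_cached(h_local, h_cached, in_neighbors):
    """Dependencies-cached: embeddings of remote neighbors were replicated
    into h_cached ahead of time, so aggregation is purely local reads."""
    out = {}
    for v, nbrs in in_neighbors.items():
        msgs = [h_local[u] if u in h_local else h_cached[u] for u in nbrs]
        out[v] = np.mean(msgs, axis=0)  # mean aggregator, one common GNN choice
    return out

def aggregate_communicated(h_local, in_neighbors, fetch_remote):
    """Dependencies-communicated: embeddings of remote neighbors are pulled
    over the network (fetch_remote) at every layer instead of being cached."""
    out = {}
    for v, nbrs in in_neighbors.items():
        msgs = [h_local[u] if u in h_local else fetch_remote(u) for u in nbrs]
        out[v] = np.mean(msgs, axis=0)
    return out

def choose_strategy(num_remote_nbrs, hidden_dim, cache_mem_budget, dtype_bytes=4):
    """Toy decision rule: cache remote dependencies when the replicas fit in
    a memory budget, otherwise communicate them per layer. Illustrative only;
    it captures just the flavor of the factor-based decision."""
    cache_bytes = num_remote_nbrs * hidden_dim * dtype_bytes
    return "cached" if cache_bytes <= cache_mem_budget else "communicated"

# Example: vertex 0 depends on local vertex 1 and remote vertex 2.
h_local = {0: np.ones(4), 1: np.zeros(4)}
h_cached = {2: np.full(4, 2.0)}  # pre-replicated copy of remote vertex 2
print(aggregate_cached(h_local, h_cached, {0: [1, 2]}))
print(choose_strategy(num_remote_nbrs=10_000, hidden_dim=256,
                      cache_mem_budget=8 * 2**20))
```

The point of the hybrid design is that neither function dominates: caching trades memory and staleness-management for zero per-layer network traffic, while communicating trades bandwidth for a smaller memory footprint, so the better choice shifts with the workload.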

[1] Fei Sun et al. Graph Neural Networks in Recommender Systems: A Survey. ACM Comput. Surv., 2020.

[2] Lei Zou et al. Accelerating Triangle Counting on GPU. SIGMOD Conference, 2021.

[3] Miryung Kim et al. Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. OSDI, 2021.

[4] James Cheng et al. DGCL: An Efficient Communication Library for Distributed GNN Training. EuroSys, 2021.

[5] James Cheng et al. Seastar: Vertex-Centric Programming for Graph Neural Networks. EuroSys, 2021.

[6] Wenyuan Yu et al. FlexGraph: A Flexible and Efficient Distributed Framework for GNN Training. EuroSys, 2021.

[7] Yongchao Liu et al. GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy. ArXiv, 2021.

[8] Dhiraj D. Kalamkar et al. DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.

[9] Wen-mei W. Hwu et al. Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture. Proc. VLDB Endow., 2021.

[10] Lei Deng et al. GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs. OSDI, 2020.

[11] Lei Chen. Deep Learning and Practice with MindSpore. 2021.

[12] Anand Padmanabha Iyer et al. P3: Distributed Deep Graph Learning at Scale. OSDI, 2021.

[13] G. Karypis et al. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms (IA3), 2020.

[14] Bingsheng He et al. G3. Proc. VLDB Endow., 2020.

[15] Shen Li et al. PyTorch Distributed. Proc. VLDB Endow., 2020.

[16] Ge Yu et al. Automating Incremental and Asynchronous Evaluation for Recursive Aggregate Data Processing. SIGMOD Conference, 2020.

[17] K. Yelick et al. Reducing Communication in Graph Neural Network Training. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.

[18] Alexander Aiken et al. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc. MLSys, 2020.

[19] Ziqi Liu et al. AGL. Proc. VLDB Endow., 2020.

[20] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019.

[21] Douwe Kiela et al. Hyperbolic Graph Neural Networks. NeurIPS, 2019.

[22] Alex Smola et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. ArXiv, 2019.

[23] Yafei Dai et al. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs. USENIX ATC, 2019.

[24] Jan Eric Lenssen et al. Fast Graph Representation Learning with PyTorch Geometric. ArXiv, 2019.

[25] Chang Zhou et al. AliGraph: A Comprehensive Graph Neural Network Platform. Proc. VLDB Endow., 2019.

[26] Hao Wang et al. SEP-Graph: Finding Shortest Execution Paths for Graph Processing under a Hybrid Framework on GPU. PPoPP, 2019.

[27] Binyu Zang et al. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs. TOPC, 2019.

[28] Jure Leskovec et al. How Powerful Are Graph Neural Networks? ICLR, 2018.

[29] Philip S. Yu et al. A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2019.

[30] Pietro Liò et al. Graph Attention Networks. ICLR, 2017.

[31] Yinghui Wu et al. Parallelizing Sequential Graph Computations. ACM Trans. Database Syst., 2018.

[32] Alexander Aiken et al. A Distributed Multi-GPU System for Fast Graph Processing. Proc. VLDB Endow., 2017.

[33] Zhenguo Li et al. Graph Edge Partitioning via Neighborhood Heuristic. KDD, 2017.

[34] Jure Leskovec et al. Inductive Representation Learning on Large Graphs. NIPS, 2017.

[35] Diego Marcheggiani et al. Encoding Sentences with Graph Convolutional Networks for Semantic Role Labeling. EMNLP, 2017.

[36] Sivasankaran Rajamanickam et al. Partitioning Trillion-Edge Graphs in Minutes. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017.

[37] Max Welling et al. Semi-Supervised Classification with Graph Convolutional Networks. ICLR, 2016.

[38] Wenguang Chen et al. Gemini: A Computation-Centric Distributed Graph Processing System. OSDI, 2016.

[39] Xavier Bresson et al. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. NIPS, 2016.

[40] Yuan Yu et al. TensorFlow: A System for Large-Scale Machine Learning. OSDI, 2016.

[41] Richard S. Zemel et al. Gated Graph Sequence Neural Networks. ICLR, 2015.

[42] Charalampos E. Tsourakakis et al. FENNEL: Streaming Graph Partitioning for Massive Scale Graphs. WSDM, 2014.

[43] Jérôme Kunegis et al. KONECT: The Koblenz Network Collection. WWW, 2013.

[44] Jure Leskovec et al. Defining and Evaluating Network Communities Based on Ground-Truth. Knowledge and Information Systems, 2012.

[45] L. Takac. Data Analysis in Public Social Networks. 2012.

[46] Hosung Park et al. What Is Twitter, a Social Network or a News Media? WWW, 2010.

[47] Jure Leskovec et al. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Math., 2008.

[48] Lise Getoor et al. Collective Classification in Network Data. AI Mag., 2008.

[49] Jon M. Kleinberg et al. Group Formation in Large Social Networks: Membership, Growth, and Evolution. KDD, 2006.

[50] Vipin Kumar et al. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM J. Sci. Comput., 1998.

[51] Vipin Kumar et al. Parallel Multilevel Graph Partitioning. Proceedings of International Conference on Parallel Processing, 1996.