MGG: Accelerating Graph Neural Networks with Fine-Grained Intra-Kernel Communication-Computation Pipelining on Multi-GPU Platforms

The increasing size of input graphs for graph neural networks (GNNs) highlights the demand for multi-GPU platforms. However, existing multi-GPU GNN systems optimize computation and communication individually, following the conventional practice of scaling dense DNNs. For irregular, sparse, and fine-grained GNN workloads, such solutions miss the opportunity to jointly schedule and optimize computation and communication for high performance. To this end, we propose MGG, a novel system design to accelerate full-graph GNNs on multi-GPU platforms. The core of MGG is its dynamic software pipeline, which enables fine-grained computation-communication overlapping within a GPU kernel. Specifically, MGG introduces GNN-tailored pipeline construction and GPU-aware pipeline mapping to facilitate workload balancing and operation overlapping. MGG also incorporates an intelligent runtime with analytical modeling and optimization heuristics to dynamically improve execution performance. Extensive evaluation shows that MGG outperforms state-of-the-art full-graph GNN systems across various settings, running on average 4.41×, 4.81×, and 10.83× faster than DGL, MGG-UVM, and ROC, respectively.
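To make the pipelining idea concrete, the sketch below illustrates the general pattern of intra-kernel communication-computation overlap for neighbor aggregation: while one feature tile is being accumulated, the fetch of the next tile is already in flight. It is a minimal CUDA example assuming NVSHMEM-style one-sided GPU-to-GPU reads; the kernel name, buffer layout, and arguments are illustrative assumptions, not MGG's actual interface.

```cuda
// Minimal sketch of intra-kernel communication-computation pipelining,
// assuming NVSHMEM one-sided gets. Names and layout are hypothetical.
#include <nvshmem.h>
#include <nvshmemx.h>

// One thread block aggregates (sums) the neighbor features of one local node.
// `feat` lives on the NVSHMEM symmetric heap, partitioned across GPUs (PEs);
// `stage` is local global-memory scratch holding two dim-sized tiles per
// block; `out` is assumed zero-initialized.
__global__ void aggregate_pipelined(const float *feat, float *out,
                                    const int *ptr, const int *idx,
                                    const int *owner, float *stage, int dim) {
    int node = blockIdx.x;
    int beg = ptr[node], end = ptr[node + 1];
    if (beg == end) return;
    float *buf = stage + (size_t)blockIdx.x * 2 * dim;  // double buffer

    // Prologue: blocking fetch of the first neighbor's feature vector.
    if (threadIdx.x == 0)
        nvshmem_getmem(buf, feat + (size_t)idx[beg] * dim,
                       dim * sizeof(float), owner[beg]);
    __syncthreads();

    for (int e = beg; e < end; ++e) {
        float *cur = buf + ((e - beg) & 1) * dim;        // tile being consumed
        float *nxt = buf + (((e - beg) & 1) ^ 1) * dim;  // tile being fetched

        // Communication: issue a non-blocking get for the *next* neighbor so
        // the transfer proceeds while the current tile is consumed below.
        if (threadIdx.x == 0 && e + 1 < end)
            nvshmem_getmem_nbi(nxt, feat + (size_t)idx[e + 1] * dim,
                               dim * sizeof(float), owner[e + 1]);

        // Computation: accumulate the current tile into the output row.
        for (int d = threadIdx.x; d < dim; d += blockDim.x)
            out[(size_t)node * dim + d] += cur[d];

        // Drain the in-flight get before the next iteration reads `nxt`.
        if (threadIdx.x == 0 && e + 1 < end) nvshmem_quiet();
        __syncthreads();
    }
}
// Host side (sketch): nvshmem_init(); nvshmem_malloc() for feat; then launch
// one block per local node, e.g.
//   aggregate_pipelined<<<n_local, 128>>>(feat, out, ptr, idx, owner, stage, dim);
```

Because the fetch of tile e+1 is in flight while tile e is being accumulated, the interconnect transfer and the arithmetic proceed concurrently. MGG's pipeline construction and GPU-aware mapping generalize this pattern across warps and blocks and balance it against the irregular neighbor distribution.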

[1] Yunru Bai, et al. PiPAD: Pipelined and Parallel Dynamic GNN Training on GPUs, 2023, PPoPP.

[2] D. Mudigere, et al. EL-Rec: Efficient Large-Scale Recommendation Model Training via Tensor-Train Embedding Table, 2022, SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3] Jingren Zhou, et al. GNNLab: A Factored System for Sample-Based GNN Training over GPUs, 2022, EuroSys.

[4] Ping Luo, et al. vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training, 2022, IEEE Transactions on Parallel and Distributed Systems.

[5] Katherine A. Yelick, et al. CloudBank: Managed Services to Simplify Cloud Access for Computer Science Research and Education, 2021, PEARC.

[6] James Cheng, et al. DGCL: An Efficient Communication Library for Distributed GNN Training, 2021, EuroSys.

[7] Wen-mei W. Hwu, et al. Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture, 2021, Proc. VLDB Endow.

[8] Jinjun Xiong, et al. PyTorch-Direct: Enabling GPU Centric Data Access for Very Large Graph Neural Network Training with Irregular Accesses, 2021, arXiv.

[9] Yunxin Liu, et al. PaGraph: Scaling GNN Training on Large Graphs via Computation-Aware Caching, 2020, SoCC.

[10] D. Narayanan, et al. Memory-Efficient Pipeline-Parallel DNN Training, 2020, ICML.

[11] Lei Deng, et al. GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs, 2020, arXiv.

[12] J. Leskovec, et al. Open Graph Benchmark: Datasets for Machine Learning on Graphs, 2020, NeurIPS.

[13] Yufeng Zhang, et al. Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks, 2020, ACL.

[14] Alexander Aiken, et al. Improving the Accuracy, Scalability, and Performance of Graph Neural Networks with Roc, 2020, MLSys.

[15] Ramyad Hadidi, et al. Batch-Aware Unified Memory Management in GPUs for Irregular Workloads, 2020, ASPLOS.

[16] Dongrui Fan, et al. HyGCN: A GCN Accelerator with Hybrid Architecture, 2020, IEEE International Symposium on High Performance Computer Architecture (HPCA).

[17] Nikhil R. Devanur, et al. PipeDream: Generalized Pipeline Parallelism for DNN Training, 2019, SOSP.

[18] Alex Smola, et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs, 2019, arXiv.

[19] Yafei Dai, et al. NeuGraph: Parallel Deep Neural Network Computation on Large Graphs, 2019, USENIX ATC.

[20] Carole-Jean Wu, et al. The Architectural Implications of Facebook's DNN-Based Personalized Recommendation, 2020, IEEE International Symposium on High Performance Computer Architecture (HPCA).

[21] Yinghai Lu, et al. Deep Learning Recommendation Model for Personalization and Recommendation Systems, 2019, arXiv.

[22] Xu Liu, et al. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect, 2019, IEEE Transactions on Parallel and Distributed Systems.

[23] Jure Leskovec, et al. How Powerful Are Graph Neural Networks?, 2018, ICLR.

[24] Jure Leskovec, et al. Hierarchical Graph Representation Learning with Differentiable Pooling, 2018, NeurIPS.

[25] Cao Xiao, et al. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling, 2018, ICLR.

[26] Pietro Liò, et al. Graph Attention Networks, 2017, ICLR.

[27] Mathias Niepert, et al. Learning Graph Representations with Embedding Propagation, 2017, NIPS.

[28] Lina Yao, et al. Deep Learning Based Recommender System, 2017, ACM Comput. Surv.

[29] Jure Leskovec, et al. Inductive Representation Learning on Large Graphs, 2017, NIPS.

[30] Wenguang Chen, et al. Gemini: A Computation-Centric Distributed Graph Processing System, 2016, OSDI.

[31] Weimin Zheng, et al. Exploring the Hidden Dimension in Graph Processing, 2016, OSDI.

[32] Max Welling, et al. Semi-Supervised Classification with Graph Convolutional Networks, 2016, ICLR.

[33] Jure Leskovec, et al. node2vec: Scalable Feature Learning for Networks, 2016, KDD.

[34] Makoto Onizuka, et al. Rabbit Order: Just-in-Time Parallel Reordering for Fast Graph Analysis, 2016, IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[35] Jure Leskovec, et al. SNAP Datasets: Stanford Large Network Dataset Collection, 2014.

[36] Steven Skiena, et al. DeepWalk: Online Learning of Social Representations, 2014, KDD.

[37] Steven Derrien, et al. Runtime Dependency Analysis for Loop Pipelining in High-Level Synthesis, 2013, ACM/EDAC/IEEE Design Automation Conference (DAC).

[38] Wikipedia.

[39] Alexander Aiken, et al. Legion: Expressing Locality and Independence with Logical Regions, 2012, International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[40] Ernest Valveny, et al. Graph Embedding in Vector Spaces by Node Attribute Statistics, 2012, Pattern Recognit.

[41] Timothy A. Davis, et al. The University of Florida Sparse Matrix Collection, 2011, TOMS.

[42] Kaspar Riesen, et al. Graph Classification and Clustering Based on Vector Space Embedding, 2010, Series in Machine Perception and Artificial Intelligence.

[43] Srikanta J. Bedathur, et al. Towards Time-Aware Link Prediction in Evolving Social Networks, 2009, SNA-KDD '09.

[44] Jérôme Kunegis, et al. Learning Spectral Graph Transformations for Link Prediction, 2009, ICML '09.

[45] Wen-mei W. Hwu, et al. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU Using CUDA, 2008, PPoPP.

[46] Hsinchun Chen, et al. Link Prediction Approach to Collaborative Filtering, 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[47] Anand Padmanabha Iyer, et al. P3: Distributed Deep Graph Learning at Scale, 2021, OSDI.
