Reducing Communication in Graph Neural Network Training

Authors: Alok Tripathy, Katherine Yelick, Aydin Buluc

Abstract: Graph Neural Networks (GNNs) are powerful and flexible neural networks that use the naturally sparse connectivity information of the data. GNNs represent this connectivity as sparse matrices, which have lower arithmetic intensity and thus higher communication costs than dense matrices, making GNNs harder to scale to high concurrencies than convolutional or fully connected neural networks. We introduce a family of parallel algorithms for training GNNs and show that they can asymptotically reduce communication compared to previous parallel GNN training methods. We implement these algorithms, which are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, using torch.distributed on GPU-equipped clusters. Our algorithms optimize communication across the full GNN training pipeline. We train GNNs on over a hundred GPUs on multiple datasets, including a protein network with over a billion edges.

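To make the communication pattern concrete, the sketch below shows a broadcast-based, 1D row-partitioned sparse-dense matrix multiplication (SpMM), the kind of baseline distribution that the 1.5D, 2D, and 3D algorithms improve on asymptotically. It is an illustrative example only, not the authors' implementation: the function name spmm_1d and the block layout are assumptions, and it presumes torch.distributed has already been initialized and that the number of vertices divides evenly by the number of ranks.

    import torch
    import torch.distributed as dist

    def spmm_1d(A_local_blocks, H_local):
        # A_local_blocks[j]: this rank's rows of the sparse adjacency matrix A,
        #   restricted to the columns owned by rank j, as a sparse (n/p x n/p) tensor.
        # H_local: this rank's rows of the dense feature matrix H, shape (n/p, f).
        p = dist.get_world_size()
        rank = dist.get_rank()
        Z_local = torch.zeros_like(H_local)
        for j in range(p):
            # Rank j broadcasts its rows of H; every rank multiplies them by its
            # matching column block of A and accumulates into Z_local = (A @ H)_local.
            H_j = H_local.clone() if rank == j else torch.empty_like(H_local)
            dist.broadcast(H_j, src=j)
            Z_local += torch.sparse.mm(A_local_blocks[j], H_j)
        return Z_local

In this scheme every rank receives all other ranks' feature-matrix rows each layer; replicating or further blocking the matrices, as in the 1.5D, 2D, and 3D variants, is what reduces that communication volume.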