BiFeat: Supercharge GNN Training via Graph Feature Quantization

Graph Neural Networks (GNNs) are a promising approach for applications with non-Euclidean data. However, training GNNs on large-scale graphs with hundreds of millions of nodes is both resource- and time-consuming. Unlike DNNs, GNNs usually have larger memory footprints, so GPU memory capacity and PCIe bandwidth are the main resource bottlenecks in GNN training. To address this problem, we present BiFeat: a graph feature quantization methodology that accelerates GNN training by significantly reducing the memory footprint and PCIe bandwidth requirement, so that GNNs can take full advantage of GPU computing capabilities. Our key insight is that, unlike DNNs, GNNs are less prone to the information loss of input features caused by quantization. We identify the main factors that affect accuracy in graph feature quantization and theoretically prove that BiFeat training converges to a network whose loss is within $\epsilon$ of the optimal loss of the uncompressed network. We perform an extensive evaluation of BiFeat using several popular GNN models and datasets, including GraphSAGE on MAG240M, the largest public graph dataset. The results demonstrate that BiFeat achieves a compression ratio of more than 30 and improves GNN training speed by 200%-320% with marginal accuracy loss. In particular, BiFeat achieves a record by training GraphSAGE on MAG240M within one hour using only four GPUs.
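To make the idea of feature quantization concrete, the following is a minimal sketch of generic uniform scalar quantization applied to a vector of float32 node features. It is not BiFeat's implementation: the abstract does not specify the quantization scheme, and the function names `quantize` and `dequantize`, the 8-bit width, and the synthetic Gaussian features are all assumptions chosen for illustration.

```python
import random
from array import array


def quantize(values, bits=8):
    """Map floats to integer codes in [0, 2**bits - 1] (illustrative, bits <= 8)."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    # Guard against a constant input, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = array("B", (round((v - lo) / scale) for v in values))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Synthetic stand-in for node features (hypothetical data, float32 storage).
random.seed(0)
features = array("f", (random.gauss(0.0, 1.0) for _ in range(4096)))

codes, lo, scale = quantize(features, bits=8)
restored = dequantize(codes, lo, scale)

# Storage ratio: 4-byte floats versus 1-byte codes.
ratio = (features.itemsize * len(features)) / (codes.itemsize * len(codes))
# Worst-case reconstruction error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(features, restored))
print(f"compression ratio: {ratio:.0f}x, max error: {max_err:.4f}")
```

With one byte per code, this sketch shrinks float32 storage by a factor of 4. Reaching ratios near the one reported above would need far fewer bits per value: for example, 1-bit codes packed eight to a byte give 32x for float32 features.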
