A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent

Scaling node embedding systems to real-world networks, which often contain hundreds of billions of edges and high-dimensional node features, remains a challenging problem. In this paper we present a high-performance multi-GPU node embedding system that uses hybrid model-data parallel training. We propose a hierarchical data partitioning strategy and an embedding training pipeline that together optimize communication and memory usage on a GPU cluster. By decoupling the random walk engine from the embedding training engine, the system can schedule both stages flexibly and fully utilize all computing resources on the cluster. We evaluate the system on real-world and synthetic networks across a variety of node embedding tasks. Using 40 NVIDIA V100 GPUs on a network with over two hundred billion edges and one billion nodes, our implementation finishes one training epoch in only 200 seconds. On open datasets we also achieve a 5.9x-14.4x average speedup over the current state-of-the-art single-machine multi-GPU node embedding system, with competitive or better accuracy.
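
To make the two decoupled stages concrete, the sketch below shows the kind of workload the system scales up: a random walk sampler feeding a skip-gram model trained with negative sampling, as in DeepWalk-style node embedding. This is a minimal single-GPU sketch under stated assumptions, not the paper's implementation; every name (`random_walks`, `SkipGram`, `train`, the hyperparameters, the uniform negative sampler) is illustrative.

```python
# Minimal single-GPU sketch of DeepWalk-style node embedding training
# (random walks + skip-gram with negative sampling). Illustrative only;
# the paper distributes and pipelines these stages across a GPU cluster.
import random
import torch
import torch.nn as nn

def random_walks(adj, num_walks, walk_len):
    """Uniform random walks over a graph given as {node_id: [neighbor_ids]}."""
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append(walk)
    return walks

class SkipGram(nn.Module):
    """Skip-gram with negative sampling over integer node IDs."""
    def __init__(self, num_nodes, dim):
        super().__init__()
        self.emb_in = nn.Embedding(num_nodes, dim)   # center-node embeddings
        self.emb_out = nn.Embedding(num_nodes, dim)  # context-node embeddings

    def forward(self, center, context, negatives):
        c = self.emb_in(center)                            # (B, d)
        pos = (c * self.emb_out(context)).sum(-1)          # (B,)
        neg = torch.bmm(self.emb_out(negatives),           # (B, K, d)
                        c.unsqueeze(-1)).squeeze(-1)       # (B, K)
        # Pull observed contexts closer, push sampled negatives away.
        return (-nn.functional.logsigmoid(pos).mean()
                - nn.functional.logsigmoid(-neg).mean())

def train(adj, num_nodes, dim=64, window=2, num_neg=5, epochs=1):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = SkipGram(num_nodes, dim).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for walk in random_walks(adj, num_walks=10, walk_len=20):
            # (center, context) pairs within a sliding window over the walk.
            pairs = [(walk[i], walk[j])
                     for i in range(len(walk))
                     for j in range(max(0, i - window),
                                    min(len(walk), i + window + 1))
                     if i != j]
            if not pairs:
                continue
            center, context = (torch.tensor(t, device=device)
                               for t in zip(*pairs))
            # Uniform negative sampling; production systems typically use
            # degree-based alias tables instead.
            negatives = torch.randint(num_nodes, (len(pairs), num_neg),
                                      device=device)
            loss = model(center, context, negatives)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model.emb_in.weight.detach()

if __name__ == "__main__":
    # Toy 4-node cycle graph.
    adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    emb = train(adj, num_nodes=4)
    print(emb.shape)  # torch.Size([4, 64])
```

In the sketch, walk generation and gradient updates run in one loop; the system described above instead decouples them into separate engines, so walk sampling and embedding training can each be scaled and scheduled independently without stalling one another.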
