HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework

Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often occupy a large parameter space. We observe that existing distributed training frameworks face a scalability issue when training embedding models, since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace the skewed popularity distribution of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into the HET design, which provides fine-grained consistency guarantees on a per-embedding basis. Compared to previous work that only allows staleness for read operations, HET also utilizes staleness for write operations. Evaluations on six representative tasks show that HET achieves up to 88% reduction in embedding communication and up to 20.68× performance speedup over the state-of-the-art baselines.

PVLDB Reference Format:
Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao, Bin Cui. HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework. PVLDB, 15(2): 312-320, 2022. doi:10.14778/3489496.3489511

PVLDB Artifact Availability:
The source code of this research paper has been made publicly available at https://github.com/PKU-DAIR/Hetu/.

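To make the consistency model concrete, the sketch below illustrates one way a per-embedding, bounded-staleness cache could be organized: reads may return a locally cached embedding while it is within a staleness bound, and updates are applied locally and only pushed to the parameter server once the bound is exceeded. This is a minimal, hypothetical Python sketch; the class names (EmbeddingCache, CachedEmbedding, InMemoryServer), the pull/push/clock server interface, and the single global staleness bound are illustrative assumptions rather than HET's actual implementation or API.

# Minimal, hypothetical sketch of a per-embedding bounded-staleness cache.
# Class names and the pull/push/clock server interface are illustrative
# assumptions, not HET's actual implementation.
import numpy as np


class InMemoryServer:
    """Toy stand-in for the parameter server, only to make the sketch runnable."""

    def __init__(self, table):
        self._table = table   # key -> embedding vector (np.ndarray)
        self._clock = 0

    def clock(self):
        return self._clock

    def pull(self, key):
        return self._table[key].copy(), self._clock

    def push(self, key, update):
        self._table[key] -= update   # apply the worker's accumulated update
        self._clock += 1
        return self._clock


class CachedEmbedding:
    """Worker-side cache entry for a single embedding row."""

    def __init__(self, key, value, clock):
        self.key = key
        self.value = value        # locally cached embedding vector
        self.pending = None       # accumulated updates not yet pushed
        self.clock = clock        # server clock at the last synchronization


class EmbeddingCache:
    """Per-embedding cache that tolerates bounded staleness for reads and writes."""

    def __init__(self, server, staleness_bound):
        self.server = server
        self.bound = staleness_bound
        self.entries = {}

    def read(self, key):
        entry = self.entries.get(key)
        # A stale read is allowed while the cached copy is within the bound;
        # only refresh from the server when the bound is exceeded.
        if entry is None or self.server.clock() - entry.clock > self.bound:
            value, clock = self.server.pull(key)
            entry = CachedEmbedding(key, value, clock)
            self.entries[key] = entry
        return entry.value

    def write(self, key, grad, lr=0.01):
        entry = self.entries[key]
        delta = lr * grad
        # Apply the update locally and buffer it; push to the server only
        # when the entry becomes too stale (staleness for writes as well).
        entry.value -= delta
        entry.pending = delta if entry.pending is None else entry.pending + delta
        if self.server.clock() - entry.clock > self.bound:
            entry.clock = self.server.push(key, entry.pending)
            entry.pending = None


if __name__ == "__main__":
    server = InMemoryServer({0: np.zeros(4), 1: np.ones(4)})
    cache = EmbeddingCache(server, staleness_bound=2)
    emb = cache.read(0)                    # pulls from the server on a cache miss
    cache.write(0, grad=np.full(4, 0.5))   # applied locally, pushed only when stale

Under this kind of policy, frequently accessed ("hot") embeddings stay resident in worker caches and incur little server traffic, which is the performance opportunity the skewed popularity distribution provides.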