ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

Because of the superior feature representation ability of deep learning, various deep Click-Through Rate (CTR) models have been deployed in commercial systems by industrial companies. To achieve better performance, it is necessary to train deep CTR models on huge volumes of training data efficiently, which makes speeding up the training process an essential problem. Unlike models trained on dense data, the training data for CTR models is usually high-dimensional and sparse. To transform the high-dimensional sparse input into low-dimensional dense real-valued vectors, almost all deep CTR models adopt an embedding layer, whose parameters easily reach hundreds of GB or even TB. Since a single GPU cannot accommodate all the embedding parameters, pure data parallelism is infeasible for distributed training. Therefore, existing distributed training platforms for recommendation adopt model parallelism: they use the CPU (Host) memory of servers to maintain and update the embedding parameters and utilize GPU workers to conduct the forward and backward computations. Unfortunately, these platforms suffer from two bottlenecks: (1) the latency of pull & push operations between Host and GPU; (2) parameter update and synchronization on the CPU servers. To address these bottlenecks, in this paper we propose ScaleFreeCTR (SFCTR): a MixCache-based distributed training system for CTR models. Specifically, SFCTR still stores the huge embedding table in Host memory but utilizes the GPU instead of the CPU to conduct embedding synchronization efficiently. To reduce the latency of data transfer both between Host and GPU and between GPUs, the MixCache mechanism and the Virtual Sparse Id operation are proposed. Comprehensive experiments demonstrate the effectiveness and efficiency of SFCTR. In addition, our system will be open-sourced based on MindSpore in the near future.
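
To make the Host/GPU split concrete, below is a minimal sketch (not the paper's implementation) of a host-resident embedding table paired with a small cache of hot rows, illustrating why pull & push latency dominates when the table cannot fit on the device. All names here (HostEmbeddingTable, cache_capacity, the plain SGD update) are illustrative assumptions; the actual system keeps the cache on the GPU and synchronizes it across workers.

```python
import numpy as np

class HostEmbeddingTable:
    """Sketch of a CTR embedding table kept in Host memory with a
    bounded cache of frequently accessed ("hot") rows. Cache hits
    avoid a Host->device "pull"; updates are sparse "pushes"."""

    def __init__(self, vocab_size, dim, cache_capacity):
        # The full table lives in Host memory; it may be 100s of GB in practice.
        self.table = np.random.randn(vocab_size, dim).astype(np.float32)
        self.cache = {}                       # stands in for a device-resident cache
        self.cache_capacity = cache_capacity

    def lookup(self, sparse_ids):
        """Gather dense vectors for a batch of sparse feature ids."""
        rows = []
        for i in sparse_ids:
            if i in self.cache:               # cache hit: no Host transfer needed
                rows.append(self.cache[i])
            else:                             # cache miss: "pull" from Host memory
                row = self.table[i].copy()
                if len(self.cache) < self.cache_capacity:
                    self.cache[i] = row
                rows.append(row)
        return np.stack(rows)

    def apply_gradients(self, sparse_ids, grads, lr=0.01):
        """Sparse SGD "push": only the touched rows are updated."""
        for i, g in zip(sparse_ids, grads):
            self.table[i] -= lr * g
            if i in self.cache:
                self.cache[i] = self.table[i].copy()  # keep cached copy consistent

# A high-dimensional sparse input is just a handful of ids; the lookup
# turns them into low-dimensional dense vectors. Note the repeated id 42:
# a real system would deduplicate ids before any Host-GPU transfer.
emb = HostEmbeddingTable(vocab_size=1_000_000, dim=16, cache_capacity=4096)
batch_ids = [42, 7, 42, 99]
dense = emb.lookup(batch_ids)                 # shape: (4, 16)
emb.apply_gradients(batch_ids, np.ones_like(dense))
```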
