Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Neural networks in ads systems usually take input from multiple sources, e.g., query-ad relevance, ad features, and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a single computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed hierarchical GPU parameter server for massive-scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory, and SSD as a 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system: a 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster, and the price-performance ratio of the proposed system is 4-9 times better than the MPI-cluster solution.
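To make the 3-layer hierarchy concrete, below is a minimal, illustrative Python sketch of the lookup path it implies: the hottest sparse parameters cached in GPU HBM, warm ones in CPU main memory, and the full table resident on SSD. This is a sketch under simple assumptions (promote-on-access, demote-on-eviction), not the paper's implementation; all class and method names here are hypothetical.

```python
# Hypothetical sketch of a 3-tier parameter store: GPU HBM -> CPU RAM -> SSD.
# Plain dicts stand in for HBM/RAM caches; `shelve` stands in for an
# SSD-backed key-value store. Not the paper's actual system.

import shelve


class HierarchicalParameterStore:
    """Three-tier store for sparse embedding parameters."""

    def __init__(self, ssd_path, hbm_capacity, ram_capacity):
        self.hbm = {}                      # hottest parameters (models GPU HBM)
        self.ram = {}                      # warm parameters (CPU main memory)
        self.ssd = shelve.open(ssd_path)   # full parameter table on SSD
        self.hbm_capacity = hbm_capacity
        self.ram_capacity = ram_capacity

    def pull(self, feature_id):
        """Fetch one sparse parameter, promoting it toward the GPU tier."""
        key = str(feature_id)
        if key in self.hbm:                # fastest path: already in HBM
            return self.hbm[key]
        if key in self.ram:                # promote RAM -> HBM
            value = self.ram.pop(key)
        else:                              # slowest path: read from SSD
            value = self.ssd.get(key, 0.0) # 0.0 stands in for an embedding
        if len(self.hbm) >= self.hbm_capacity:
            self._evict_from_hbm()
        self.hbm[key] = value
        return value

    def _evict_from_hbm(self):
        """Demote one HBM entry to RAM; if RAM is full, spill to SSD."""
        key, value = self.hbm.popitem()    # arbitrary victim for brevity
        if len(self.ram) >= self.ram_capacity:
            spill_key, spill_value = self.ram.popitem()
            self.ssd[spill_key] = spill_value
        self.ram[key] = value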

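```

A production system along these lines would batch lookups per mini-batch, replace the arbitrary `popitem` victim with a frequency- or recency-aware eviction policy, and back the SSD tier with a purpose-built key-value store; the sketch only illustrates the data movement between the three tiers.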