Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems

Neural networks in ads systems usually take input from multiple sources, e.g., query-ad relevance, ad features, and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a single computing node. For example, a sponsored online advertising system can contain more than $10^{11}$ sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed hierarchical GPU parameter server for massive-scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory, and SSD as a 3-layer hierarchical storage. All the neural network training computations are contained in GPUs. Extensive experiments on real-world data confirm the effectiveness and the scalability of the proposed system: a 4-node hierarchical GPU parameter server can train a model more than 2X faster than a 150-node in-memory distributed parameter server in an MPI cluster, and the price-performance ratio of the proposed system is 4-9 times better than the MPI-cluster solution.
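To make the 3-layer hierarchy concrete, below is a minimal, illustrative Python sketch of the lookup path it implies: the hottest sparse parameters cached in GPU HBM, warm ones in CPU main memory, and the full table resident on SSD. This is a sketch under simple assumptions (promote-on-access, demote-on-eviction), not the paper's implementation; all class and method names here are hypothetical.

```python
# Hypothetical sketch of a 3-tier parameter store: GPU HBM -> CPU RAM -> SSD.
# Plain dicts stand in for HBM/RAM caches; `shelve` stands in for an
# SSD-backed key-value store. Not the paper's actual system.

import shelve


class HierarchicalParameterStore:
    """Three-tier store for sparse embedding parameters."""

    def __init__(self, ssd_path, hbm_capacity, ram_capacity):
        self.hbm = {}                      # hottest parameters (models GPU HBM)
        self.ram = {}                      # warm parameters (CPU main memory)
        self.ssd = shelve.open(ssd_path)   # full parameter table on SSD
        self.hbm_capacity = hbm_capacity
        self.ram_capacity = ram_capacity

    def pull(self, feature_id):
        """Fetch one sparse parameter, promoting it toward the GPU tier."""
        key = str(feature_id)
        if key in self.hbm:                # fastest path: already in HBM
            return self.hbm[key]
        if key in self.ram:                # promote RAM -> HBM
            value = self.ram.pop(key)
        else:                              # slowest path: read from SSD
            value = self.ssd.get(key, 0.0) # 0.0 stands in for an embedding
        if len(self.hbm) >= self.hbm_capacity:
            self._evict_from_hbm()
        self.hbm[key] = value
        return value

    def _evict_from_hbm(self):
        """Demote one HBM entry to RAM; if RAM is full, spill to SSD."""
        key, value = self.hbm.popitem()    # arbitrary victim for brevity
        if len(self.ram) >= self.ram_capacity:
            spill_key, spill_value = self.ram.popitem()
            self.ssd[spill_key] = spill_value
        self.ram[key] = value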

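```

A production system along these lines would batch lookups per mini-batch, replace the arbitrary `popitem` victim with a frequency- or recency-aware eviction policy, and back the SSD tier with a purpose-built key-value store; the sketch only illustrates the data movement between the three tiers.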