Carole-Jean Wu | Özgür Özkan | Zhuoran Zhao | Shin-Yeh Tsai | Mark Hempstead | Yavuz Yetim | Michael Lui
[1] Paul Covington, et al. Deep Neural Networks for YouTube Recommendations, 2016, RecSys.
[2] Newsha Ardalani, et al. Beyond human-level accuracy: computational challenges in deep learning, 2019, PPoPP.
[3] David M. Brooks, et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective, 2018, HPCA.
[4] Carole-Jean Wu, et al. Cross-Stack Workload Characterization of Deep Recommendation Systems, 2020, IISWC.
[5] Yehuda Koren, et al. Matrix Factorization Techniques for Recommender Systems, 2009, Computer.
[6] Minsoo Rhu, et al. Centaur: A Chiplet-based, Hybrid Sparse-Dense Accelerator for Personalized Recommendations, 2020, ISCA.
[7] Carole-Jean Wu, et al. Developing a Recommendation Benchmark for MLPerf Training and Inference, 2020, ArXiv.
[8] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[9] Carole-Jean Wu, et al. The Architectural Implications of Facebook's DNN-Based Personalized Recommendation, 2020, HPCA.
[10] Guorui Zhou, et al. Deep Interest Network for Click-Through Rate Prediction, 2017, KDD.
[11] Jichuan Chang, et al. Software-Defined Far Memory in Warehouse-Scale Computers, 2019, ASPLOS.
[12] Cody Coleman, et al. MLPerf Inference Benchmark, 2020, ISCA.
[13] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.
[14] Carole-Jean Wu, et al. DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference, 2020, ISCA.
[15] Vikas Raunak. Simple and Effective Dimensionality Reduction for Word Embeddings, 2017.
[16] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[17] Wilson C. Hsieh, et al. Bigtable: A Distributed Storage System for Structured Data, 2006, TOCS.
[18] Christopher Olston, et al. TensorFlow-Serving: Flexible, High-Performance ML Serving, 2017, ArXiv.
[19] Jingyuan Zhang, et al. AIBox: CTR Prediction Model Training on a Single Node, 2019, CIKM.
[20] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[21] Martin D. Schatz, et al. RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing, 2020, ISCA.
[22] Zi Yin, et al. On the Dimensionality of Word Embedding, 2018, NeurIPS.
[23] Yunming Ye, et al. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction, 2017, IJCAI.
[24] Kaushik Veeraraghavan, et al. Canopy: An End-to-End Performance Tracing And Analysis System, 2017, SOSP.
[25] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[26] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.
[27] Bracha Shapira, et al. Recommender Systems Handbook, 2015, Springer US.
[28] David Patterson, et al. MLPerf Training Benchmark, 2019, MLSys.
[29] Torsten Hoefler, et al. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, 2018.
[30] Pramod Viswanath, et al. All-but-the-Top: Simple and Effective Postprocessing for Word Representations, 2017, ICLR.
[31] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[32] Jure Leskovec, et al. Graph Convolutional Neural Networks for Web-Scale Recommender Systems, 2018, KDD.
[33] Onur Mutlu, et al. Base-delta-immediate compression: Practical data compression for on-chip caches, 2012, PACT.
[34] David A. Wood, et al. Frequent Pattern Compression: A Significance-Based Compression Scheme for L2 Caches, 2004.
[35] Ed H. Chi, et al. Factorized Deep Retrieval and Distributed TensorFlow Serving, 2018.
[36] Sachin Katti, et al. Bandana: Using Non-volatile Memory for Storing Deep Learning Models, 2018, MLSys.
[37] Bor-Yiing Su, et al. Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems, 2020, ArXiv.
[38] Ping Li, et al. Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems, 2020, MLSys.
[39] Valentin Khrulkov, et al. Tensorized Embedding Layers for Efficient Model Compression, 2019, ArXiv.
[40] Levent Sagun, et al. Scaling description of generalization with number of parameters in deep learning, 2019, Journal of Statistical Mechanics: Theory and Experiment.
[41] Jeffrey Dean, et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.
[42] Yinghai Lu, et al. Deep Learning Recommendation Model for Personalization and Recommendation Systems, 2019, ArXiv.
[43] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.
[44] Song Han, et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.
[45] Trevor N. Mudge, et al. Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge, 2017, ASPLOS.
[46] Yang Yang, et al. Deep Learning Scaling is Predictable, Empirically, 2017, ArXiv.
[47] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[48] Donald Beaver, et al. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, 2010.
[49] Hideki Nakayama, et al. Compressing Word Embeddings via Deep Compositional Code Learning, 2017, ICLR.
[50] Minsoo Rhu, et al. Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training, 2021, HPCA.