Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 2.7× Faster Inference

Deep learning for recommendation data is one of the most pervasive and challenging AI workloads of recent times. State-of-the-art recommendation models are among the largest models in existence, matching the likes of GPT-3 and Switch Transformer. The challenges in deep learning recommendation models (DLRM) stem from learning a dense embedding for every categorical token. These embedding tables in industrial-scale models can be as large as hundreds of terabytes. Such large models lead to a plethora of engineering challenges, not to mention prohibitive communication overheads and slower training and inference times. Of these, slower inference time directly impacts user experience. Model compression for DLRM is gaining traction, and the community has recently shown impressive compression results. In this paper, we present Random Offset Block Embedding Array (ROBE) as a low-memory alternative to embedding tables, which provides orders of magnitude reduction in memory usage while maintaining accuracy and boosting execution speed. ROBE is a simple, fundamental approach to improving both the cache performance and the variance of randomized hashing, which could be of independent interest in itself. We demonstrate that we can successfully train DLRM models with the same accuracy while using 1000× less memory. A 1000× compressed model directly results in faster inference without any engineering effort. In particular, we show that we can train the DLRM model using a ROBE Array of size 100MB on a single GPU to achieve the AUC of 0.8025 or higher required by the official MLPerf CriteoTB benchmark DLRM model of 100GB, while achieving about 3.1× (209%) improvement in inference throughput.
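To make the idea concrete, below is a minimal, hypothetical sketch of a ROBE-style lookup: every embedding lives in one small shared array, each d-dimensional embedding is split into blocks, and a cheap universal hash of the (token, block) pair gives the offset at which that block is read contiguously. The array size, block size, dimension, and hash constants are illustrative assumptions, not the paper's exact configuration or implementation.

```python
import numpy as np

# Hypothetical sketch of a ROBE-style embedding lookup (not the authors' code).
Z = 25_000_000          # size of the single shared ROBE array (assumed)
D = 16                  # embedding dimension (assumed)
B = 8                   # block size; B divides D (assumed)

# Learned parameter: one flat array shared by all features and all tokens,
# replacing per-token rows of a huge embedding table.
robe_array = np.random.uniform(-0.01, 0.01, size=Z).astype(np.float32)

# 2-universal hash h(x) = ((a*x + b) mod P) mod Z with a large prime P;
# constants here are placeholders and would be drawn at random in practice.
P = 2_147_483_647
a, b = 1_103_515_245, 12_345

def block_offset(token_id: int, block_idx: int) -> int:
    """Hash the (token, block) pair to a start offset in the shared array."""
    x = token_id * (D // B) + block_idx
    return ((a * x + b) % P) % Z

def robe_lookup(token_id: int) -> np.ndarray:
    """Assemble a D-dimensional embedding from hashed, contiguous blocks."""
    blocks = []
    for j in range(D // B):
        start = block_offset(token_id, j)
        # Wrap-around read keeps each block contiguous in memory,
        # which is what gives the cache-friendly access pattern.
        idx = (start + np.arange(B)) % Z
        blocks.append(robe_array[idx])
    return np.concatenate(blocks)

emb = robe_lookup(token_id=42)   # array of shape (D,)
```

Reading whole blocks from one contiguous region, rather than hashing every embedding coordinate independently, is the design choice that improves both cache behavior and the variance of the hashed estimate.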
