MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to look up the embedding tables. The inference is also heavily constrained in terms of latency because producing a recommendation for a user must be done within tens of milliseconds. In this paper, we propose MicroRec, a high-performance inference engine for recommendation systems. MicroRec accelerates recommendation inference by (1) redesigning the data structures involved in the embeddings to reduce the number of lookups needed and (2) taking advantage of the availability of High-Bandwidth Memory (HBM) in FPGA accelerators to tackle the latency by enabling parallel lookups. We have implemented the resulting design on an FPGA board, covering both the embedding lookup step and the complete inference process. Compared to an optimized CPU baseline (16 vCPUs, AVX2-enabled), MicroRec achieves a 13.8~14.7x speedup on embedding lookups alone and a 2.5~5.4x speedup for end-to-end recommendation inference in terms of throughput. As for latency, CPU-based engines need milliseconds to infer a recommendation, while MicroRec only takes microseconds, a significant advantage in real-time recommendation systems.
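To make the data-structure idea concrete, below is a minimal sketch of one way such a redesign can reduce the number of lookups: combining small embedding tables into a single Cartesian-product table so that two random memory accesses collapse into one. The table names, sizes, and helper functions are hypothetical and chosen only for illustration; this is not the paper's exact implementation.

```python
import numpy as np

# Hypothetical example: two small embedding tables with C1 and C2 rows can be
# merged offline into one table of C1*C2 rows whose entries concatenate the
# corresponding vectors. Two lookups then become a single lookup.
C1, C2, DIM = 8, 16, 4                      # small cardinalities, embedding width
table_a = np.random.rand(C1, DIM).astype(np.float32)
table_b = np.random.rand(C2, DIM).astype(np.float32)

# Offline precomputation: Cartesian-product table (C1*C2 rows, 2*DIM columns).
combined = np.concatenate(
    [np.repeat(table_a, C2, axis=0),        # row i of table_a repeated C2 times
     np.tile(table_b, (C1, 1))],            # table_b tiled C1 times
    axis=1,
)

def lookup_separate(i, j):
    # Baseline: two random memory accesses.
    return np.concatenate([table_a[i], table_b[j]])

def lookup_combined(i, j):
    # Redesigned layout: one access into the merged table.
    return combined[i * C2 + j]

i, j = 3, 11
assert np.allclose(lookup_separate(i, j), lookup_combined(i, j))
```

The trade-off is memory footprint (C1*C2 rows instead of C1+C2), so merging of this kind only pays off for tables with small cardinalities, while the remaining lookups can be served in parallel from HBM channels.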
