Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale

The tremendous success of machine learning (ML) and the unabated growth in model complexity have motivated many ML-specific hardware architecture designs to speed up model inference. While these architectures are diverse, highly optimized low-precision arithmetic is a component shared by most of them. Nevertheless, the recommender systems central to Facebook's personalization services are demanding and complex: they must serve billions of users per month with low latency while maintaining high prediction accuracy. Do these low-precision architectures work well with our production recommendation systems? They do, but not without significant effort. In this article, we share our search strategies for adapting reference recommendation models to low-precision hardware, our optimization of low-precision compute kernels, and the toolchain we use to maintain our models' accuracy throughout their lifespan. We believe our lessons from the trenches can promote better codesign between hardware architecture and software engineering and advance the state of the art of ML in industry.
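To make the low-precision idea concrete, the sketch below shows per-row asymmetric 8-bit quantization of an embedding table, one common post-training scheme for shrinking recommendation-model memory footprints. This is an illustrative NumPy sketch, not the production kernel described in the article; the function names and the uint8/scale/offset layout are assumptions for the example.

```python
import numpy as np

def quantize_rows_uint8(table: np.ndarray):
    """Per-row asymmetric quantization: each row keeps its own scale and offset.

    Storing a (scale, offset) pair per row bounds the rounding error by the
    row's own dynamic range instead of the whole table's.
    """
    mins = table.min(axis=1, keepdims=True)
    maxs = table.max(axis=1, keepdims=True)
    scales = (maxs - mins) / 255.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard constant rows
    q = np.clip(np.round((table - mins) / scales), 0, 255).astype(np.uint8)
    return q, scales.astype(np.float32), mins.astype(np.float32)

def dequantize_rows(q: np.ndarray, scales: np.ndarray, mins: np.ndarray):
    """Recover an approximate float32 table from the quantized form."""
    return q.astype(np.float32) * scales + mins

# Example: quantize a tiny embedding table and check the reconstruction error.
rng = np.random.default_rng(0)
table = rng.standard_normal((4, 8)).astype(np.float32)
q, scales, mins = quantize_rows_uint8(table)
recon = dequantize_rows(q, scales, mins)
err = np.abs(recon - table)
```

Because rounding to the nearest level moves each value by at most half a quantization step, the per-element error is bounded by `scales / 2` for its row, which is the property that makes 8-bit (and, with more care, 4-bit) embedding storage viable.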
