Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
Jianyu Huang | Carole-Jean Wu | Mikhail Smelyanskiy | Ping Tak Peter Tang | Maxim Naumov | Hector Yuen | Jongsoo Park | Changkyu Kim | Dhruv Choudhary | Sam Naghshineh | Zhaoxia Deng | Daya Khudia | Ellie Wen | Xiaohan Wei | Haixin Liu | Jie Yang | Raghuraman Krishnamoorthi | Satish Nadathur