First-Generation Inference Accelerator Deployment at Facebook

In this paper, we provide a deep dive into the deployment of inference accelerators at Facebook. Many of our ML workloads have unique characteristics, such as sparse memory accesses, large model sizes, and high compute, memory, and network bandwidth requirements. We co-designed a high-performance, energy-efficient inference accelerator platform based on these requirements. We describe the inference accelerator platform ecosystem we developed and deployed at Facebook: both the hardware, through the Open Compute Project (OCP), and the software framework and tooling, through PyTorch/Caffe2/Glow. A defining characteristic of this ecosystem from the start is its openness to a variety of AI accelerators from different vendors. This platform, with six low-power accelerator cards alongside a single-socket host CPU, allows us to serve models of high complexity that cannot be easily or efficiently run on CPUs. We describe various performance optimizations, at both the platform and accelerator level, which enable this platform to serve production traffic at Facebook. We also share deployment challenges and lessons learned during performance optimization, and provide guidance for future inference hardware co-design.

Yinghai Lu | Martin D. Schatz | Michael Gschwind | Michael J. Anderson | Hector Yuen | Aravind Kalaiah | Peter Tang | Jongsoo Park | Summer Deng | Nadathur Satish | Olof Johansson | Narayanan Sundaram | Changkyu Kim | Garret Catron | Abhishek Dhanotia | Jordan Fix | Nick Gibson | Wenyin Fu | Avinash Nayak | Sam Naghshineh | Harsha Bojja | Aravind Anbudurai | Ying Zhang | Jason Liang | Shishir Juluri | Jaewon Lee | Adi Gangidi | Benny Chen | Stephen Chen | Haixin Liu | Jack Montgomery | Arun Moorthy | Chris Petersen | Bangsheng Tang | Amy Yang | Jiecao Yu | Vandana Balan | Joe Boyd | Matthew Breitbach | Claudio Caldato | Anna Calvo | Sneh Chandwani | Panos Christeas | Brad Cottel | Brian Coutinho | Arun Dalli | Oniel Duncan | Roman Dzhabarov | Simon Elmir | Chunli Fu | Michael Fulthorp | Sean Gordon | Beatriz Padilla Hernandez | Daniel Ho | Yu-Cheng Huang | et al.
