Efficient Soft-Error Detection for Low-precision Deep Learning Recommendation Models

Soft errors, namely silent corruptions of signals or data in a computer system, cannot be cavalierly ignored as compute and communication density grow exponentially. Soft-error detection has been studied in the context of enterprise computing, high-performance computing, and, more recently, convolutional neural networks for autonomous driving. Deep learning recommendation systems (DLRMs) have by now become ubiquitous and serve billions of users per day. Nevertheless, DLRM-specific soft-error detection methods have hitherto been missing. To fill the gap, this paper presents the first set of soft-error detection methods for low-precision quantized-arithmetic operators in DLRMs, including general matrix multiplication (GEMM) and EmbeddingBag. A practical method must detect errors and do so with low overhead, lest reduced inference speed degrade user experience. Exploiting the characteristics of both quantized arithmetic and the operators, we achieved more than 95% detection accuracy for GEMM with an overhead below 20%. For EmbeddingBag, we achieved 99% effectiveness in detecting significant bit flips with less than 10% false positives, while keeping overhead below 26%.
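
The abstract does not spell out the detection scheme, so the following is only a minimal sketch of the general idea of checksum-based error detection for an integer GEMM, not the method of this paper. It assumes NumPy, int8 inputs with int32 accumulation, and a simple row-sum checksum; the function names `int_gemm`, `row_checksum_ok`, and `flip_bit` are hypothetical and chosen for illustration. Because integer arithmetic is exact, the check can compare for equality without a rounding tolerance.

```python
import numpy as np


def int_gemm(a_q, b_q):
    """Multiply two int8 matrices, accumulating in int32."""
    return a_q.astype(np.int32) @ b_q.astype(np.int32)


def row_checksum_ok(a_q, b_q, c):
    """Check the identity A @ (B @ 1) == C @ 1, row by row.

    The left side costs one matrix-vector product, far less than the GEMM.
    Widening to int64 keeps the checksums themselves from overflowing.
    """
    b_row_sums = b_q.astype(np.int64).sum(axis=1)   # B @ 1, shape (k,)
    expected = a_q.astype(np.int64) @ b_row_sums    # A @ (B @ 1), shape (m,)
    actual = c.astype(np.int64).sum(axis=1)         # C @ 1, shape (m,)
    return bool(np.array_equal(expected, actual))


def flip_bit(c, row, col, bit):
    """Return a copy of C with one bit of one element flipped (simulated fault)."""
    corrupted = c.copy()
    # View the int32 storage as uint32 so any bit position 0..31 can be flipped.
    corrupted.view(np.uint32)[row, col] ^= np.uint32(1 << bit)
    return corrupted


rng = np.random.default_rng(0)
a_q = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
b_q = rng.integers(-128, 128, size=(8, 5), dtype=np.int8)

c = int_gemm(a_q, b_q)
clean_ok = row_checksum_ok(a_q, b_q, c)
faulty_ok = row_checksum_ok(a_q, b_q, flip_bit(c, row=1, col=2, bit=20))
print(clean_ok, faulty_ok)
```

A single bit flip in the output changes exactly one row sum by a nonzero power of two, so this check flags it; errors that cancel within a row, or faults in the checksum computation itself, are outside what this sketch covers.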
