FT-BLAS: a high performance BLAS implementation with online fault tolerance

Basic Linear Algebra Subprograms (BLAS) is a core library in scientific computing and machine learning. This paper presents FT-BLAS, a new implementation of BLAS routines that not only tolerates soft errors on the fly, but also provides comparable performance to modern state-of-the-art BLAS libraries on widely-used processors such as Intel Skylake and Cascade Lake. To accommodate the features of BLAS, which contains both memory-bound and computing-bound routines, we propose a hybrid strategy to incorporate fault tolerance into our brand-new BLAS implementation: duplicating computing instructions for memory-bound Level-1 and Level-2 BLAS routines and incorporating an Algorithm-Based Fault Tolerance mechanism for computing-bound Level-3 BLAS routines. Our high performance and low overhead are obtained from delicate assembly-level optimization and a kernel-fusion approach to the computing kernels. Experimental results demonstrate that FT-BLAS offers high reliability and high performance -- faster than Intel MKL, OpenBLAS, and BLIS by up to 3.50%, 22.14% and 21.70%, respectively, for routines spanning all three levels of BLAS we benchmarked, even under hundreds of errors injected per minute.

[1]  Israel Koren,et al.  CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators , 2017, Conf. Computing Frontiers.

[2]  Zizhong Chen,et al.  A survey of power and energy efficient techniques for high performance numerical linear algebra operations , 2014, Parallel Comput..

[3]  Zizhong Chen,et al.  On-line soft error correction in matrix-matrix multiplication , 2013, J. Comput. Sci..

[4]  T. May,et al.  Alpha-particle-induced soft errors in dynamic memories , 1979, IEEE Transactions on Electron Devices.

[5]  Qian Wang,et al.  AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[6]  Zizhong Chen Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[7]  Robert A. van de Geijn,et al.  BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..

[8]  Carlos R. P. Hartmann,et al.  A novel concurrent error detection scheme for FFT networks , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[9]  G. Powers,et al.  A Description of the Advanced Research WRF Version 3 , 2008 .

[10]  Shuaiwen Song,et al.  New-Sum: A Novel Online ABFT Scheme For General Iterative Methods , 2016, HPDC.

[11]  John Shalf,et al.  DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges , 2014 .

[12]  Laxmikant V. Kalé,et al.  Scalable molecular dynamics with NAMD , 2005, J. Comput. Chem..

[13]  William Gropp,et al.  Towards a More Complete Understanding of SDC Propagation , 2017, HPDC.

[14]  John Shalf,et al.  The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..

[15]  Al Geist,et al.  Supercomputing's monster in the closet , 2016, IEEE Spectrum.

[16]  Meeta Sharma Gupta,et al.  Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[17]  Jeffrey S. Vetter,et al.  Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory , 2016, HPDC.

[18]  Franck Cappello,et al.  Improving performance of iterative methods by lossy checkponting , 2018, HPDC.

[19]  Zizhong Chen,et al.  Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[20]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[21]  Peter Y.-T. Hsu,et al.  Highly concurrent scalar processing , 1986, ISCA '86.

[22]  Dingwen Tao,et al.  Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra , 2016, HPDC.

[23]  Franck Cappello,et al.  Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.

[24]  Wei Tang,et al.  CutQC: using small Quantum computers for large Quantum circuit evaluations , 2020, ASPLOS.

[25]  Jack Dongarra,et al.  LAPACK Users' guide (third ed.) , 1999 .

[26]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[27]  Zizhong Chen,et al.  FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines , 2014, HPDC '14.

[28]  Franck Cappello,et al.  FT-iSort: efficient fault tolerance for introsort , 2019, SC.

[29]  Andrew A. Chien,et al.  Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience , 2015, ICCS.

[30]  Richard W. Vuduc,et al.  Self-stabilizing iterative solvers , 2013, ScalA '13.

[31]  Dingwen Tao,et al.  Correcting soft errors online in fast fourier transform , 2017, SC.

[32]  Franck Cappello,et al.  Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.

[33]  Israel Koren,et al.  Experimental and Analytical Study of Xeon Phi Reliability , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[34]  Eric Cheng,et al.  The resilience wall: Cross-layer solution strategies , 2014, Technical Papers of 2014 International Symposium on VLSI Design, Automation and Test.

[35]  Zizhong Chen,et al.  Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition , 2015, IEEE Transactions on Parallel and Distributed Systems.

[36]  Michael Nicolaidis Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[37]  Jing Yu,et al.  ESoftCheck: Removal of Non-vital Checks for Fault Tolerance , 2009, 2009 International Symposium on Code Generation and Optimization.

[38]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[39]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[40]  Zhi Chen,et al.  SIMD-based soft error detection , 2016, Conf. Computing Frontiers.

[41]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[42]  Zizhong Chen,et al.  Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.

[43]  Robert A. van de Geijn,et al.  Fault-tolerant high-performance matrix multiplication: theory and practice , 2001, 2001 International Conference on Dependable Systems and Networks.

[44]  Dingwen Tao,et al.  Extending checksum-based ABFT to tolerate soft errors online in iterative methods , 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).

[45]  Robert A. van de Geijn,et al.  High-performance implementation of the level-3 BLAS , 2008, TOMS.

[46]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[47]  Dong Li,et al.  Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.

[48]  Shuaiwen Song,et al.  Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.

[49]  Zizhong Chen,et al.  GreenMM: energy efficient GPU matrix multiplication through undervolting , 2019, ICS.

[50]  Robyn R. Lutz,et al.  Analyzing software requirements errors in safety-critical, embedded systems , 1993, [1993] Proceedings of the IEEE International Symposium on Requirements Engineering.

[51]  Song Fu,et al.  F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.

[52]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[53]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[54]  Rakesh Kumar,et al.  Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[55]  Zizhong Chen,et al.  A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing , 2008, 2008 11th IEEE High Assurance Systems Engineering Symposium.

[56]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[57]  Guanpeng Li,et al.  Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.

[58]  Mariagiovanna Sami,et al.  Fault tolerance in FFT arrays: Time redundancy approaches , 1990, J. VLSI Signal Process..

[59]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[60]  Zizhong Chen,et al.  Algorithm-Based Fault Tolerance for Fail-Stop Failures , 2008, IEEE Transactions on Parallel and Distributed Systems.