FT-BLAS: a high performance BLAS implementation with online fault tolerance
Elisabeth Giem | Jinyang Liu | Kai Zhao | Zizhong Chen | Yujia Zhai | Quan Fan
[1] Israel Koren,et al. CAROL-FI: an Efficient Fault-Injection Tool for Vulnerability Evaluation of Modern HPC Parallel Accelerators , 2017, Conf. Computing Frontiers.
[2] Zizhong Chen,et al. A survey of power and energy efficient techniques for high performance numerical linear algebra operations , 2014, Parallel Comput..
[3] Zizhong Chen,et al. On-line soft error correction in matrix-matrix multiplication , 2013, J. Comput. Sci..
[4] T. May,et al. Alpha-particle-induced soft errors in dynamic memories , 1979, IEEE Transactions on Electron Devices.
[5] Qian Wang,et al. AUGEM: Automatically generate high performance Dense Linear Algebra kernels on x86 CPUs , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[6] Zizhong Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[7] Robert A. van de Geijn,et al. BLIS: A Framework for Rapidly Instantiating BLAS Functionality , 2015, ACM Trans. Math. Softw..
[8] Carlos R. P. Hartmann,et al. A novel concurrent error detection scheme for FFT networks , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.
[9] G. Powers,et al. A Description of the Advanced Research WRF Version 3 , 2008 .
[10] Shuaiwen Song,et al. New-Sum: A Novel Online ABFT Scheme For General Iterative Methods , 2016, HPDC.
[11] John Shalf,et al. DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges , 2014 .
[12] Laxmikant V. Kalé,et al. Scalable molecular dynamics with NAMD , 2005, J. Comput. Chem..
[13] William Gropp,et al. Towards a More Complete Understanding of SDC Propagation , 2017, HPDC.
[14] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[15] Al Geist,et al. Supercomputing's monster in the closet , 2016, IEEE Spectrum.
[16] Meeta Sharma Gupta,et al. Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Jeffrey S. Vetter,et al. Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory , 2016, HPDC.
[18] Franck Cappello,et al. Improving performance of iterative methods by lossy checkponting , 2018, HPDC.
[19] Zizhong Chen,et al. Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[20] Shubhendu S. Mukherjee,et al. Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[21] Peter Y.-T. Hsu,et al. Highly concurrent scalar processing , 1986, ISCA '86.
[22] Dingwen Tao,et al. Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra , 2016, HPDC.
[23] Franck Cappello,et al. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation , 2015, 2015 IEEE International Conference on Cluster Computing.
[24] Wei Tang,et al. CutQC: using small Quantum computers for large Quantum circuit evaluations , 2020, ASPLOS.
[25] Jack Dongarra,et al. LAPACK Users' guide (third ed.) , 1999 .
[26] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[27] Zizhong Chen,et al. FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines , 2014, HPDC '14.
[28] Franck Cappello,et al. FT-iSort: efficient fault tolerance for introsort , 2019, SC.
[29] Andrew A. Chien,et al. Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience , 2015, ICCS.
[30] Richard W. Vuduc,et al. Self-stabilizing iterative solvers , 2013, ScalA '13.
[31] Dingwen Tao,et al. Correcting soft errors online in fast fourier transform , 2017, SC.
[32] Franck Cappello,et al. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.
[33] Israel Koren,et al. Experimental and Analytical Study of Xeon Phi Reliability , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[34] Eric Cheng,et al. The resilience wall: Cross-layer solution strategies , 2014, Technical Papers of 2014 International Symposium on VLSI Design, Automation and Test.
[35] Zizhong Chen,et al. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition , 2015, IEEE Transactions on Parallel and Distributed Systems.
[36] Michael Nicolaidis. Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).
[37] Jing Yu,et al. ESoftCheck: Removal of Non-vital Checks for Fault Tolerance , 2009, 2009 International Symposium on Code Generation and Optimization.
[38] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[39] Edward J. McCluskey,et al. Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..
[40] Zhi Chen,et al. SIMD-based soft error detection , 2016, Conf. Computing Frontiers.
[41] Natalia Gimelshein,et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.
[42] Zizhong Chen,et al. Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods , 2013, PPoPP '13.
[43] Robert A. van de Geijn,et al. Fault-tolerant high-performance matrix multiplication: theory and practice , 2001, 2001 International Conference on Dependable Systems and Networks.
[44] Dingwen Tao,et al. Extending checksum-based ABFT to tolerate soft errors online in iterative methods , 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
[45] Robert A. van de Geijn,et al. High-performance implementation of the level-3 BLAS , 2008, TOMS.
[46] David I. August,et al. SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.
[47] Dong Li,et al. Classifying soft error vulnerabilities in extreme-Scale scientific applications using a binary instrumentation tool , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[48] Shuaiwen Song,et al. Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing , 2015, 2015 IEEE International Parallel and Distributed Processing Symposium.
[49] Zizhong Chen,et al. GreenMM: energy efficient GPU matrix multiplication through undervolting , 2019, ICS.
[50] Robyn R. Lutz,et al. Analyzing software requirements errors in safety-critical, embedded systems , 1993, [1993] Proceedings of the IEEE International Symposium on Requirements Engineering.
[51] Song Fu,et al. F-SEFI: A Fine-Grained Soft Error Fault Injection Tool for Profiling Application Vulnerability , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[52] Edward J. McCluskey,et al. Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..
[53] Harish Patil,et al. Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.
[54] Rakesh Kumar,et al. Algorithmic approaches to low overhead fault detection for sparse linear algebra , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[55] Zizhong Chen,et al. A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing , 2008, 2008 11th IEEE High Assurance Systems Engineering Symposium.
[56] Jack J. Dongarra,et al. Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..
[57] Guanpeng Li,et al. Understanding Error Propagation in Deep Learning Neural Network (DNN) Accelerators and Applications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[58] Mariagiovanna Sami,et al. Fault tolerance in FFT arrays: Time redundancy approaches , 1990, J. VLSI Signal Process..
[59] Jack J. Dongarra,et al. Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.
[60] Zizhong Chen,et al. Algorithm-Based Fault Tolerance for Fail-Stop Failures , 2008, IEEE Transactions on Parallel and Distributed Systems.