High Performance Linear System Solver with Resilience to Multiple Soft Errors

In the multi-petaflop era of supercomputing, the number of computing cores is growing exponentially. However, as integrated circuit technology scales below 65 nm, the critical charge required to flip a gate or a memory cell is dangerously reduced. Combined with higher vulnerability to cosmic radiation, soft errors are expected to become all but inevitable for modern supercomputer systems. As a result, soft errors have become a serious concern for long-running applications on high-end machines, including linear solvers for dense matrices. The classical checkpoint and restart (C/R) scheme loses effectiveness against this threat because soft errors are difficult to detect: they take the form of transient bit flips that do not interrupt program execution and therefore leave no trace of their occurrence. Current research on soft error resilience for dense linear solvers offers limited capability on large-scale computing systems, which suffer both from round-off error in floating-point arithmetic and from the occurrence and subsequent propagation of multiple soft errors. Error correcting codes based on Galois fields incur a high computing cost for recovery. This work proposes a fault-tolerant algorithm for a dense linear system solver that is resilient to multiple spatial and temporal soft errors. The algorithm is designed to work with floating-point data and can recover the solution of Ax = b from multiple soft errors that affect any part of the matrix during computation. Additionally, the computational complexity of error detection and recovery is optimized through novel methods. Experimental results on cluster systems confirm that the proposed fault tolerance functionality can successfully detect and locate soft errors and recover the solution of the linear system. The performance impact is negligible, and the performance of the soft-error-resilient algorithm scales well on large-scale systems.
Keywords: soft error; fault tolerance; multiple errors; dense linear system solver
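
To make the idea of detecting, locating, and correcting a soft error in floating-point matrix data more concrete, here is a minimal sketch of the classical row and column checksum technique from algorithm-based fault tolerance. It is not the algorithm proposed in this work: it handles only a single corrupted entry in a static matrix, whereas the abstract describes recovery from multiple errors during the solve. The function names `checksums` and `locate_and_correct`, the absolute tolerance `tol`, and the test matrix are all assumptions made for illustration.

```python
def checksums(a):
    """Return (row_sums, col_sums) for a rectangular list-of-lists matrix."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def locate_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Detect, locate, and repair at most one corrupted entry of `a` in place.

    Returns the (row, col) index of the repaired entry, or None when no
    checksum differs from its stored value by more than `tol`. The tolerance
    (a hypothetical absolute bound) keeps ordinary round-off from being
    reported as a soft error.
    """
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("this sketch handles a single corrupted entry only")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum mismatch equals the magnitude of the injected error.
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j


# Illustrative data: store checksums, then simulate a transient bit flip.
a = [[4.0, 1.0, 2.0],
     [1.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
rows, cols = checksums(a)
a[1][2] += 1024.0            # silent corruption; execution is not interrupted
result = locate_and_correct(a, rows, cols)
print(result, a[1][2])
```

The mismatching row checksum and the mismatching column checksum intersect at the corrupted entry, and the size of the mismatch gives the correction. With several simultaneous errors the intersections become ambiguous, which is one reason a single pair of checksums is not enough for the multiple-error setting the abstract targets.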
