Simulated Annealing to Generate Numerically Stable Real Number Error Correction Codes

Checksum-based approaches can provide fault tolerance with lower overhead than general techniques. Handling multiple simultaneous failures requires multiple checksums, and these in turn require real number coefficients. However, no exact solution is known for the coefficients that introduce the least numerical error, so a method is needed to generate usable ones. In this paper, we use an evolutionary algorithm to create coefficients with good numerical stability while using as little computation time as possible.
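To make the idea concrete, here is a minimal sketch of how simulated annealing could search for checksum coefficients. It is not the paper's implementation, and everything specific in it is an assumption made for illustration: it uses two checksums, so the coefficients form a 2 x n matrix; it scores stability by the worst condition number over all 2 x 2 column submatrices, which are the systems that would be solved to recover two lost values; and the names `cond_2x2`, `worst_cond`, and `anneal`, the Gaussian perturbation, and the geometric cooling schedule are all hypothetical choices.

```python
import math
import random
from itertools import combinations


def cond_2x2(a, b, c, d):
    """2-norm condition number of [[a, b], [c, d]], computed in closed form."""
    frob2 = a * a + b * b + c * c + d * d      # squared Frobenius norm
    det = abs(a * d - b * c)
    if det == 0.0:
        return math.inf                        # singular: recovery impossible
    disc = math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))
    s_max2 = (frob2 + disc) / 2.0              # largest singular value, squared
    return s_max2 / det                        # s_max / s_min, as s_min = det / s_max


def worst_cond(rows):
    """Assumed objective: worst condition number over all pairs of columns."""
    r0, r1 = rows
    return max(cond_2x2(r0[i], r0[j], r1[i], r1[j])
               for i, j in combinations(range(len(r0)), 2))


def anneal(n_cols, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Search for a 2 x n_cols coefficient matrix with a low worst_cond."""
    rng = random.Random(seed)
    cur = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(2)]
    cur_cost = start_cost = worst_cond(cur)
    best, best_cost = [row[:] for row in cur], cur_cost
    temp = t0
    for _ in range(steps):
        # Perturb one randomly chosen coefficient.
        r, c = rng.randrange(2), rng.randrange(n_cols)
        old = cur[r][c]
        cur[r][c] = old + rng.gauss(0, 0.1)
        new_cost = worst_cond(cur)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = [row[:] for row in cur], cur_cost
        else:
            cur[r][c] = old                    # reject: undo the perturbation
        temp *= cooling                        # geometric cooling (assumed)
    return best, best_cost, start_cost


if __name__ == "__main__":
    coeffs, cost, start = anneal(n_cols=4)
    print("initial worst condition number:", start)
    print("annealed worst condition number:", cost)
```

A condition number close to 1 means that rounding errors are barely amplified when lost values are reconstructed from the checksums. The fixed seed only makes the run repeatable. For more than two checksums the closed-form 2 x 2 formula would need to be replaced by a general condition number routine.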
