Cluster Survivability with ByzwATCh: A Byzantine Hardware Fault Detector for Parallel Machines with Charm++

Modern high-performance computing relies heavily on commodity processors arranged in clusters. These clusters consist of individual nodes (typically off-the-shelf single- or dual-processor machines) connected by a high-speed interconnect. Cluster computation has many benefits, but it also carries the liability of being failure-prone due to the sheer number of components involved. Many effective solutions have been proposed to aid failure recovery in clusters; however, they depend on the failures being detectable. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for detecting Byzantine hardware errors at run time as part of the Charm++ parallel programming framework.
