Byzantine Anomaly Testing for Charm++: Providing Fault Tolerance and Survivability for Charm++ Empowered Clusters

Recent shifts in high-performance computing have increased the use of clusters built around cheap commodity processors. A typical cluster consists of individual nodes, each containing one or several processors, connected by a high-bandwidth, low-latency interconnect. Clusters offer many benefits for computation, but they also have drawbacks, including a tendency to exhibit low Mean Time To Failure (MTTF) due to the sheer number of components involved. A number of fault-tolerance techniques have recently been proposed and developed to mitigate the inherent unreliability of clusters. These techniques, however, fail to address the detection of non-obvious faults, particularly Byzantine faults. At present, effectively detecting Byzantine faults is an open problem. We describe the operation of ByzwATCh, a module for detecting Byzantine hardware errors at run time as part of the Charm++ parallel programming framework.
