Assessing Fault Sensitivity in MPI Applications

Today, clusters built from commodity PCs dominate high-performance computing, with systems containing thousands of processors now being deployed. As node counts for multi-teraflop systems grow into the thousands, and with proposed petaflop systems likely to contain tens of thousands of nodes, the standard assumption that system hardware and software are fully reliable becomes much less credible. Concomitantly, understanding application sensitivity to system failures is critical to establishing confidence in the outputs of large-scale applications. Using software fault injection, we simulated single-bit memory errors, register file upsets, and MPI message payload corruption, and measured the behavioral responses for a suite of MPI applications. These experiments showed that most applications are very sensitive to even single errors. Perhaps most worrisome, the errors were often undetected, yielding erroneous output with no user indicators. Encouragingly, even minimal internal application error checking and program assertions can detect some of the faults we injected.
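To illustrate how message payload corruption can be injected without modifying application source, the sketch below intercepts MPI_Send through the standard MPI profiling interface (PMPI) and flips one randomly chosen bit in a private copy of the outgoing buffer. This is a minimal sketch of the general technique, not the injector used in the paper; the 0.1% injection rate, the use of rand(), and the assumption of a contiguous datatype are illustrative choices.

```c
/* Minimal sketch of payload-corruption injection via the MPI profiling
 * interface (PMPI). Not the authors' tool; rates and RNG are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Intercept MPI_Send; with a small probability, flip one randomly chosen
 * bit in a copy of the payload before forwarding to PMPI_Send. Assumes a
 * contiguous datatype so that count * type_size covers the whole buffer. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    size_t nbytes = (size_t)count * (size_t)type_size;

    if (nbytes > 0 && (rand() % 1000) == 0) {     /* hypothetical 0.1% rate */
        unsigned char *copy = malloc(nbytes);
        if (copy != NULL) {
            memcpy(copy, buf, nbytes);
            size_t byte = (size_t)rand() % nbytes;              /* pick a byte  */
            copy[byte] ^= (unsigned char)(1u << (rand() % 8));  /* flip one bit */
            int rc = PMPI_Send(copy, count, datatype, dest, tag, comm);
            free(copy);
            return rc;
        }
    }
    return PMPI_Send(buf, count, datatype, dest, tag, comm);    /* uncorrupted */
}
```

Because the wrapper lives in a separate library linked ahead of the MPI library, the target applications run unmodified, which is what makes this style of injection practical for a whole suite of MPI codes.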

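The abstract's closing observation, that even cheap internal checks catch some injected faults, can be made concrete with a sketch like the following: a per-iteration assertion on a solver's residual norm. The function name, the 1% slack factor, and the monotonicity assumption are all hypothetical; a real application would tailor the invariant to its own algorithm.

```c
/* Minimal sketch of a cheap application-level check: assert each iteration
 * that the residual norm is finite and has not jumped. Names and the 1%
 * slack are illustrative assumptions, not taken from the paper. */
#include <assert.h>
#include <math.h>

static double prev_norm = INFINITY;

/* Call once per solver iteration; a silent single-bit upset in the solution
 * or matrix data often surfaces as a NaN or a sudden rise in the residual. */
void check_residual(double residual_norm)
{
    assert(isfinite(residual_norm));           /* catches NaN/Inf from corrupted data */
    assert(residual_norm <= prev_norm * 1.01); /* hypothetical 1% slack for normal noise */
    prev_norm = residual_norm;
}
```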