Assessing the effects of communication faults on parallel applications

This paper addresses the problem of injecting faults into the communication system of disjoint memory parallel computers. It presents fault injection results showing that 5% to 30% of the faults injected in the communication subsystem of a commercial parallel computer caused undetected errors that led the application to generate erroneous results. In all these cases it would be virtually impossible to detect that the benchmark output was erroneous, as the size of the results file was plausible and no system errors had been detected. This emphasizes the need for fault tolerant techniques in parallel systems in order to achieve confidence in the application results. This is especially true in massively parallel computers, as the probability of a fault occurring increases with the number of processing nodes. Moreover, in disjoint memory computers, the most popular and scalable parallel architecture, the communication subsystem plays an important role and is also very prone to errors. CSFI (Communication Software Fault Injector) is a versatile tool for injecting communication faults in parallel computers. Faults injected with CSFI are produced by software and directly emulate communication faults and spurious messages generated by non fail-silent nodes, allowing the evaluation of the impact of faults in parallel systems and the assessment of fault tolerant techniques. The use of CSFI is nearly transparent to the target application, as it requires only minor adaptations. Deterministic faults of different kinds can be injected without user intervention, and fault injection results are collected automatically by CSFI.
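
As an illustration only, the idea of emulating communication faults in software can be sketched as a thin wrapper around a message send routine. This is not CSFI's implementation: the names `FaultSpec` and `FaultySender`, the three fault kinds, and the trigger on a message index are all assumptions chosen to keep the sketch small. It shows how a deterministic fault (corrupting, dropping, or duplicating one chosen message) could be injected and logged without changing the application beyond swapping its send call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FaultSpec:
    """Hypothetical description of one deterministic fault."""
    kind: str            # "bitflip", "drop", or "spurious"
    message_index: int   # 0-based index of the outgoing message to hit
    bit: int = 0         # bit position to flip (used by "bitflip" only)


@dataclass
class FaultySender:
    """Wraps a delivery function and injects exactly one fault."""
    deliver: Callable[[bytes], None]
    fault: FaultSpec
    sent: int = 0
    log: List[str] = field(default_factory=list)

    def send(self, payload: bytes) -> None:
        index = self.sent
        self.sent += 1
        # Messages other than the targeted one pass through untouched.
        if index != self.fault.message_index:
            self.deliver(payload)
            return
        if self.fault.kind == "drop":
            # Emulate a lost message: nothing is delivered.
            self.log.append(f"dropped message {index}")
        elif self.fault.kind == "bitflip":
            corrupted = bytearray(payload)
            if corrupted:
                # Wrap the bit position so it always falls inside the payload.
                byte, offset = divmod(self.fault.bit % (len(corrupted) * 8), 8)
                corrupted[byte] ^= 1 << offset
            self.deliver(bytes(corrupted))
            self.log.append(f"flipped bit {self.fault.bit} in message {index}")
        elif self.fault.kind == "spurious":
            # Emulate a non fail-silent node by sending an extra copy.
            self.deliver(payload)
            self.deliver(payload)
            self.log.append(f"duplicated message {index}")
        else:
            raise ValueError(f"unknown fault kind: {self.fault.kind}")


if __name__ == "__main__":
    received: List[bytes] = []
    sender = FaultySender(received.append, FaultSpec("bitflip", message_index=1))
    for msg in (b"aa", b"bb", b"cc"):
        sender.send(msg)
    print(received)
    print(sender.log)
```

In this sketch the corrupted message has the same length as the original, which mirrors the observation above that erroneous output can look plausible while being wrong.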
