A language-driven tool for fault injection in distributed systems

In a network of several thousand computers, faults are unavoidable. Being able to test the behavior of a distributed program in an environment where faults (such as the crash of a process) can be controlled is an important step in deploying reliable programs. In this paper, we present FAIL (FAult Injection Language), a language for specifying complex fault scenarios in a simple way, relieving the user from writing low-level code. It supports both probabilistic scenarios (for quantitative tests of average behavior) and deterministic, reproducible scenarios (for studying the application's behavior in particular cases). We also present FCI, the FAIL Cluster Implementation, which consists of a compiler, a runtime library, and a middleware platform for software fault injection in distributed applications. FCI can interface with applications written in numerous programming languages without requiring modification of their source code, and our preliminary tests show that its runtime overhead is low.
