A Case Study for Fault Tolerance Oriented Programming in Multi-core Architecture

Multi-core architectures bring common software developers both new challenges and new means. Reliable software system design approaches give high confidence that long-running online software systems run correctly, but they inevitably cost some efficiency. We find that the multi-core architecture is a suitable platform for reliable software system design: its parallel performance and its prevalence keep this cost acceptable. In this paper we use the multi-core architecture to support software fault tolerance, with the aim of making the combination of software fault tolerance and the multi-core architecture a common design choice. Following the idea of software fault tolerance, for each key software unit in a system we can develop N separate versions with equivalent functionality. Each version is developed independently by an isolated group to avoid identical faults among versions. All versions run separately from the same initial conditions and inputs. Their outputs are submitted to a decision module, which determines a single result from the multiple results as the correct output. We present a case study showing that, on a multi-core architecture, the redundant versions of a key software unit can run in parallel on different cores to improve efficiency.
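The scheme described above (N independently written versions, identical inputs, a decision module) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the case study: the three versions `version_a`, `version_b`, and `version_c` are hypothetical, the third carries a deliberately injected fault, and the decision module is a simple strict-majority vote, which is only one of several possible decision strategies. The executor is passed in as a parameter. The demo uses a thread pool so that the sketch runs anywhere, but in CPython a `ProcessPoolExecutor` would be the more plausible choice for actually placing the versions on different cores.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def version_a(n):
    # Hypothetical version 1: sum of squares 0..n-1 with an explicit loop.
    total = 0
    for i in range(n):
        total += i * i
    return total


def version_b(n):
    # Hypothetical version 2: the same quantity via the closed-form formula.
    return (n - 1) * n * (2 * n - 1) // 6


def version_c(n):
    # Hypothetical version 3 with an injected fault (off-by-one range).
    return sum(i * i for i in range(n + 1))


def majority_vote(outputs):
    # Decision module (assumed strategy): accept a result only if a strict
    # majority of versions agree on it. Outputs must be hashable.
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among version outputs")
    return value


def run_n_versions(versions, arg, executor):
    # Submit every version with the same input, collect all outputs,
    # then let the decision module pick the single result.
    futures = [executor.submit(version, arg) for version in versions]
    outputs = [future.result() for future in futures]
    return majority_vote(outputs)


def demo():
    # Three redundant versions, one worker per version.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return run_n_versions([version_a, version_b, version_c], 10, pool)
```

Calling `demo()` returns 285: the two correct versions outvote the faulty one, whose result of 385 is discarded. The sketch does not handle a version that raises an exception or never terminates; a fuller decision module would need a timeout and an error policy for those cases.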