A Cost-Effective Forward Recovery Checkpointing Scheme in Multiprocessor Systems

This paper proposes a novel, cost-effective forward recovery checkpointing scheme for multiprocessor systems with duplex modular redundancy. When the comparison between the two processing modules mismatches at a checkpoint, one module is selected to retry the questionable checkpoint while the other executes toward the next checkpoint. Schemes that use a spare module for the retry need considerable time to initialize that module and incur a high extra cost. The traditional rollback scheme retries the questionable checkpoint without a spare module, but its average job completion time is longer than that of our scheme under any fault distribution. Our scheme can also locate permanent faults, not only transient ones. Experimental results based on our mathematical models show that, under burst errors, our scheme reduces the average completion time by 50% compared with traditional rollback and is comparable with the scheme that retries on a spare module. In addition, our scheme has the lowest total execution time of the three schemes under any fault distribution, which makes it the most cost-effective.
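The checkpoint decision described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Role`, `assign_roles`, and `locate_fault` are hypothetical, checkpoint states are modeled as plain comparable values, and the paper does not say here how the retrying module is chosen or how faults are located. In particular, `locate_fault` assumes that the retry result is compared against both stored states, which is an assumption of this sketch.

```python
from enum import Enum


class Role(Enum):
    PROCEED = "proceed"  # states matched: run the next interval as usual
    RETRY = "retry"      # re-execute the questionable checkpoint
    FORWARD = "forward"  # execute toward the next checkpoint


def assign_roles(state_a, state_b, retry_module="A"):
    """Assign roles to modules A and B after a checkpoint comparison.

    Which module retries is passed in as a parameter because the
    selection policy is not specified here (assumption of this sketch).
    """
    if state_a == state_b:
        return {"A": Role.PROCEED, "B": Role.PROCEED}
    other = "B" if retry_module == "A" else "A"
    return {retry_module: Role.RETRY, other: Role.FORWARD}


def locate_fault(state_a, state_b, retry_state):
    """Guess which module was faulty from the retry result.

    Assumed rule (not taken from the paper): the module whose stored
    state disagrees with the retry result is the suspect. Returns None
    when the retry result matches neither or both stored states.
    """
    matches_a = retry_state == state_a
    matches_b = retry_state == state_b
    if matches_a and not matches_b:
        return "B"
    if matches_b and not matches_a:
        return "A"
    return None
```

For example, `assign_roles(1, 2)` assigns the retry to module A and lets module B run forward, and `locate_fault(1, 2, 1)` then points at module B as the suspect under the assumed comparison rule.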
