Implementation of Watch Dog Timer for Fault Tolerant Computing on Cluster Server

Clusters have become a necessity for modern computing and data applications, because many applications need long computation times (days or even months). Parallelization speeds up computation, but the time required by many applications can still be long. The reliability of the cluster therefore becomes a very important issue, and implementing a fault-tolerance mechanism becomes essential. The difficulty of designing a fault-tolerant cluster system grows with the variety and severity of the failures it must handle. Most importantly, an algorithm that handles a simple failure in the system must also tolerate more severe failures. In this paper, we implement the watchdog timer concept in a parallel environment to handle failures. The simple algorithm implemented in our project handles different types of failures, and we found that it improves the reliability of the cluster.

Keywords: Cluster, Fault tolerant, Grid, Grid Computing System, Meta-computing.
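
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general watchdog-timer idea in a cluster setting, not the authors' implementation. It assumes that each worker node periodically sends a heartbeat ("kick") to a monitor, that a node silent for longer than a timeout is treated as failed, and that its tasks are handed to the remaining nodes. The names `Watchdog`, `kick`, `expired`, and `reassign`, as well as the round-robin reassignment policy, are hypothetical and chosen for illustration.

```python
import time


class Watchdog:
    """Tracks the last heartbeat of each node and reports timeouts (illustrative)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds a node may stay silent (assumed policy)
        self.clock = clock       # injectable clock, which makes testing easier
        self.last_seen = {}      # node name -> time of its last heartbeat

    def kick(self, node):
        """Record a heartbeat from `node`, which resets its timer."""
        self.last_seen[node] = self.clock()

    def expired(self):
        """Return the sorted names of nodes whose timer has run out."""
        now = self.clock()
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def reassign(tasks, failed, healthy):
    """Return a new task -> node map with failed nodes' tasks moved round-robin."""
    if not healthy:
        raise RuntimeError("no healthy node left to take over work")
    result = dict(tasks)
    orphaned = sorted(task for task, node in tasks.items() if node in failed)
    for i, task in enumerate(orphaned):
        result[task] = healthy[i % len(healthy)]
    return result
```

In use, a monitor process would call `kick` whenever a heartbeat message arrives, poll `expired` at a fixed interval, and pass the result to `reassign`. A real cluster would also need to handle failure of the monitor itself and to restart or checkpoint the interrupted tasks, which this sketch leaves out.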
