Group communication protocol for flexible distributed systems

In large-scale distributed systems, processes have to be upgraded to absorb changes in user requirements and system environments. Conventional upgrading methods cannot keep the system available because multiple processes have to be suspended simultaneously. This paper discusses a new method in which each process can invoke the upgrading procedure asynchronously. The key idea is that multiple versions of a process can operate temporarily side by side. Each pair of an old-version process and its new-version counterpart is managed as one process group. The proposed group communication protocol supports message transmission among the process groups. Moreover, the protocol detects protocol errors caused by the co-existence of multiple versions of processes. A checkpoint-rollback algorithm for resolving these protocol errors is proposed. With this algorithm, only the minimum number of processes is rolled back, and the rollback is asynchronous. Hence, the system remains highly available even if a protocol error occurs.
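The process-group idea can be illustrated with a minimal sketch. It is not the protocol proposed in the paper, whose details the abstract does not give. It assumes that each version of a process declares the message kinds it accepts, that a group routes a message to the newest member version able to handle it, and that a protocol error means no member version accepts the message. The names `Message`, `ProcessGroup`, `deliver`, and `retire` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Message:
    sender: str
    kind: str  # message type name (hypothetical field)


@dataclass
class ProcessGroup:
    """An old-version and a new-version process managed as one group."""
    name: str
    # Assumption: version number -> message kinds that version accepts.
    accepts: Dict[int, Set[str]]
    delivered: List[Tuple[int, str]] = field(default_factory=list)

    def deliver(self, msg: Message) -> Optional[int]:
        """Route msg to the newest member version that accepts it.

        Returns the version that handled the message, or None when no
        member accepts it (treated here as a protocol error).
        """
        for version in sorted(self.accepts, reverse=True):
            if msg.kind in self.accepts[version]:
                self.delivered.append((version, msg.kind))
                return version
        return None

    def retire(self, version: int) -> None:
        """Remove a version from the group once its upgrade is complete."""
        self.accepts.pop(version, None)


group = ProcessGroup("server", {1: {"req"}, 2: {"req2"}})
old_peer = Message("client_a", "req")    # sent by a not-yet-upgraded peer
new_peer = Message("client_b", "req2")   # sent by an upgraded peer

handled_old = group.deliver(old_peer)    # handled by version 1
handled_new = group.deliver(new_peer)    # handled by version 2
group.retire(1)
after_retire = group.deliver(old_peer)   # no member accepts "req" any more
```

In this sketch the old and new versions coexist, so peers upgrade at their own pace. A `None` result marks the point at which a scheme like the paper's checkpoint-rollback algorithm would have to resolve the error.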
