Fault-tolerant clock synchronization of large multicomputers via multistep interactive convergence

We present a fault-tolerant algorithm that internally synchronizes clocks in multicomputer systems employing not completely connected networks (NCCNs). The algorithm, referred to as multistep interactive convergence, is implemented locally in each node by a time server process (TSP). It proceeds in rounds and bases its operation on a logical mapping of the system's TSPs into an m-dimensional array. A TSP executes m steps per round, each step including a call to an interactive convergence procedure. Clock readings in step i are gathered only from TSPs sharing a row along dimension i of the array, which reduces the number of messages by orders of magnitude compared with a conventional interactive convergence algorithm. The algorithm can be used in systems of arbitrary topology, and provides the added benefit of increased locality of communication in regular NCCNs. These advantages can be combined with a variety of message staggering mechanisms to keep network contention to a minimum. We characterize the maximum clock skew, maximum clock drift, maximum clock discontinuity, and number of messages produced by the algorithm, and show that it tolerates arbitrary faults. A comparison with other algorithms is provided.
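
The round structure described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that the interactive convergence procedure is an egocentric average (a reading that differs from the TSP's own clock by more than a threshold `delta` is replaced by the TSP's own value), that clocks are plain numbers read instantaneously with no message delay, and that gathering one reading costs one message regardless of routing. All function names and the parameter `delta` are hypothetical.

```python
from itertools import product


def row_peers(coord, dim, shape):
    """All array positions sharing coord's row along dimension dim (coord included)."""
    return [coord[:dim] + (k,) + coord[dim + 1:] for k in range(shape[dim])]


def converge(own, readings, delta):
    """Assumed convergence step: egocentric average.

    Readings farther than delta from the TSP's own clock are treated as
    suspect and replaced by the TSP's own value before averaging.
    """
    vals = [r if abs(r - own) <= delta else own for r in readings]
    return sum(vals) / len(vals)


def multistep_round(clocks, shape, delta):
    """One round: m steps, where step i gathers readings only along dimension i."""
    for dim in range(len(shape)):
        snapshot = dict(clocks)  # every TSP reads the values from the start of the step
        for coord in product(*(range(s) for s in shape)):
            readings = [snapshot[p] for p in row_peers(coord, dim, shape)]
            clocks[coord] = converge(snapshot[coord], readings, delta)
    return clocks


def messages_per_round(shape):
    """Readings gathered per round: (multistep, conventional all-to-all).

    Assumes one message per reading; hop counts in the physical network are ignored.
    """
    n = 1
    for s in shape:
        n *= s
    multistep = n * sum(s - 1 for s in shape)
    conventional = n * (n - 1)
    return multistep, conventional


# Usage: 16 TSPs mapped into a 4 x 4 array, fault-free clocks 0.0 .. 15.0.
shape = (4, 4)
clocks = {(i, j): float(4 * i + j) for i in range(4) for j in range(4)}
multistep_round(clocks, shape, delta=100.0)
skew = max(clocks.values()) - min(clocks.values())
```

Under these assumptions, the 4 x 4 example gathers 96 readings per round instead of 240, and a 10 x 10 x 10 array gathers 27,000 instead of 999,000. The gap widens as the system grows, which is consistent with the message reduction claimed in the abstract.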
