Low Latency Fault Tolerance System

The Low Latency Fault Tolerance (LLFT) system provides fault tolerance for distributed applications, using the leader-follower replication technique. The LLFT system provides application-transparent replication, with strong replica consistency, for applications that involve multiple interacting processes or threads. The LLFT system comprises a Low Latency Messaging Protocol, a Leader-Determined Membership Protocol, and a Virtual Determinizer Framework. The Low Latency Messaging Protocol provides reliable, totally ordered message delivery by employing a direct group-to-group multicast, where the message ordering is determined by the primary replica in the group. The Leader-Determined Membership Protocol provides reconfiguration and recovery when a replica becomes faulty and when a replica joins or leaves a group, where the membership of the group is determined by the primary replica. The Virtual Determinizer Framework captures the ordering information at the primary replica and enforces the same ordering at the backup replicas for major sources of non-determinism, including multi-threading, time-related operations and socket communication. The LLFT system achieves low latency message delivery during normal operation and low latency reconfiguration and recovery when a fault occurs.
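
The idea of ordering determined by the primary replica can be illustrated with a minimal Python sketch. It is not the LLFT Low Latency Messaging Protocol itself: the class names `Primary` and `Backup`, the methods `order` and `receive`, and the use of a plain integer sequence number are all assumptions made for illustration. The primary stamps each message with the next sequence number, and a backup delivers messages strictly in that order, buffering any that arrive early and ignoring duplicates.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Hypothetical primary replica: it alone decides the message order."""
    next_seq: int = 0

    def order(self, payload: str) -> tuple[int, str]:
        # Stamp the message with the next sequence number.
        seq = self.next_seq
        self.next_seq += 1
        return seq, payload


@dataclass
class Backup:
    """Hypothetical backup replica: delivers in the primary's order only."""
    expected: int = 0
    pending: dict[int, str] = field(default_factory=dict)
    delivered: list[str] = field(default_factory=list)

    def receive(self, seq: int, payload: str) -> None:
        if seq < self.expected:
            return  # already delivered: ignore the duplicate
        self.pending[seq] = payload  # buffer until its turn comes
        # Deliver every message that is now contiguous with what was delivered.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1


primary = Primary()
ordered = [primary.order(p) for p in ("a", "b", "c")]

backup = Backup()
# Simulate reordering and duplication on the network.
for seq, payload in (ordered[2], ordered[0], ordered[0], ordered[1]):
    backup.receive(seq, payload)

print(backup.delivered)
```

Despite the out-of-order and duplicated arrivals, the backup delivers the messages in the order chosen by the primary. The same record-at-primary, replay-at-backup pattern is, in spirit, what the abstract describes for the Virtual Determinizer Framework, although the actual mechanisms are not specified here.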
