Transparent fault-tolerant Java virtual machine

Replication is one of the prominent approaches to obtaining fault tolerance. Implementing replication on commodity hardware in a transparent fashion, i.e., without changing the programming model, poses many challenges. The level at which replication is implemented affects both development costs and the portability of programs. Further difficulties lie in coordinating the replicas in the face of non-determinism. We report on an implementation of transparent fault tolerance at the virtual machine level of Java. We describe the design of the system and present performance results that, in certain cases, are equivalent to those of non-replicated executions. We also discuss design decisions that stem from implementing replication at the virtual machine level, and the special considerations needed to support symmetric multiprocessors (SMP).
