Jgroup/ARM: a distributed object group platform with autonomous replication management

This paper presents the design and implementation of Jgroup/ARM, a distributed object group platform with autonomous replication management, along with a novel measurement‐based assessment technique used to validate its fault‐handling capability. Jgroup extends Java RMI with the group communication paradigm and is designed specifically to support applications in partitionable systems. ARM aims to improve the dependability characteristics of systems through a fault‐treatment mechanism. It therefore focuses on deployment and operational aspects, where the gain in dependability is likely to be greatest. The main objective of ARM is to localize failures and to reconfigure the system according to application‐specific dependability requirements. Combining Jgroup and ARM can significantly reduce the effort needed to develop, deploy, and manage dependable, partition‐aware applications. Jgroup/ARM is evaluated experimentally by measuring the recovery performance of a system deployed in a wide area network. In this experiment, multiple nearly coincident reachability changes are injected to emulate network partitions separating the service replicas. The results show that Jgroup/ARM is able to recover applications to their initial state in several realistic failure scenarios, including multiple, concurrent network partitions. Copyright © 2007 John Wiley & Sons, Ltd.
