Group communication protocols under errors

Group communication protocols constitute a basic building block for highly dependable distributed applications. Designing and correctly implementing a group communication system (GCS) is a difficult task. While many theoretical algorithms have been formalized and proved correct, only a few research projects have experimentally assessed the dependability of GCS implementations under complex error scenarios. This paper describes a thorough error-injection experimental campaign conducted on Ensemble, a popular GCS. By employing synthetic benchmark applications, we stress selected components of the GCS (the group membership service and the FIFO-ordered reliable multicast) under various error models, including errors in memory (text and heap segments) and in network messages. The data show that about 5-6% of the failures are due to an error escaping Ensemble's error-containment mechanisms and manifesting as a fail-silence violation. This constitutes an impediment to achieving high dependability, the natural objective of GCSs. Our results are derived for a particular system (Ensemble), and more investigation involving other GCSs is required to generalize the conclusions. Nevertheless, through an accurate analysis of the failure causes and the error propagation patterns, this paper offers insights into the design and implementation of robust GCSs.
