An experimental evaluation of correlated network partitions in the Coda distributed file system

Experimental evaluation is an important way to assess distributed systems, and fault injection is the dominant technique for evaluating a system's dependability. For distributed systems, network failure is an important fault model. Physical network failures often have far-reaching effects, giving rise to multiple correlated failures as seen by higher-level protocols. This paper presents an experimental evaluation, conducted with the Loki fault injector, that provides insight into the impact of correlated network partitions on the Coda distributed file system. In this evaluation, Loki created a network partition between two Coda file servers, during which updates were made at each server to the same replicated data volume. Upon repair of the partition, a client requested directory resolution to reconcile the divergent replicas. At various stages of the resolution, Loki invoked a second, correlated network partition, allowing us to evaluate its impact on the system's correctness, performance, and availability.
