On Correlated Failures in Survivable Storage Systems (CMU-CS-02-129)

The design of survivable storage systems involves inherent trade-offs among properties such as performance, security, and availability. A toolbox of simple, accurate models of these properties lets a designer make informed decisions. This report focuses on availability modeling. We describe two ways of extending the classic availability model, which assumes independent failures, with a single “correlation parameter” to accommodate correlated failures. We evaluate the efficacy of the models by comparing their results with real measurements. We also show how the models serve as design decision tools: we analyze the effects of availability and correlation on the ordering of data distribution schemes, and we investigate the placement of related files.

Acknowledgements: This work is partially funded by DARPA/ITO’s Organically Assured and Survivable Information Systems (OASIS) Program (Air Force contract number F30602-99-2-0539-AFRL). We thank the members and companies of the PDL Consortium (including EMC, HP, Hitachi, IBM, Intel, Network Appliance, Panasas, Seagate, Sun and Veritas) for their interest, insights and support. We also thank Microsoft Corporation for sharing its availability measurements of the desktops on its campus.
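As a rough illustration of what extending the classic model with a single correlation parameter can look like, the sketch below computes the availability of an m-of-n data distribution scheme (data is readable when at least m of n storage nodes are up). The independent case uses the standard binomial model. The correlated case uses a beta-binomial distribution parameterized by a correlation value `rho`; this choice is an assumption made for illustration and is not confirmed here to be either of the report's two models. The function names and parameters are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def availability_independent(n, m, p):
    """P(at least m of n nodes are up), each node up independently with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


def availability_correlated(n, m, p, rho):
    """Same quantity under a beta-binomial model (illustrative assumption).

    p is the per-node availability (0 < p < 1) and rho is the pairwise
    correlation between node states (0 <= rho < 1). rho == 0 reduces to
    the independent model.
    """
    if rho == 0:
        return availability_independent(n, m, p)
    # Beta parameters chosen so the mean is p and the correlation is rho.
    a = p * (1 - rho) / rho
    b = (1 - p) * (1 - rho) / rho
    return sum(
        comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))
        for k in range(m, n + 1)
    )


if __name__ == "__main__":
    # Three-way replication (1-of-3) with 90% per-node availability.
    print(availability_independent(3, 1, 0.9))
    print(availability_correlated(3, 1, 0.9, 0.5))
```

In this sketch, three-way replication loses availability as `rho` grows, because replicas tend to fail together. That is the kind of effect that could change the ordering of data distribution schemes.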
