Analysis of failure correlation impact on peer-to-peer storage systems

Peer-to-peer storage systems aim to provide reliable long-term storage at low cost. Because peers in such systems fail continuously, self-repairing mechanisms are necessary to achieve high durability. In this paper, we propose and study analytical models that assess the bandwidth consumption and the probability of data loss in storage systems that use erasure-coded redundancy. We show by simulation that the classical stochastic approach found in the literature, which models each block independently, gives a correct approximation of the system's average behavior but fails to capture its variations over time. These variations are caused by the simultaneous loss of multiple data blocks when a peer fails (or leaves the system). We then propose a new stochastic model, based on a fluid approximation, that better captures the system behavior: in addition to the expected behavior, it correctly estimates the standard deviation. This new model is validated by simulations.
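
The gap between the two modeling views can be illustrated with a toy Monte Carlo sketch. This is not the Markov-chain or fluid model studied in the paper; it only assumes a hypothetical system in which every peer holds the same number of blocks and fails independently with a fixed probability per time step. The function name `simulate_losses` and all parameter values are illustrative assumptions. The sketch compares blocks lost per step when losses are drawn per block (the independent view) and when they are drawn per peer (the correlated view).

```python
import random
import statistics


def simulate_losses(num_peers, blocks_per_peer, p_fail, steps, correlated, seed=0):
    """Return the number of blocks lost at each time step.

    correlated=False: every block is lost independently with probability p_fail.
    correlated=True:  every peer fails with probability p_fail and takes all
                      of its blocks with it.
    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(steps):
        if correlated:
            failed_peers = sum(rng.random() < p_fail for _ in range(num_peers))
            losses.append(failed_peers * blocks_per_peer)
        else:
            total_blocks = num_peers * blocks_per_peer
            losses.append(sum(rng.random() < p_fail for _ in range(total_blocks)))
    return losses


if __name__ == "__main__":
    params = dict(num_peers=100, blocks_per_peer=20, p_fail=0.01, steps=500)
    for correlated in (False, True):
        losses = simulate_losses(correlated=correlated, **params)
        label = "per-peer (correlated)" if correlated else "per-block (independent)"
        print(label,
              "mean:", round(statistics.mean(losses), 2),
              "std:", round(statistics.pstdev(losses), 2))
```

Under these assumptions both views have the same expected loss per step (number of blocks times the failure probability), but the per-peer view has a standard deviation larger by a factor of the square root of the number of blocks per peer. This is the kind of variation that a block-independent model cannot reproduce.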
