USENIX Association Proceedings of the First Symposium on Networked Systems Design and Implementation

Availability is a storage system property that is both highly desired and yet minimally engineered. While many systems provide mechanisms to improve availability – such as redundancy and failure recovery – how to best configure these mechanisms is typically left to the system manager. Unfortunately, few individuals have the skills to properly manage the trade-offs involved, let alone the time to adapt these decisions to changing conditions. Instead, most systems are configured statically and with only a cursory understanding of how the configuration will impact overall performance or availability. While this issue can be problematic even for individual storage arrays, it becomes increasingly important as systems are distributed – and absolutely critical for the wide-area peer-to-peer storage infrastructures being explored. This paper describes the motivation, architecture and implementation for a new peer-to-peer storage system, called TotalRecall, that automates the task of availability management. In particular, the TotalRecall system automatically measures and estimates the availability of its constituent host components, predicts their future availability based on past behavior, calculates the appropriate redundancy mechanisms and repair policies, and delivers user-specified availability while maximizing efficiency.
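
To make the step from measured host availability to a redundancy level concrete, the following is a minimal sketch rather than the calculation TotalRecall itself performs. It assumes whole-file replication and independent host failures, so that a file with c replicas on hosts of availability p is unavailable with probability (1 - p)^c. The function names and the probe-based estimator are hypothetical and are not taken from the abstract.

```python
import math


def estimate_host_availability(probe_results):
    """Estimate availability as the fraction of past probes that found the host up.

    probe_results: iterable of booleans (True = host responded).
    This simple estimator is an illustrative assumption.
    """
    results = list(probe_results)
    if not results:
        raise ValueError("need at least one probe result")
    return sum(1 for up in results if up) / len(results)


def replicas_needed(host_availability, target_availability):
    """Smallest replica count c with 1 - (1 - p)^c >= target.

    Assumes independent host failures and whole-file replication.
    """
    if not 0.0 < host_availability < 1.0:
        raise ValueError("host_availability must be strictly between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be strictly between 0 and 1")
    c = math.log(1.0 - target_availability) / math.log(1.0 - host_availability)
    return max(1, math.ceil(c))


if __name__ == "__main__":
    # Hypothetical probe history: host was up for 5 of 10 probes.
    p = estimate_host_availability([True, False] * 5)
    print(p, replicas_needed(p, 0.99))
```

Under these assumptions, less available hosts require more replicas to reach the same target, which is why a static configuration can become either wasteful or insufficient as host behavior changes.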
