HYDRAstor: A Scalable Secondary Storage

HYDRAstor is a scalable secondary storage solution aimed at the enterprise market. The system consists of a back-end architected as a grid of storage nodes built around a distributed hash table, and a front-end consisting of a layer of access nodes that implement a traditional file system interface and can be scaled in number for increased performance. This paper concentrates on the back-end, which is, to our knowledge, the first commercial implementation of scalable, high-performance content-addressable secondary storage delivering global duplicate elimination, per-block user-selectable failure resiliency, and self-maintenance, including automatic recovery from failures with rebuilding of both data and the network overlay. The back-end programming model is based on an abstraction of a sea of variable-sized, content-addressed, immutable, highly resilient data blocks organized in a DAG (directed acyclic graph). This model is exported through a low-level API that allows clients to implement new access protocols and add them to the system on-line; the API has been validated with an implementation of the file system interface. The critical factor in meeting the design targets has been the selection of a proper data organization based on redundant chains of data containers. We present this organization in detail and describe how it is used to deliver the required data services. Surprisingly, the most complex service to deliver turned out to be on-demand data deletion, followed (not surprisingly) by the management of data consistency and integrity.
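To illustrate the core abstraction the abstract describes — variable-sized, content-addressed, immutable blocks that may point to other blocks, forming a DAG with global duplicate elimination — here is a minimal sketch. The class and method names are hypothetical, not the HYDRAstor API, and the sketch omits resiliency classes, erasure coding, and distribution across nodes:

```python
import hashlib

class BlockStore:
    """Toy content-addressed block store (illustrative only).

    Each block is immutable and addressed by the SHA-256 hash of its
    content plus its child pointers, so a block's address fixes the
    entire subgraph below it, and writing identical content twice
    stores it only once (duplicate elimination)."""

    def __init__(self):
        self._blocks = {}  # address -> (data, tuple of child addresses)

    def write(self, data: bytes, children=()) -> str:
        # Derive the address from content and child pointers.
        h = hashlib.sha256()
        h.update(data)
        for child_addr in children:
            h.update(child_addr.encode())
        addr = h.hexdigest()
        # Immutability: an existing block is never overwritten.
        self._blocks.setdefault(addr, (data, tuple(children)))
        return addr

    def read(self, addr: str):
        return self._blocks[addr]

store = BlockStore()
a = store.write(b"chunk-1")
b = store.write(b"chunk-1")          # same content -> same address, stored once
root = store.write(b"index", [a])    # a block referencing another: a DAG edge
```

Because addresses are derived from content, pointer blocks such as `root` act like the paper's retention roots: a client can name an entire tree of data by a single address, and identical chunks written by different clients collapse into one stored block.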
