RADOS: a scalable, reliable storage service for petabyte-scale storage clusters

Brick and object-based storage architectures have emerged as a means of improving the scalability of storage clusters. However, existing systems continue to treat storage nodes as passive devices, despite their ability to exhibit significant intelligence and autonomy. We present the design and implementation of RADOS, a reliable object storage service that can scales to many thousands of devices by leveraging the intelligence present in individual storage nodes. RADOS preserves consistent data access and strong safety semantics while allowing nodes to act semi-autonomously to self-manage replication, failure detection, and failure recovery through the use of a small cluster map. Our implementation offers excellent performance, reliability, and scalability while providing clients with the illusion of a single logical object store.

[1]  Scott A. Brandt,et al.  Reliability mechanisms for very large storage systems , 2003, 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003. (MSST 2003). Proceedings..

[2]  Wei Chen,et al.  On the Impact of Replica Placement to the Reliability of Distributed Brick Storage Systems , 2005, 25th IEEE International Conference on Distributed Computing Systems (ICDCS'05).

[3]  J. D. Day,et al.  A principle for resilient sharing of distributed resources , 1976, ICSE '76.

[4]  Leslie Lamport,et al.  The part-time parliament , 1998, TOCS.

[5]  David R. Karger,et al.  Wide-area cooperative storage with CFS , 2001, SOSP.

[6]  David R. Karger,et al.  Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web , 1997, STOC '97.

[7]  James Lee Hafner,et al.  REO: A Generic RAID Engine and Optimizer , 2007, FAST.

[8]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[9]  Brent Welch,et al.  Managing Scalability in Object Storage Systems for HPC Linux Clusters , 2004, MSST.

[10]  Andreas Haeberlen,et al.  Glacier: highly durable, decentralized storage despite massive correlated failures , 2005, NSDI.

[11]  Frank B. Schmuck,et al.  GPFS: A Shared-Disk File System for Large Computing Clusters , 2002, FAST.

[12]  Jim Zelenka,et al.  A cost-effective, high-bandwidth storage architecture , 1998, ASPLOS VIII.

[13]  Scott A. Brandt,et al.  The Design and Implementation of AQuA: An Adaptive Quality of Service Aware Object-Based Storage Device , 2006 .

[14]  Richard A. Golding,et al.  The design and evaluation of network RAID protocols , 2004 .

[15]  Richard A. Golding,et al.  D-SPTF: decentralized request distribution in brick-based storage systems , 2004, ASPLOS XI.

[16]  Gregory R. Ganger,et al.  Self-* Storage: Brick-based Storage with Automated Administration (CMU-CS-03-178) , 2003 .

[17]  Arif Merchant,et al.  FAB: building distributed enterprise disk arrays from commodity components , 2004, ASPLOS XI.

[18]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[19]  GhemawatSanjay,et al.  The Google file system , 2003 .

[20]  Joseph S. Glider,et al.  IBM Research Report Kybos: Self-Management for Distributed Brick-Based Storage , 2005 .

[21]  Jacob R. Lorch,et al.  Farsite: federated, available, and reliable storage for an incompletely trusted environment , 2002, OSDI '02.

[22]  Marc Najork,et al.  Boxwood: Abstractions as the Foundation for Storage Infrastructure , 2004, OSDI.

[23]  Ben Y. Zhao,et al.  Tapestry: a resilient global-scale overlay for service deployment , 2004, IEEE Journal on Selected Areas in Communications.

[24]  S.A. Brandt,et al.  CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[25]  Tao Yang,et al.  The Panasas ActiveScale Storage Cluster - Delivering Scalable High Bandwidth Storage , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[26]  Gregory R. Ganger,et al.  Ursa minor: versatile cluster-based storage , 2005, FAST'05.

[27]  Antony I. T. Rowstron,et al.  Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility , 2001, SOSP.

[28]  Robbert van Renesse,et al.  Chain Replication for Supporting High Throughput and Availability , 2004, OSDI.

[29]  Tao Yang,et al.  A Self-Organizing Storage Cluster for Parallel Data-Intensive Applications , 2004, Proceedings of the ACM/IEEE SC2004 Conference.

[30]  Berthier A. Ribeiro-Neto,et al.  Comparing random data allocation and data striping in multimedia servers , 2000, SIGMETRICS '00.

[31]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[32]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.