Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication

Cloud storage systems typically use three-way random replication to guard against data loss within the cluster, and rely on cluster geo-replication to protect against correlated failures. This paper presents a much lower-cost alternative to full cluster geo-replication. We demonstrate that in practical settings, two replicas are sufficient for protecting against independent node failures, while three random replicas are inadequate for protecting against correlated node failures. We present Tiered Replication, a replication scheme that splits the cluster into a primary and a backup tier. The first two replicas are stored on the primary tier and are used to recover data in the case of independent node failures, while the third replica is stored on the backup tier and is used to protect against correlated failures. The key insight of our paper is that, since the third replicas are rarely read, we can place the backup tier on separate physical infrastructure or in a remote location without affecting performance. This separation significantly increases the resilience of the storage system to correlated failures and provides a low-cost alternative to geo-replication of an entire cluster. In addition, the Tiered Replication algorithm minimizes the probability of data loss under correlated failures. Tiered Replication can be executed incrementally for each cluster change, which allows it to support dynamic environments in which nodes join and leave the cluster, and it accommodates additional data placement constraints required by the storage designer, such as network and rack awareness. We have implemented Tiered Replication on HyperDex, an open-source cloud storage system, and demonstrate that it incurs a small performance overhead. Tiered Replication improves the cluster-wide mean time to failure (MTTF) by a factor of 20,000 compared to random replication and by a factor of 20 compared to previous non-random replication schemes, without increasing the amount of storage.
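
The tiered placement rule described above can be illustrated with a short sketch. The Python snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the node names, the place_replicas function, and the random selection within each tier are hypothetical stand-ins, whereas the actual Tiered Replication algorithm chooses primary-tier pairs so as to minimize the number of distinct copysets and honors network and rack constraints.

```python
# Minimal sketch of tiered replica placement (hypothetical API, not the
# paper's implementation). Assumes the cluster has already been split into
# a primary tier and a backup tier; the copyset-minimizing placement and
# rack/network awareness of the real algorithm are omitted.

import random
from typing import List, Tuple

def place_replicas(chunk_id: int,
                   primary_tier: List[str],
                   backup_tier: List[str]) -> Tuple[str, str, str]:
    """Return (replica1, replica2, replica3) node names for one chunk.

    The first two replicas live on the primary tier and handle reads and
    recovery from independent node failures; the third replica lives on
    the backup tier (separate infrastructure or a remote site) and is
    read only after a correlated failure destroys both primary copies.
    """
    # Two distinct primary-tier nodes hold the frequently accessed replicas.
    r1, r2 = random.sample(primary_tier, 2)
    # One backup-tier node holds the rarely read third replica.
    r3 = random.choice(backup_tier)
    return r1, r2, r3

if __name__ == "__main__":
    primary = [f"primary-{i}" for i in range(8)]
    backup = [f"backup-{i}" for i in range(4)]
    print(place_replicas(42, primary, backup))
```

Because the backup tier is never on the read or write critical path for healthy data, it can sit on cheaper or geographically separate hardware, which is what makes this scheme a low-cost substitute for geo-replicating the whole cluster.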
