Early Identification of Critical Blocks: Making Replicated Distributed Storage Systems Reliable Against Node Failures

In large-scale replicated distributed storage systems consisting of hundreds to thousands of nodes, node failures are common and can cause data blocks to lose replicas and become faulty. A simple but effective way to prevent data loss from node failures, i.e., to ensure reliability, is to shorten the time it takes to identify failed nodes and faulty blocks, which is determined by both the timeouts and the check intervals for node states. In practice, however, to keep repair network traffic low, the identification time is set relatively long and can even dominate the repair process of critical blocks. In this paper, we propose a novel scheme, named RICK, that exploits this identification time to improve the data reliability of replicated distributed storage systems while maintaining a low repair cost. First, by introducing an additional replica state, critical blocks (those with two or more lost replicas) are given individually short timeouts, while sick blocks (those with only one lost replica) keep the long timeouts. Second, by replacing the static check intervals for node states with adaptive ones, the check intervals and hence the identification time of critical blocks are further shortened, which improves data reliability. Meanwhile, because critical blocks account for only a small fraction of all faulty blocks, the repair network traffic remains low. Results from our simulation and prototype implementation show that RICK improves the data reliability of replicated distributed storage systems by a factor of up to 14 in terms of mean time to data loss, while the extra repair network traffic it introduces is less than 1.5 percent of the total network traffic for data repairs.
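
The sketch below illustrates the core idea described above: blocks with two or more lost replicas ("critical") get a short identification timeout and a tightened node-state check interval, while blocks with a single lost replica ("sick") keep the long timeout that keeps repair traffic low. This is a minimal illustrative sketch, not the authors' implementation; all names and constants (LONG_TIMEOUT_S, SHORT_TIMEOUT_S, the specific adaptation rule, etc.) are assumptions, since the abstract does not give concrete values.

```python
from dataclasses import dataclass

# Illustrative constants only; the paper does not specify these values.
LONG_TIMEOUT_S = 15 * 60        # assumed default timeout before a lost replica is repaired
SHORT_TIMEOUT_S = 60            # assumed shortened timeout for critical blocks
BASE_CHECK_INTERVAL_S = 3 * 60  # assumed static check interval for node states
MIN_CHECK_INTERVAL_S = 30       # assumed lower bound for the adaptive interval


@dataclass
class Block:
    block_id: str
    replication_factor: int
    live_replicas: int

    @property
    def lost_replicas(self) -> int:
        return self.replication_factor - self.live_replicas

    @property
    def is_critical(self) -> bool:
        # "critical": two or more replicas lost; "sick": exactly one lost.
        return self.lost_replicas >= 2


def identification_timeout(block: Block) -> int:
    """Per-block timeout: short for critical blocks, long for sick blocks."""
    return SHORT_TIMEOUT_S if block.is_critical else LONG_TIMEOUT_S


def adaptive_check_interval(blocks: list[Block]) -> int:
    """Shrink the node-state check interval while any critical block exists.

    One plausible adaptation rule; the abstract only states that static
    check intervals are replaced with adaptive ones.
    """
    if any(b.is_critical for b in blocks):
        return MIN_CHECK_INTERVAL_S
    return BASE_CHECK_INTERVAL_S


if __name__ == "__main__":
    blocks = [
        Block("blk_001", replication_factor=3, live_replicas=1),  # critical
        Block("blk_002", replication_factor=3, live_replicas=2),  # sick
    ]
    for b in blocks:
        print(b.block_id, identification_timeout(b), adaptive_check_interval(blocks))
```

Because only a small fraction of faulty blocks are critical, shortening timeouts and check intervals for just that subset leaves the aggregate repair traffic essentially unchanged, which is the trade-off the abstract highlights.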
