An Efficient I/O-Redirection-Based Reconstruction Scheme for Erasure-Coded Storage Clusters

This paper addresses the I/O interference problem that arises during on-line reconstruction in erasure-coded storage clusters, where user I/Os compete with reconstruction I/Os for both disk and network bandwidth. We propose a redirection scheme called RAM-RS to minimize the interference between user and reconstruction requests. RAM-RS redirects user reads and writes targeted at failed nodes to an RS-coded RAM region, which is formed from pre-allocated main memory on the surviving nodes and protected by Reed-Solomon (RS) coding. The RS-coded RAM region quickly serves all user read and write misses, so a rebuilding node can devote its disk and network bandwidth to node reconstruction. The RAM region also substantially reduces the amount of data the rebuilding node must rebuild, because (1) missed writes are buffered in the RAM region and (2) missed reads are satisfied by having surviving nodes co-rebuild the failed blocks. We build two Markov models to estimate the reliability of the RAM-RS system. Modeling results show that the mean time to data loss (MTTDL) of the RS-coded RAM region in a storage cluster is larger than that of the same cluster composed of surviving nodes. We implement both RAM-RS and the traditional Redirection scheme in an erasure-coded storage cluster, on which real-world I/O traces are replayed. Experimental results show that, compared with the Redirection scheme running on a 9-node storage cluster, RAM-RS improves user response time and reconstruction time by factors of 1.78 and 1.20, respectively.
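To make the reliability metric concrete, the following is a minimal sketch of how an MTTDL can be computed from a Markov model. It is not one of the paper's two models: it assumes a generic birth-death chain in which each of `n` nodes fails independently at rate `lam`, a single repair proceeds at rate `mu` whenever at least one node is down, and data is lost once more than `m` nodes are down at the same time. The function name `mttdl` and all parameter values are hypothetical and chosen only for illustration.

```python
def mttdl(n, m, lam, mu):
    """Mean time to data loss for an n-node group that tolerates m failures.

    Assumed model (illustrative, not taken from the paper):
      state i = number of failed nodes, i = 0..m; state m+1 is data loss
      failure rate in state i: (n - i) * lam
      repair rate in state i:  mu if i > 0, else 0 (one repair at a time)
    """
    if not (0 <= m < n) or lam <= 0 or mu < 0:
        raise ValueError("require 0 <= m < n, lam > 0, mu >= 0")

    total = 0.0  # expected time to go from 0 failures to data loss
    step = 0.0   # expected time to first move from i failures to i + 1
    for i in range(m + 1):
        fail_rate = (n - i) * lam
        repair_rate = mu if i > 0 else 0.0
        # First-passage recurrence for a birth-death chain:
        # step_i = (1 + repair_rate * step_{i-1}) / fail_rate
        step = (1.0 + repair_rate * step) / fail_rate
        total += step
    return total


# Sanity check against the closed form for a two-way mirror,
# (3*lam + mu) / (2*lam**2), with lam = 1 and mu = 10.
print(mttdl(2, 1, 1.0, 10.0))  # 6.5
```

Under these assumptions, raising the repair rate `mu` or the number of tolerated failures `m` increases the MTTDL, which is the kind of comparison a model of this type supports.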
