PITR: An Efficient Single-Failure Recovery Scheme for PIT-Coded Cloud Storage Systems

In cloud storage systems, erasure coding leads to high read latency and long recovery times when a drive or node fails. In this paper, we design parity independent array codes (PIT), a variation of the STAR code that is triple-fault tolerant and nearly space-optimal, and propose an efficient single-failure recovery scheme (PITR) for them to mitigate this problem. In addition, we present a "shortened" version of PIT (SPIT) to further reduce the recovery cost. As a result, recovery consumes less disk I/O and fewer network resources, which shortens recovery time and yields high system reliability and availability.
