Replicated convolutional codes: A design framework for repair-efficient distributed storage codes

Erasure-coded distributed storage systems offer reliable storage in a cost-effective manner. When disk failures occur in such systems, however, the lost data must be recreated with the help of surviving nodes so that the data redundancy is preserved. A key requirement during the recovery process is to minimize both the repair locality (the number of surviving nodes contacted) and the computational complexity. We propose a simple framework for constructing storage codes that yield small repair locality and low repair complexity. The basic idea is to take multiple instances of tail-biting convolutional codes and then carefully arrange the coded symbols on the storage nodes. The resulting codes enjoy the desirable repair-by-transfer property and perform repair using simple XOR operations. We also evaluate the proposed codes on an HDFS cluster testbed and compare their empirical performance with that of state-of-the-art repair-efficient storage codes.
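
To make the XOR-based repair idea concrete, the following is a minimal sketch and not the construction proposed in the paper. It assumes a toy rate-1/2 systematic tail-biting convolutional code over GF(2) with parity p[i] = d[i] XOR d[(i+1) mod k]; this generator and the function names `encode_tail_biting` and `repair_data_symbol` are hypothetical choices made for illustration only. The placement of symbols on storage nodes and the replication of code instances are not modelled.

```python
from typing import List


def encode_tail_biting(data: List[int]) -> List[int]:
    """Compute parities for a toy tail-biting code (assumed generator).

    p[i] = d[i] XOR d[(i + 1) % k]. The wrap-around index is what makes
    the code tail-biting: the last parity depends on the first symbol.
    """
    k = len(data)
    return [data[i] ^ data[(i + 1) % k] for i in range(k)]


def repair_data_symbol(i: int, data: List[int], parity: List[int]) -> int:
    """Recover d[i] from one parity and one neighbouring data symbol.

    Since p[i] = d[i] XOR d[i+1], we have d[i] = p[i] XOR d[i+1]:
    two symbols are read and a single XOR is performed.
    """
    k = len(parity)
    return parity[i] ^ data[(i + 1) % k]


# Symbols are small integers here; in practice they would be byte blocks.
data = [0x5, 0x3, 0x7, 0x1]
parity = encode_tail_biting(data)

# Simulate losing d[2], then repair it from p[2] and d[3].
lost_index = 2
damaged = list(data)
damaged[lost_index] = 0  # placeholder for the lost symbol
recovered = repair_data_symbol(lost_index, damaged, parity)
```

In this toy code each lost data symbol is rebuilt from only two surviving symbols with one XOR, which illustrates the kind of low repair locality and low repair complexity the abstract describes.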
