Architecture-aware Coding for Distributed Storage: Repairable Block Failure Resilient Codes

In large-scale distributed storage systems (DSS) deployed in cloud computing, correlated failures that cause entire blocks of nodes to fail (or become unavailable) simultaneously are common. In such scenarios, the stored data or the content of a failed node can only be reconstructed from the live nodes in the available blocks. To analyze the resilience of the system against such block failures, this work introduces the framework of Block Failure Resilient (BFR) codes, wherein the data (e.g., a file in a DSS) can be decoded by reading the same number of codeword symbols (nodes) from each of a subset of the available blocks of the underlying codeword. Further, repairable BFR codes are introduced, wherein any codeword symbol in a failed block can be repaired by contacting a subset of the remaining blocks in the system. File size bounds for repairable BFR codes are derived, the trade-off between per-node storage and repair bandwidth is analyzed, and the corresponding minimum storage regenerating (BFR-MSR) and minimum bandwidth regenerating (BFR-MBR) points are derived. Explicit codes achieving the two operating points are constructed for a special case of parameters, wherein the underlying regenerating codewords are distributed to BFR codeword symbols according to combinatorial designs. Finally, BFR locally repairable codes (BFR-LRC) are introduced, an upper bound on the resilience is derived, and optimal code constructions are provided via a concatenation of Gabidulin and MDS codes. The repair efficiency of BFR-LRCs is further studied via the use of BFR-MSR/MBR codes as local codes. Code constructions achieving optimal resilience for BFR-MSR/MBR-LRCs are provided for certain parameter regimes. Overall, this work introduces the framework of block failures along with optimal code constructions, and advances the study of architecture-aware coding for distributed storage systems.
