Generalized Rack-aware Regenerating Codes for Jointly Optimal Node and Rack Repairs

In data centers, storage nodes are organized in racks, and the cross-rack communication bandwidth is often much lower than the intra-rack communication bandwidth. Two common failures in data centers are single-node failures and single-rack failures. In this paper, we study the problem of minimizing the cross-rack repair bandwidth when repairing both single-node failures and single-rack failures. Given that the minimum cross-rack repair bandwidth for single-node failures is achieved, we characterize the optimal trade-off between storage and cross-rack repair bandwidth for single-rack failures. We further propose a general family of storage codes, Generalized Rack-aware Regenerating Codes (GRRC), that achieve this optimal trade-off. We obtain the two extreme points of GRRC, namely the minimum storage generalized rack-aware regeneration (MSGRR) point and the minimum bandwidth generalized rack-aware regeneration (MBGRR) point. We show that MSGRR codes have strictly less cross-rack repair bandwidth for single-rack failures than the related minimum storage multi-node repair codes for most parameters. We also show that MBGRR codes have less cross-rack repair bandwidth for single-rack failures than the minimum bandwidth multi-node repair codes for all the parameters we evaluated.
