LDPC Code Design for Distributed Storage: Balancing Repair Bandwidth, Reliability, and Storage Overhead

Distributed storage systems incur significant repair traffic because storage nodes fail frequently. This paper shows that properly designed low-density parity-check (LDPC) codes can substantially reduce the number of block downloads required for repair, thanks to the sparsity of their factor graph representation. In particular, with a careful construction of the factor graph, both low repair bandwidth and high reliability can be achieved for a given code rate. First, a formula for the average repair bandwidth of LDPC codes is developed. This formula is then used to establish that the minimum repair bandwidth is achieved by enforcing a regular check node degree in the factor graph. Moreover, it is shown that, for a fixed code rate, the variable node degree should also be regular to yield the minimum repair bandwidth, under a reasonable constraint on the minimum variable node degree. It is also shown that, for a given repair-bandwidth requirement, LDPC codes can yield substantially higher reliability than the Reed–Solomon codes currently in use. The reliability analysis is based on a formulation of the general equation for the mean time to data loss (MTTDL) associated with LDPC codes. The formulation reveals that the MTTDL is closely related to the stopping number of the code. It is further shown that LDPC codes can be designed so that a small loss in repair-bandwidth optimality is traded for a large improvement in erasure-correction capability, and thus in the MTTDL.
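
As a rough illustration of why a sparse factor graph keeps repair traffic low, the sketch below counts block downloads for a single-node repair. It is not the paper's formula: it assumes that one failed block is rebuilt from a single parity check covering it, so the cost is that check's degree minus one. The matrix `H` and the function names are hypothetical and chosen only for this example.

```python
def single_repair_cost(H, failed):
    """Fewest blocks to download to rebuild block `failed` from one check.

    Assumption for this sketch: exactly one block is missing, so any check
    (row of H) that covers it can rebuild it from the other blocks in that row.
    """
    degrees = [sum(row) for row in H if row[failed] == 1]
    if not degrees:
        raise ValueError("block is not covered by any check")
    return min(degrees) - 1


def average_repair_cost(H):
    """Average single-failure repair cost over all blocks (columns of H)."""
    n = len(H[0])
    return sum(single_repair_cost(H, v) for v in range(n)) / n


# Hypothetical toy parity-check matrix: 6 blocks, 3 checks,
# every check has degree 4 and every block has degree 2.
H = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

print(average_repair_cost(H))
```

In this toy graph every check has degree 4, so each single-block repair reads 3 other blocks regardless of which block failed.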
