Storage vs repair bandwidth for network erasure coding in distributed storage systems

Network coding is used in peer-to-peer storage systems, archival storage, wireless networks, satellite communication, and video conferencing, among other areas. A distributed storage system stores data at multiple locations. For the data to be available, durable, and reliable, the system must be able to recover from failures efficiently. This paper examines and evaluates the different approaches to recovery in storage systems. Keeping replicas of the data at multiple locations is the traditional technique used by major storage systems. To reduce the amount of storage that replication requires, distributed systems are now transitioning towards erasure codes. Several approaches, such as hybrid and regenerating codes, address the trade-off between storage and repair bandwidth, but the communication cost in the face of failures still needs improvement. The paper examines and analyzes these approaches and their main application areas, and presents a comparative analysis based on storage requirement, disk access, repair bandwidth, and unavailability probability.
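To make the storage versus repair-bandwidth trade-off concrete, the following minimal sketch compares three schemes for a single node failure: r-way replication, an (n, k) MDS erasure code, and a regenerating code at the minimum-storage (MSR) point. It uses the standard textbook cost formulas, not figures from this paper. The function names, the file size of 1200 units, and the parameters n = 6, k = 4, d = 5 are illustrative assumptions.

```python
from fractions import Fraction


def replication(file_size, replicas):
    """Total storage and single-failure repair traffic for r-way replication."""
    total_storage = file_size * replicas
    repair_traffic = file_size  # copy one surviving replica in full
    return total_storage, repair_traffic


def mds_erasure(file_size, n, k):
    """(n, k) MDS code: each of n nodes stores a fragment of size file_size / k."""
    per_node = Fraction(file_size, k)
    total_storage = per_node * n
    # Conventional repair downloads k fragments (the whole file)
    # to rebuild a single lost fragment of size file_size / k.
    repair_traffic = per_node * k
    return total_storage, repair_traffic


def msr_regenerating(file_size, n, k, d):
    """Minimum-storage regenerating code contacting d helper nodes for repair."""
    if not (k <= d <= n - 1):
        raise ValueError("d must satisfy k <= d <= n - 1")
    per_node = Fraction(file_size, k)
    total_storage = per_node * n
    # Each helper sends file_size / (k * (d - k + 1)); d helpers in total.
    repair_traffic = Fraction(file_size * d, k * (d - k + 1))
    return total_storage, repair_traffic


if __name__ == "__main__":
    M = 1200  # illustrative file size, in arbitrary units
    print("3-way replication:", replication(M, 3))
    print("(6, 4) MDS code:  ", mds_erasure(M, 6, 4))
    print("(6, 4) MSR, d = 5:", msr_regenerating(M, 6, 4, 5))
```

Under these assumptions, the erasure code halves the storage of 3-way replication (1800 versus 3600 units) but still moves the full 1200 units to repair a fragment of only 300 units. The regenerating code keeps the same storage and lowers repair traffic to 750 units.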
