Opening the Chrysalis: On the Real Repair Performance of MSR Codes

Large distributed storage systems use erasure codes to reliably store data. Compared to replication, erasure codes are capable of reducing storage overhead. However, repairing lost data in an erasure coded system requires reading from many storage devices and transferring over the network large amounts of data. Theoretically, Minimum Storage Regenerating (MSR) codes can significantly reduce this repair burden. Although several explicit MSR code constructions exist, they have not been implemented in real-world distributed storage systems. We close this gap by providing a performance analysis of Butterfly codes, systematic MSR codes with optimal repair I/O. Due to the complexity of modern distributed systems, a straightforward approach does not exist when it comes to implementing MSR codes. Instead, we show that achieving good performance requires to vertically integrate the code with multiple system layers. The encoding approach, the type of inter-node communication, the interaction between different distributed system layers, and even the programming language have a significant impact on the code repair performance. We show that with new distributed system features, and careful implementation, we can achieve the theoretically expected repair performance of MSR codes.

[1]  Dimitris S. Papailiopoulos,et al.  Repair Optimal Erasure Codes Through Hadamard Designs , 2011, IEEE Transactions on Information Theory.

[2]  Kannan Ramchandran,et al.  Exact-Repair MDS Code Construction Using Interference Alignment , 2011, IEEE Transactions on Information Theory.

[3]  Minghua Chen,et al.  Pyramid Codes: Flexible Schemes to Trade Space for Access Efficiency in Reliable Data Storage Systems , 2007, Sixth IEEE International Symposium on Network Computing and Applications (NCA 2007).

[4]  Ben Y. Zhao,et al.  OceanStore: an architecture for global-scale persistent storage , 2000, SIGP.

[5]  Gregory R. Ganger,et al.  Ursa minor: versatile cluster-based storage , 2005, FAST'05.

[6]  S.A. Brandt,et al.  CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[7]  Garth A. Gibson,et al.  DiskReduce : Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing , 2011 .

[8]  Carlos Maltzahn,et al.  RADOS: a scalable, reliable storage service for petabyte-scale storage clusters , 2007, PDSW '07.

[9]  Mario Blaum,et al.  A Tale of Two Erasure Codes in HDFS , 2015, FAST.

[10]  Van-Anh Truong,et al.  Availability in Globally Distributed Storage Systems , 2010, OSDI.

[11]  Kannan Ramchandran,et al.  Interference Alignment in Regenerating Codes for Distributed Storage: Necessity and Code Constructions , 2010, IEEE Transactions on Information Theory.

[12]  Jehoshua Bruck,et al.  Zigzag Codes: MDS Array Codes With Optimal Rebuilding , 2011, IEEE Transactions on Information Theory.

[13]  Robert Mateescu,et al.  Repair-optimal MDS array codes over GF(2) , 2013, 2013 IEEE International Symposium on Information Theory.

[14]  Yang Tang,et al.  NCCloud: applying network coding for the storage repair in a cloud-of-clouds , 2012, FAST.

[15]  Nihar B. Shah,et al.  Optimal Exact-Regenerating Codes for Distributed Storage at the MSR and MBR Points via a Product-Matrix Construction , 2010, IEEE Transactions on Information Theory.

[16]  Andreas Haeberlen,et al.  Glacier: highly durable, decentralized storage despite massive correlated failures , 2005, NSDI.

[17]  Yunnan Wu,et al.  Network coding for distributed storage systems , 2010, IEEE Trans. Inf. Theory.

[18]  Syed Ali Jafar,et al.  Distributed Data Storage with Minimum Storage Regenerating Codes - Exact and Functional Repair are Asymptotically Equally Efficient , 2010, ArXiv.

[19]  Kannan Ramchandran,et al.  Asymptotic Interference Alignment for Optimal Repair of MDS Codes in Distributed Storage , 2013, IEEE Transactions on Information Theory.

[20]  Dimitris S. Papailiopoulos,et al.  XORing Elephants: Novel Erasure Codes for Big Data , 2013, Proc. VLDB Endow..

[21]  Cheng Huang,et al.  Rethinking erasure codes for cloud file systems: minimizing I/O for recovery and degraded reads , 2012, FAST.

[22]  Alexandros G. Dimakis,et al.  The Benefits of Network Coding for Peer-to-Peer Storage Systems , 2006 .

[23]  Yunghsiang Sam Han,et al.  A Unified Form of Exact-MSR Codes via Product-Matrix Frameworks , 2015, IEEE Transactions on Information Theory.

[24]  Cheng Huang,et al.  Erasure Coding in Windows Azure Storage , 2012, USENIX Annual Technical Conference.

[25]  P. Vijay Kumar,et al.  A high-rate MSR code with polynomial sub-packetization level , 2015, 2015 IEEE International Symposium on Information Theory (ISIT).

[26]  Cheng Huang,et al.  Permutation code: Optimal exact-repair of a single failed node in MDS code based distributed storage systems , 2011, 2011 IEEE International Symposium on Information Theory Proceedings.

[27]  Yang Tang,et al.  NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds , 2014, IEEE Transactions on Computers.

[28]  GhemawatSanjay,et al.  The Google file system , 2003 .

[29]  Kannan Ramchandran,et al.  Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth , 2015, FAST.

[30]  Carlos Maltzahn,et al.  Ceph: a scalable, high-performance distributed file system , 2006, OSDI '06.

[31]  V. Cadambe,et al.  Minimum Repair Bandwidth for Exact Regeneration in Distributed Storage , 2010, 2010 Third IEEE International Workshop on Wireless Network Coding.