Practical Single Node Failure Recovery Using Fractional Repetition Codes in Data Centers

Node failures in distributed storage systems are becoming a critical issue, and many erasure codes are designed to handle such failures. The purpose of this paper is to evaluate fractional repetition (FR) codes, a class of regenerating codes for distributed storage systems, as a practical solution. FR codes consist of a concatenation of an outer maximum distance separable (MDS) code and an inner fractional repetition code that splits the data into several blocks and stores multiple replicas of each block on different nodes in the system. We formulate the problem as an integer linear program that modifies the fractional repetition code to allow different block sizes and minimizes the recovery cost over all single-node failure scenarios. The contribution of this work is threefold. First, we generate an optimized block distribution schema that minimizes the total system repair cost in a data center. Second, we present a full recovery plan for the system. Third, we account for newcomer blocks, allocating them to nodes with minimal computation and without changing the original optimal schema, which makes our work practical to apply. We obtain the solution with a self-designed genetic algorithm that searches within the feasible solution space, and we show that its results are close to optimal.
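To make the inner repetition layer and the single-node repair cost concrete, here is a minimal Python sketch. It is not the integer linear program or the genetic algorithm of this paper. It assumes a simple placement with two replicas per block (one block per pair of nodes), and the function names and block sizes are hypothetical values chosen only for illustration. The outer MDS code is omitted.

```python
from itertools import combinations


def complete_graph_placement(num_nodes):
    """Assign one block to every pair of nodes, so each block has two replicas.

    Assumption: this pairwise layout is used only as an illustration; it is
    not the optimized schema produced in the paper.
    """
    placement = {node: [] for node in range(num_nodes)}
    for block_id, (a, b) in enumerate(combinations(range(num_nodes), 2)):
        placement[a].append(block_id)
        placement[b].append(block_id)
    return placement


def repair_plan(placement, failed_node):
    """Map each block lost on the failed node to a surviving node holding a replica."""
    plan = {}
    for block in placement[failed_node]:
        for node, blocks in placement.items():
            if node != failed_node and block in blocks:
                plan[block] = node
                break
    return plan


def repair_cost(placement, failed_node, block_sizes):
    """Cost of one repair, taken here as the total size of the blocks to re-copy."""
    return sum(block_sizes[block] for block in placement[failed_node])


placement = complete_graph_placement(4)   # 4 nodes, 6 blocks
block_sizes = [1, 1, 2, 2, 3, 3]          # hypothetical, unequal block sizes

plan_for_node_0 = repair_plan(placement, failed_node=0)

# Sum of repair costs over every single-node failure scenario.
total_cost = sum(repair_cost(placement, node, block_sizes) for node in placement)
```

In this sketch a repair only copies surviving replicas, with no decoding, and the total cost over all single-node failures depends on which blocks share a node. That dependence is what leaves room for optimizing the block distribution when block sizes differ.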
