Optimal Control of Storage Regeneration with Repair Codes

High availability of containerized applications requires robust storage of application state. Since basic replication is extremely costly at scale, storage space requirements can be reduced by means of erasure and/or repair codes. In this paper we address storage regeneration using repair codes, a robust distributed storage technique that does not need to restore the whole state after a failure: only the content of the lost servers is replaced. To do so, new clean-slate storage units are made operational, incurring a cost for activating new storage servers and a cost for transferring repair data. Our goal is to guarantee maximal availability of containers’ state files by a given deadline: when a fault occurs at a subset of the storage servers, we aim to ensure that they are repaired by that deadline. We introduce a controlled fluid model and derive the optimal activation policy for replacing servers under such correlated faults. The solution concept is the optimal control of regeneration via the Pontryagin minimum principle. We characterize feasibility conditions and prove that the optimal policy is of threshold type. Numerical results show how to apply the model for system dimensioning and illustrate the tradeoff between the activation of servers and the communication cost.
