Differential Erasure Codes for Efficient Archival of Versioned Data in Cloud Storage Systems

In this paper, we study the problem of storing an archive of versioned data in a reliable and efficient manner. The proposed technique is relevant in cloud settings, where, because of the huge volume of data to be stored, distributed scale-out storage systems that deploy erasure codes for fault tolerance are typical. However, existing erasure coding techniques do not leverage the redundancy of information across multiple versions of a file. We propose a new technique called differential erasure coding (DEC), in which the differences (deltas) between subsequent versions are stored rather than the whole objects, akin to a typical delta encoding technique. Unlike delta encoding techniques, however, DEC opportunistically exploits sparsity in the updates (i.e., cases where the difference between two successive versions has few non-zero entries) to store the deltas using sparse sampling techniques applied together with erasure coding. We first show that DEC provides significant savings in storage size for versioned data whenever the update patterns are characterized by in-place alterations. Subsequently, we propose a practical DEC framework that reaps storage-size benefits not just for in-place alterations but also for real-world update patterns, such as insertions and deletions, that alter the overall data size. We conduct experiments with several synthetic and practical workloads to demonstrate that the practical variant of DEC provides significant reductions in storage overhead.
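As a rough illustration of the in-place alteration case described above, the following Python sketch computes the delta between two equal-length versions and counts its non-zero entries. It is not the paper's implementation. The function names, the use of byte-wise XOR as the difference (addition in GF(2^8)), and the rule that a delta with gamma non-zero symbols can be captured by 2*gamma sparse samples when 2*gamma < k are all assumptions made for this example.

```python
def delta(old: bytes, new: bytes) -> bytes:
    """Symbol-wise difference of two equal-length versions.

    Assumption: symbols are bytes and the difference is XOR,
    i.e. subtraction in GF(2^8).
    """
    if len(old) != len(new):
        raise ValueError("in-place alteration model assumes equal lengths")
    return bytes(a ^ b for a, b in zip(old, new))


def sparsity(z: bytes) -> int:
    """Number of non-zero symbols in the delta."""
    return sum(1 for s in z if s != 0)


def symbols_to_store(old: bytes, new: bytes) -> int:
    """Symbols kept for the new version before erasure coding.

    Assumption: a delta with gamma non-zero symbols needs 2*gamma
    sparse samples; if that is not smaller than k, store all k symbols.
    """
    k = len(new)
    gamma = sparsity(delta(old, new))
    return min(2 * gamma, k)


v1 = b"versioned data!!"
v2 = b"versioned Data!?"   # two symbols altered in place

print(sparsity(delta(v1, v2)))   # non-zero entries in the delta
print(symbols_to_store(v1, v2))  # symbols to keep instead of all 16
```

In this sketch, the fallback to k symbols reflects the opportunistic nature of the scheme: when an update is not sparse enough, nothing is gained by sampling and the version is stored at full size. Insertions and deletions change the length and are outside this simplified model.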
