A Cost-efficient Rewriting Scheme to Improve Restore Performance in Deduplication Systems

In chunk-based deduplication systems, logically consecutive chunks become physically scattered across different containers after deduplication, causing severe fragmentation. Fragmentation significantly degrades restore performance, because restoring a backup requires reading the scattered chunks from many containers. Existing work rewrites fragmented duplicate chunks into new containers to improve restore performance, but this introduces redundancy among containers: it decreases the deduplication ratio, and the redundant chunks in the containers retrieved during restore waste limited disk bandwidth and slow down the restore. To improve restore performance while maintaining a high deduplication ratio, this paper proposes a cost-efficient submodular maximization rewriting scheme (SMR). SMR first formulates defragmentation as an optimization problem of selecting suitable containers, and then builds a submodular maximization model that solves this problem by selecting the containers with the most distinct referenced chunks. We implement SMR in a deduplication system and evaluate it on two real-world datasets. Experimental results demonstrate that SMR outperforms state-of-the-art schemes in both restore performance and deduplication ratio. We have released the source code of SMR for public use.
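The container-selection step described above can be illustrated with the standard greedy algorithm for submodular (coverage) maximization, which repeatedly picks the container contributing the most not-yet-covered referenced chunks and carries a (1 - 1/e) approximation guarantee. The sketch below is illustrative only: the container IDs, chunk sets, and budget are hypothetical, not SMR's actual data structures or objective.

```python
# Hedged sketch: greedy maximization of a coverage function, the classic
# approach for submodular container selection. All identifiers below are
# illustrative assumptions, not SMR's real implementation.

def greedy_select(containers, budget):
    """Pick up to `budget` containers maximizing distinct chunks covered.

    containers: dict mapping a container id to the set of chunk ids
                that the current backup stream references in it.
    """
    covered = set()   # chunk ids covered by the containers chosen so far
    chosen = []       # container ids selected, in greedy order
    for _ in range(budget):
        # Marginal gain of each remaining container = new chunks it adds.
        best_id, best_gain = None, 0
        for cid, chunks in containers.items():
            if cid in chosen:
                continue
            gain = len(chunks - covered)
            if gain > best_gain:
                best_id, best_gain = cid, gain
        if best_id is None:   # no remaining container adds new chunks
            break
        chosen.append(best_id)
        covered |= containers[best_id]
    return chosen, covered

# Toy example: four containers, budget of two.
containers = {
    "c1": {1, 2, 3, 4},
    "c2": {3, 4, 5},
    "c3": {6, 7},
    "c4": {1, 6},
}
chosen, covered = greedy_select(containers, budget=2)
```

With this toy input the greedy pass first takes "c1" (gain 4), then "c3" (gain 2, beating the gain of 1 from "c2" or "c4"), covering six distinct chunks with two container reads. Chunks referenced by the backup but absent from the chosen containers are the candidates a rewriting scheme would write into new containers.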
