Reducing Chunk Fragmentation for In-Line Delta Compressed and Deduplicated Backup Systems

Chunk-level deduplication, while robust in removing duplicate chunks, introduces chunk fragmentation, which decreases restore performance. Rewriting algorithms have been proposed to reduce this fragmentation and accelerate restores. Delta compression removes redundant data between non-duplicate but similar chunks, which chunk-level deduplication cannot eliminate, and some applications use it as a complement to chunk-level deduplication to attain extra space and bandwidth savings. However, we observe that delta compression introduces a new type of chunk fragmentation, stemming from delta compressed chunks whose base chunks are fragmented. We refer to such delta compressed chunks as base-fragmented chunks. We find that this new type of fragmentation has a more severe impact on restore performance than the fragmentation introduced by chunk-level deduplication, and that existing rewriting algorithms cannot reduce it. To address the problem caused by base-fragmented chunks, we propose SDC, a scheme that selectively performs delta compression after chunk-level deduplication. The main idea behind SDC is to simulate a restore cache to identify the non-base-fragmented chunks and to perform delta compression only for these chunks, thus avoiding the new type of chunk fragmentation. Because of the locality among backup streams, most of the non-base-fragmented chunks can be detected by the simulated restore cache. Experimental results on real-world datasets show that SDC improves the restore performance of the delta compressed and deduplicated backup system by 1.93X-7.48X and achieves 95.5%-97.4% of its compression, while imposing negligible impact on backup throughput.
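The selective decision described in the abstract can be illustrated with a minimal sketch. It is not the SDC implementation: the LRU policy, the container granularity, the exact-match similarity lookup, and every name below (SimulatedRestoreCache, plan_backup, the two index dictionaries) are assumptions made for illustration. A non-duplicate chunk is delta compressed only if the container holding its base chunk is currently in the simulated restore cache; otherwise it is stored in full.

```python
from collections import OrderedDict


class SimulatedRestoreCache:
    """Hypothetical LRU cache of container IDs kept during backup.

    It mimics which containers a later restore would have cached
    (LRU and container granularity are assumptions of this sketch).
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._lru = OrderedDict()

    def access(self, container_id):
        """Record an access, evicting the least recently used container."""
        if container_id in self._lru:
            self._lru.move_to_end(container_id)
            return
        self._lru[container_id] = None
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)

    def __contains__(self, container_id):
        return container_id in self._lru


def plan_backup(stream, fingerprint_index, similarity_index, cache, new_container):
    """Decide 'dedup', 'delta', or 'store' for each (fingerprint, sketch) pair.

    fingerprint_index maps fingerprint -> container of the stored chunk.
    similarity_index maps sketch -> container of a candidate base chunk.
    Both are simplified stand-ins for real indexes.
    """
    decisions = []
    for fingerprint, sketch in stream:
        if fingerprint in fingerprint_index:
            # Exact duplicate: a restore would read its container here.
            cache.access(fingerprint_index[fingerprint])
            decisions.append("dedup")
            continue

        base_container = similarity_index.get(sketch)
        if base_container is not None and base_container in cache:
            # Base is expected to be cached at restore time: safe to delta.
            cache.access(base_container)
            decisions.append("delta")
        else:
            # No base, or base would be fragmented: keep the full chunk.
            decisions.append("store")
            similarity_index.setdefault(sketch, new_container)

        # Either way the new data lands in the container being written.
        fingerprint_index[fingerprint] = new_container
        cache.access(new_container)
    return decisions
```

In this sketch a chunk whose base container has already been evicted from the simulated cache is treated as potentially base-fragmented and is written in full, which trades a little compression for a restore that does not need an extra container read to fetch the base.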
