Reducing Fragmentation for In-line Deduplication Backup Storage via Exploiting Backup History and Cache Knowledge

In backup systems, the chunks of each backup become physically scattered after deduplication, which causes a challenging fragmentation problem. We observe that the fragmentation manifests in two kinds of containers: sparse containers and out-of-order containers. Sparse containers decrease both restore performance and garbage collection efficiency, while out-of-order containers decrease restore performance when the restore cache is small. To reduce the fragmentation, we propose a History-Aware Rewriting algorithm (HAR) and a Cache-Aware Filter (CAF). HAR exploits historical information in backup systems to accurately identify and reduce sparse containers, and CAF exploits restore cache knowledge to identify the out-of-order containers that hurt restore performance. CAF efficiently complements HAR in datasets where out-of-order containers are dominant. To reduce the metadata overhead of garbage collection, we further propose a Container-Marker Algorithm (CMA) that identifies valid containers instead of valid chunks. Our extensive experimental results on real-world datasets show that HAR significantly improves restore performance by 2.84-175.36× at a cost of rewriting only 0.5-2.03 percent of the data.
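To make the core idea of HAR concrete, the sketch below simulates its sparse-container identification step in Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the Chunk record, the helper names, the 4 MiB container size, and the 50 percent utilization threshold are all illustrative choices introduced here. The idea it captures is that, after a backup completes, the system computes how much of each referenced container that backup actually uses; under-utilized containers are recorded, and duplicate chunks that point to them are rewritten into new containers during the next backup.

```python
from collections import namedtuple

# Illustrative record for a chunk in a backup recipe: which container
# holds it, how large it is, and whether deduplication found it to be
# a duplicate of an already-stored chunk.
Chunk = namedtuple("Chunk", ["container_id", "size", "is_duplicate"])

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed fixed container size (4 MiB)
UTILIZATION_THRESHOLD = 0.5        # assumed cutoff; sparse if below this

def identify_sparse_containers(backup_recipe):
    """After a backup completes, compute per-container utilization:
    the fraction of each container's bytes referenced by this backup.
    Containers below the threshold are recorded as sparse and inherited
    by the next backup as rewrite candidates."""
    referenced = {}  # container_id -> bytes referenced by this backup
    for chunk in backup_recipe:
        referenced[chunk.container_id] = (
            referenced.get(chunk.container_id, 0) + chunk.size)
    return {cid for cid, used in referenced.items()
            if used / CONTAINER_SIZE < UTILIZATION_THRESHOLD}

def should_rewrite(chunk, inherited_sparse_containers):
    """During the next backup, a duplicate chunk that would otherwise be
    deduplicated against an inherited sparse container is rewritten into
    a new container instead, shrinking the sparse set over time."""
    return chunk.is_duplicate and chunk.container_id in inherited_sparse_containers

# Example: container c1 has only 1 MiB of 4 MiB referenced (0.25 < 0.5),
# so it is flagged sparse; c2 at 3 MiB (0.75) is not.
recipe = [Chunk("c1", 1 << 20, True), Chunk("c2", 3 << 20, True)]
print(identify_sparse_containers(recipe))  # -> {'c1'}
```

The design point this sketch illustrates is why HAR can be accurate: rewrite decisions are made from the previous backup's exact per-container utilization rather than estimated on the fly from a partial view of the current stream.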
