Paper information - CareDedup: Cache-Aware Deduplication for Reading Performance Optimization in Primary Storage


Deduplication is increasingly used to reduce the cost of primary storage. In practice, it often causes additional on-disk fragmentation that impairs read performance. Existing deduplication algorithms mainly focus on static data layout design, so that random I/O requests are largely avoided and the harmful effect is alleviated. However, our trace-driven emulations show that deduplication does not always impair reads; it also offers new opportunities for read performance optimization through additional cache hits. Motivated by this, we propose CareDedup, a novel cache-aware deduplication scheme that exploits these opportunities. Based on a uniform locality assessment algorithm, CareDedup selects the most profitable duplicate blocks to deduplicate in order to maximize read performance. Our experimental evaluation using real-world traces shows that, compared with sequence-based deduplication algorithms, CareDedup improves both the duplicate elimination ratio and the read latency. Given a desired duplicate elimination ratio, CareDedup consistently outperforms the sequence-based method, further reducing read latency by 2-5%.
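The claim that deduplication can create extra cache hits can be illustrated with a toy simulation. The following is a minimal sketch, not the paper's emulation or the CareDedup algorithm: it assumes a plain LRU block cache, and the workload, the mappings, and the names `count_cache_hits`, `no_dedup`, and `dedup` are all hypothetical.

```python
from collections import OrderedDict


def count_cache_hits(physical_reads, cache_size):
    """Count hits for a sequence of physical block addresses under an LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in physical_reads:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used block
    return hits


# Hypothetical workload: six logical blocks read once each.
# Assumption: logical blocks 0, 2 and 4 hold identical content.
logical_reads = [0, 1, 2, 3, 4, 5]

# Without deduplication every logical block has its own physical copy.
no_dedup = {lba: lba for lba in logical_reads}

# With deduplication the duplicates 2 and 4 point at physical block 0.
dedup = dict(no_dedup)
dedup[2] = 0
dedup[4] = 0

hits_plain = count_cache_hits([no_dedup[lba] for lba in logical_reads], cache_size=2)
hits_dedup = count_cache_hits([dedup[lba] for lba in logical_reads], cache_size=2)
print(hits_plain, hits_dedup)
```

In this toy case the deduplicated mapping turns two reads into cache hits because the shared physical block stays resident. The sketch ignores the fragmentation cost that the abstract also describes, which is the other side of the trade-off.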

Shanshan Li | Xiangke Liao | Bin Lin

[20] Xiaodong Liu, et al. CareDedup: Cache-Aware Deduplication for Reading Performance Optimization in Primary Storage, 2016, DSC.
