HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

Eliminating duplicate data in the primary storage of clouds increases the cost-efficiency of cloud service providers and reduces what users pay for cloud services. Existing primary deduplication techniques either use inline caching to exploit locality in primary workloads or use post-processing deduplication that runs during system idle time to avoid degrading I/O performance. However, neither works well on cloud servers running multiple services or applications, for two reasons. First, temporal locality of duplicate data writes may not exist in some primary storage workloads, so inline caching often fails to achieve a good deduplication ratio. Second, post-processing deduplication allows duplicate data to be written to disk, so it does not provide the benefit of I/O deduplication and requires high peak storage capacity. This paper presents HPDedup, a Hybrid Prioritized data Deduplication mechanism for storage systems shared by applications running in co-located virtual machines or containers. HPDedup fuses an inline phase and a post-processing phase to achieve exact deduplication. In the inline deduplication phase, HPDedup uses a fingerprint caching mechanism that estimates the temporal locality of duplicates in the data streams from different VMs or applications and prioritizes cache allocation among these streams based on that estimate. HPDedup also allows a different deduplication threshold for each stream, based on its spatial locality, to reduce disk fragmentation. The post-processing phase removes from disk the duplicates whose fingerprints could not be cached because of weak temporal locality. Our experimental results show that HPDedup clearly outperforms state-of-the-art primary storage deduplication techniques in terms of inline cache efficiency and primary deduplication efficiency.
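The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name here (`allocate_cache`, `StreamCache`, `inline_dedup`, `post_process`) is hypothetical, the locality scores are assumed to be given rather than estimated, the proportional split is only one plausible way to prioritize cache allocation, and a per-stream LRU cache stands in for whatever caching policy HPDedup actually uses.

```python
import hashlib
from collections import OrderedDict


def allocate_cache(locality_scores, total_entries):
    """Split a fingerprint-cache budget among streams in proportion to an
    assumed temporal-locality score (higher score = more cache entries)."""
    total = sum(locality_scores.values())
    if total == 0:
        share = total_entries // len(locality_scores)
        return {stream: share for stream in locality_scores}
    return {stream: int(total_entries * score / total)
            for stream, score in locality_scores.items()}


class StreamCache:
    """Per-stream LRU cache of fingerprints (an assumed stand-in policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup_or_insert(self, fp):
        """Return True on a cache hit; otherwise remember fp and return False."""
        if fp in self.entries:
            self.entries.move_to_end(fp)
            return True
        if self.capacity > 0:
            self.entries[fp] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return False


def fingerprint(block):
    """Content fingerprint of a data block (SHA-1 chosen for illustration)."""
    return hashlib.sha1(block).hexdigest()


def inline_dedup(writes, quotas):
    """Inline phase: drop writes whose fingerprint hits the stream's cache.
    Returns the (stream, fingerprint) pairs that still reach the disk."""
    caches = {stream: StreamCache(quota) for stream, quota in quotas.items()}
    on_disk = []
    for stream, block in writes:
        fp = fingerprint(block)
        if not caches[stream].lookup_or_insert(fp):
            on_disk.append((stream, fp))
    return on_disk


def post_process(on_disk):
    """Post-processing phase: remove duplicates the inline cache missed,
    keeping one copy per fingerprint."""
    seen = set()
    unique = []
    for _, fp in on_disk:
        if fp not in seen:
            seen.add(fp)
            unique.append(fp)
    return unique
```

With scores such as `{"vm_a": 3, "vm_b": 1}`, the stream with stronger assumed locality receives most of the cache, so its duplicates are removed inline, while duplicates from the other stream that fall out of its small cache reach the disk and are left for `post_process`.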
