Leach: an automatic learning cache for inline primary deduplication system

Deduplication technology is increasingly used to reduce storage costs. Although it has been applied successfully to backup and archival systems, existing techniques are hard to deploy in primary storage systems because of the latency of detecting duplicated data: every unit has to be checked against a very large fingerprint index before it is written. In this paper we introduce Leach, a self-learning in-memory fingerprint cache for inline primary storage that reduces the write cost in deduplication systems. Leach is motivated by a characteristic of real-world I/O workloads: the access patterns of duplicated data are highly skewed. Leach adopts a splay tree to organize the on-disk fingerprint index, automatically learns the access patterns, and maintains hot working sets in cache memory, with the goal of servicing the majority of duplicate-detection requests from the cache. Leveraging the working set property, Leach provides optimizations that reduce the cost of splay operations on the fingerprint index and of cache updates. In comprehensive experiments on several real-world datasets, Leach outperforms a conventional LRU (least recently used) cache policy by reducing the number of cache misses, and it significantly improves write performance without a great impact on cache hits.
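To make the role of the splay tree concrete, the following is a minimal Python sketch of a fingerprint index kept in a splay tree, where each lookup moves the accessed fingerprint to the root so that frequently duplicated data stays near the top. It is an illustration under stated assumptions, not the Leach implementation: the tree here lives entirely in memory (Leach organizes an on-disk index and keeps a separate cache), SHA-1 is an assumed fingerprint function, and the names `SplayFingerprintIndex`, `fingerprint`, and `write_block` are hypothetical.

```python
import hashlib


class Node:
    """One index entry: fingerprint (key) -> stored block location (value)."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


def rotate_right(x):
    y = x.left
    x.left, y.right = y.right, x
    return y


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y


def splay(root, key):
    """Move `key` (or the last node on its search path) to the root.
    Recursive for brevity; fine for a small illustrative tree."""
    if root is None or root.key == key:
        return root
    if key < root.key:
        if root.left is None:
            return root
        if key < root.left.key:            # zig-zig
            root.left.left = splay(root.left.left, key)
            root = rotate_right(root)
        elif key > root.left.key:          # zig-zag
            root.left.right = splay(root.left.right, key)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)
    if root.right is None:
        return root
    if key > root.right.key:               # zig-zig (mirror)
        root.right.right = splay(root.right.right, key)
        root = rotate_left(root)
    elif key < root.right.key:             # zig-zag (mirror)
        root.right.left = splay(root.right.left, key)
        if root.right.left is not None:
            root.right = rotate_right(root.right)
    return root if root.right is None else rotate_left(root)


class SplayFingerprintIndex:
    """Hypothetical in-memory stand-in for a splay-organized fingerprint index."""
    def __init__(self):
        self.root = None

    def lookup(self, fp):
        self.root = splay(self.root, fp)
        if self.root is not None and self.root.key == fp:
            return self.root.value
        return None

    def insert(self, fp, location):
        if self.root is None:
            self.root = Node(fp, location)
            return
        self.root = splay(self.root, fp)
        if self.root.key == fp:
            self.root.value = location
            return
        node = Node(fp, location)
        if fp < self.root.key:             # old root becomes right child
            node.right, node.left = self.root, self.root.left
            self.root.left = None
        else:                              # old root becomes left child
            node.left, node.right = self.root, self.root.right
            self.root.right = None
        self.root = node


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 as the fingerprint; the abstract does not name a hash.
    return hashlib.sha1(block).hexdigest()


def write_block(index, block, new_location):
    """Inline dedup check: return (location, is_duplicate)."""
    fp = fingerprint(block)
    existing = index.lookup(fp)
    if existing is not None:
        return existing, True              # duplicate: reuse stored location
    index.insert(fp, new_location)
    return new_location, False
```

In this sketch a repeated write of the same block finds its fingerprint at or near the root, which is the skew-exploiting behavior the abstract attributes to the splay organization. A real system would additionally bound memory use and decide which part of the tree to keep cached.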
