PDFS: Partially Dedupped File System for Primary Workloads

Primary storage dedup is difficult because it must deliver low I/O latency and high throughput while effectively eliminating redundant data in the critical I/O path. In this paper, we design and implement PDFS, a partially dedupped file system for primary workloads. PDFS is built on a generalized framework that uses partial data lookup: redundant data is searched for in quickly chosen data subsets rather than in the whole data set. PDFS systematically improves I/O latency and throughput through techniques including write path optimization, dedup parallelization, and write order preservation. These design choices bring dedup to general primary workloads. Experimental results show that PDFS achieves 74-99 percent of the theoretical maximum dedup ratio with very small or even negative performance degradation compared with mainstream file systems without dedup support. We also discuss our experiences with different PDFS configurations.
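
To make the idea of partial data lookup concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the PDFS implementation: the class name `PartialLookupStore`, the caller-supplied `subset_key`, the 4 KiB block size, and the use of SHA-256 fingerprints are all assumptions. Each write looks for a duplicate only in the index of one chosen subset, so a lookup stays cheap, at the cost of missing duplicates that live in other subsets.

```python
import hashlib
from collections import defaultdict


def fingerprint(block: bytes) -> str:
    """Content fingerprint used to detect identical blocks (assumed: SHA-256)."""
    return hashlib.sha256(block).hexdigest()


class PartialLookupStore:
    """Hypothetical block store that dedups only within a chosen subset."""

    def __init__(self):
        # subset key -> {fingerprint: block id}; one small index per subset
        self.index = defaultdict(dict)
        # Physical storage: every entry is a block that was actually written
        self.blocks = []

    def write(self, subset_key: str, block: bytes):
        """Return (block_id, deduplicated) for a block written under subset_key."""
        fp = fingerprint(block)
        subset = self.index[subset_key]  # search this subset only, not all data
        if fp in subset:
            return subset[fp], True      # duplicate found inside the subset
        self.blocks.append(block)        # no match here: store a new copy
        block_id = len(self.blocks) - 1
        subset[fp] = block_id
        return block_id, False


store = PartialLookupStore()
data = b"A" * 4096  # assumed 4 KiB block

print(store.write("vm-images", data))  # (0, False): first copy is stored
print(store.write("vm-images", data))  # (0, True): duplicate in the same subset
print(store.write("home-dirs", data))  # (1, False): other subset, duplicate missed
```

The third write shows why such a scheme reaches only part of the theoretical maximum dedup ratio: the identical block is stored again because the lookup was limited to a different subset. How subsets are chosen in PDFS is not described in this abstract; the string keys above are placeholders.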
