The Dilemma between Deduplication and Locality: Can Both be Achieved?

Data deduplication is widely used to reduce the size of backup workloads, but it has the known disadvantage of causing poor data locality, also referred to as the fragmentation problem, which leads to poor restore and garbage collection (GC) performance. Current research has considered writing duplicates to maintain locality (e.g., rewriting) or caching data in memory or on SSD, but fragmentation continues to hurt restore and GC performance. Investigating the locality issue, we observed that most duplicate chunks in a backup come directly from its previous backup. We therefore propose a novel management-friendly deduplication framework, called MFDedup, that maintains the locality of backup workloads by using a data classification approach to generate an optimal data layout. Specifically, we use two key techniques, Neighbor-Duplicate-Focus indexing (NDF) and the Across-Version-Aware Reorganization scheme (AVAR), to perform duplicate detection against the previous backup and then rearrange chunks with an offline, iterative algorithm into a compact, sequential layout that nearly eliminates random I/O during restoration. Evaluation results with four backup datasets demonstrate that, compared with state-of-the-art techniques, MFDedup achieves deduplication ratios that are 1.12× to 2.19× higher and restore throughput that is 2.63× to 11.64× higher, owing to the optimal data layout we achieve. While the rearranging stage introduces overheads, they are more than offset by a GC process with nearly zero overhead. Moreover, the NDF index only requires indexes for two backup versions, while a traditional index grows with the number of versions retained.
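
To make the neighbor-only indexing idea concrete, the following is a minimal sketch and not the MFDedup implementation. It assumes fixed-size chunking, SHA-256 fingerprints, and in-memory sets; the names `fingerprints` and `neighbor_dedup` are hypothetical. Each backup version is deduplicated only against the fingerprint index of the immediately preceding version, so at most two indexes (previous and current) are alive at any time.

```python
import hashlib


def fingerprints(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks (an assumption of this sketch)
    and return a list of (fingerprint, chunk) pairs."""
    pairs = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        pairs.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return pairs


def neighbor_dedup(versions, chunk_size: int = 4):
    """Deduplicate each version against its previous version only.

    Returns, per version, the list of chunks that had to be stored.
    Only two fingerprint sets exist at once: prev_index and curr_index.
    """
    prev_index = set()   # fingerprints of the previous backup version
    stored = []
    for data in versions:
        curr_index = set()   # fingerprints of the version being ingested
        unique = []
        for fp, chunk in fingerprints(data, chunk_size):
            # A chunk is a duplicate if the previous version has it, or if
            # it already appeared earlier in the current version.
            if fp not in prev_index and fp not in curr_index:
                unique.append(chunk)
            curr_index.add(fp)
        stored.append(unique)
        prev_index = curr_index   # the older index is dropped here
    return stored


if __name__ == "__main__":
    backups = [b"AAAABBBBCCCC", b"AAAABBBBDDDD", b"AAAACCCCDDDD"]
    for n, chunks in enumerate(neighbor_dedup(backups), start=1):
        print(f"version {n}: stored {chunks}")
```

In this sketch, a chunk that reappears after skipping a version (`CCCC` in the third backup) is stored again because only the neighboring index is consulted. That is a property of the simplification here; how the actual system treats such chunks is not described in the abstract.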
