Reducing impact of data fragmentation caused by in-line deduplication

Deduplication inevitably results in data fragmentation, because logically continuous data becomes scattered across many disk locations. In this work we focus on fragmentation caused by duplicates from previous backups of the same backup set; such duplicates are very common, since repeated full backups contain large amounts of unchanged data. In systems with in-line deduplication, which detects duplicates during writing and avoids storing them, this fragmentation causes data from the latest backup to be scattered across older backups. As a result, restore time from the latest backup can increase significantly, sometimes by more than a factor of two. We propose an algorithm called context-based rewriting (CBR for short) that minimizes this drop in restore performance for the latest backups by shifting fragmentation to older backups, which are rarely used for restore. By selectively rewriting a small percentage of duplicates during backup, we reduce the drop in restore bandwidth from 12--55% to only 4--7%, as shown by experiments driven by a set of backup traces. This is achieved with only a small increase in write time, between 1% and 5%. Since we rewrite only a few duplicates, and old copies of rewritten data are removed in the background, the whole process introduces only a small, temporary space overhead.
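The core decision CBR makes can be illustrated with a small sketch: for each duplicate chunk arriving in the backup stream, compare its stream context (the chunks that follow it in the stream) with its disk context (the chunks stored near its existing on-disk copy), and rewrite the duplicate when the contexts diverge and a rewrite budget remains. The helper names, the 5% rewrite budget, and the overlap threshold below are illustrative assumptions, not the paper's exact parameters.

```python
# Hedged sketch of context-based rewriting (CBR) in an in-line dedup backup.
# REWRITE_LIMIT and UTILITY_THRESHOLD are assumed values for illustration.

REWRITE_LIMIT = 0.05      # rewrite at most ~5% of chunks seen (assumed budget)
UTILITY_THRESHOLD = 0.7   # contexts overlapping less than this are "fragmented"

def context_overlap(stream_ctx, disk_ctx):
    """Fraction of the chunks following this duplicate in the backup stream
    that are also stored nearby on disk (the disk context)."""
    if not stream_ctx:
        return 1.0
    return len(set(stream_ctx) & set(disk_ctx)) / len(stream_ctx)

def backup_chunk(is_duplicate, stream_ctx, disk_ctx, stats):
    """Decide whether to store, dedup, or rewrite an incoming chunk."""
    stats["seen"] += 1
    if not is_duplicate:
        stats["stored"] += 1
        return "store"
    # A duplicate whose on-disk neighborhood diverges from the stream
    # context would fragment a future restore; rewrite it if budget allows.
    overlap = context_overlap(stream_ctx, disk_ctx)
    rewritten_frac = stats["rewritten"] / stats["seen"]
    if overlap < UTILITY_THRESHOLD and rewritten_frac < REWRITE_LIMIT:
        stats["rewritten"] += 1
        return "rewrite"      # store a fresh copy; the old one is reclaimed later
    stats["deduped"] += 1
    return "dedup"
```

In this sketch the budget check is what keeps the write-time overhead small: only the most restore-harmful duplicates are rewritten, and the rest are deduplicated as usual.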
