Assuring Demanded Read Performance of Data Deduplication Storage with Backup Datasets

Data deduplication has been widely adopted in contemporary backup storage systems. It not only saves storage space considerably but also shortens backup time significantly. Because the primary goal of data deduplication is to save storage space, its design has focused mainly on improving write performance by removing as much duplicate data as possible from incoming data streams. Although fast recovery from a system crash relies mainly on the read performance of the deduplication storage, little work has investigated how to improve it. In general, as the amount of deduplicated data increases, write performance improves, whereas the associated read performance deteriorates. In this paper, we propose a deduplication scheme that assures the demanded read performance of each data stream while keeping its write performance at a reasonable level, thereby making it possible to guarantee a target system recovery time. To this end, we first propose an indicator called the cache-aware Chunk Fragmentation Level (CFL), which estimates degraded read performance on the fly by taking into account both incoming chunk information and read-cache effects. We also show a strong correlation between the CFL and read performance on backup datasets. To guarantee demanded read performance expressed as a CFL value, we propose a read performance enhancement scheme called selective duplication, which is activated whenever the current CFL becomes worse than the demanded one. The key idea is to judiciously write non-unique (shared) chunks into storage together with unique chunks unless the shared chunks already exhibit good enough spatial locality, which we quantify with a selective duplication threshold. Our experiments with real backup datasets demonstrate that the proposed scheme achieves demanded read performance in most cases at a reasonable cost in write performance.
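To make the described mechanism concrete, the following is a minimal Python sketch of the decision logic implied by the abstract. It is an illustration under stated assumptions, not the authors' implementation: the function names, the threshold semantics, and the CFL formula (taken here as the ratio of the optimal number of container reads under a perfectly sequential layout to the container reads actually required once read-cache hits are discounted) are assumptions introduced for exposition.

```python
# Hypothetical sketch of cache-aware CFL estimation and selective duplication.
# All identifiers below are illustrative; the real system operates on
# containers, a read cache, and chunk fingerprints inside a dedup engine.

def estimate_cfl(optimal_container_reads: int, actual_container_reads: int) -> float:
    """Assumed cache-aware CFL: optimal container reads divided by the
    container reads actually needed after accounting for read-cache hits.
    A value near 1.0 means little fragmentation; lower values mean worse
    expected read (restore) performance."""
    if actual_container_reads == 0:
        return 1.0
    return optimal_container_reads / actual_container_reads


def handle_shared_chunk_run(shared_run, cfl_current: float,
                            cfl_demanded: float, dup_threshold: int) -> str:
    """Selective duplication decision for a run of shared (non-unique) chunks.
    Returns "reference" to keep pointers to existing copies, or "duplicate"
    to rewrite the chunks next to the current stream's unique chunks."""
    if cfl_current >= cfl_demanded:
        return "reference"   # read performance is still acceptable; dedup fully
    if len(shared_run) >= dup_threshold:
        return "reference"   # long contiguous run: spatial locality is good enough
    return "duplicate"       # short, scattered shared chunks: rewrite to restore locality
```

For instance, with a demanded CFL of 0.6, a current estimate of 0.45, and a threshold of 8 chunks, a scattered run of 3 shared chunks would be rewritten as duplicates, whereas a contiguous run of 20 shared chunks would remain deduplicated; the threshold thus trades extra storage for restored sequentiality only where it matters.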
