LDFS: A Low Latency In-Line Data Deduplication File System

Driven by the rapid proliferation of sensors and intelligent devices, cyber-physical-social computing and networking (CPSCN) is emerging as a new computing paradigm, and massive amounts of data are generated in CPSCN environments. Traditional data deduplication cannot serve such environments well because of the long latency it introduces. This paper presents LDFS, a low-latency in-line data deduplication file system. LDFS decouples unique data blocks from the fingerprint index by writing the address of each data block to both the corresponding file recipe and the fingerprint index, so the read path never needs to access the fingerprint index. LDFS also assigns every unique data block a globally unique ID, so that a block's reference count can be obtained with a single disk access using that ID. To guarantee write performance, LDFS uses finer-grained locking to optimize the block-flushing strategy of the write buffer. Experimental results demonstrate that LDFS significantly improves read and write performance on the critical path compared with the traditional deduplication file system LessFS, while achieving almost the same deduplication ratio (40.8) as LessFS.
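The decoupling described in the abstract can be illustrated with a minimal in-memory sketch. It is not the LDFS implementation: the class name `ToyDedupStore`, the 4-byte block size, the use of SHA-1 as the fingerprint, and the choice to let a block's address double as its global ID are all assumptions made for illustration, and Python lists and dicts stand in for on-disk structures. The sketch shows only that writes consult the fingerprint index, while reads follow the addresses stored in the file recipe, and that reference counts are reached directly by ID.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny block size so the example is easy to follow


class ToyDedupStore:
    """Toy model of recipe/index decoupling; all names are illustrative."""

    def __init__(self):
        self.blocks = []     # block address -> block data
        self.refcount = []   # global ID -> reference count (ID == address here, an assumption)
        self.fp_index = {}   # fingerprint -> block address
        self.recipes = {}    # file name -> ordered list of block addresses

    def write(self, name, data):
        """Split data into blocks, deduplicate via the fingerprint index,
        and record each block's address in the file recipe."""
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            chunk = data[off:off + BLOCK_SIZE]
            fp = hashlib.sha1(chunk).digest()
            addr = self.fp_index.get(fp)
            if addr is None:
                # Unique block: store it and publish its address in the index.
                addr = len(self.blocks)
                self.blocks.append(chunk)
                self.refcount.append(0)
                self.fp_index[fp] = addr
            self.refcount[addr] += 1  # direct lookup by ID, no fingerprint needed
            recipe.append(addr)
        self.recipes[name] = recipe

    def read(self, name):
        """Rebuild a file from its recipe alone; fp_index is never touched."""
        return b"".join(self.blocks[addr] for addr in self.recipes[name])

    def delete(self, name):
        """Drop a file and decrement reference counts by ID.
        Space reclamation is deliberately left out of this sketch."""
        for addr in self.recipes.pop(name):
            self.refcount[addr] -= 1


store = ToyDedupStore()
store.write("a", b"abcdabcdwxyz")  # two identical blocks plus one distinct block
store.write("b", b"abcd")          # duplicates a block already stored
print(store.read("a"), len(store.blocks), store.refcount)
```

In this toy version the recipe holds everything the read path needs, which is the property the abstract attributes to LDFS; the fine-grained locking of the write buffer is not modeled.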
