Efficient Data Deduplication for Big Data Storage Systems

For efficient chunking, we propose a Differential Evolution (DE) based approach, TTTD-P, which optimizes Two Thresholds Two Divisors (TTTD) Content-Defined Chunking (CDC) by replacing TTTD's multiple operations with a single dynamically optimized divisor D and an optimal threshold value, thereby reducing the number of computing operations. To reduce chunk-size variance, the original TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the extra divisor lowers chunking throughput. Asymmetric Extremum (AE) chunking, in contrast, significantly improves throughput by using the local extreme value in a variable-sized asymmetric window, overcoming the boundary-shift problem of Rabin and TTTD while achieving nearly the same deduplication ratio (DR). Building on these observations, the proposed DE-based TTTD-P chunking maximizes chunking throughput with an increased DR, and a scalable bucket-based indexing approach reduces the hash-value lookup time needed to identify and declare redundant chunks by about 16 times compared with Rabin CDC, 5 times compared with AE CDC, and 1.6 times compared with FastCDC on the Hadoop Distributed File System (HDFS).
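The sketch below illustrates the two ideas the abstract combines: content-defined chunking driven by a single divisor D with lower and upper thresholds, followed by duplicate detection through a bucketed fingerprint index. It is a minimal illustration, not the authors' implementation; the rolling hash, window size, threshold values, divisor value, and bucket count are assumed for demonstration and are not the DE-optimized parameters or the HDFS-based system described in the paper.

```python
# Minimal sketch: single-divisor content-defined chunking plus a bucketed
# SHA-1 index for duplicate-chunk detection. All parameters are illustrative
# assumptions, not the paper's DE-optimized values.

import hashlib

MIN_CHUNK = 2 * 1024    # lower threshold: never cut before this size (assumed)
MAX_CHUNK = 16 * 1024   # upper threshold: force a cut at this size (assumed)
DIVISOR_D = 4096        # single divisor D: cut when hash % D == D - 1 (assumed)

def rolling_hash(window: bytes) -> int:
    """Toy polynomial hash over a small byte window (stand-in for Rabin)."""
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFF
    return h

def chunk(data: bytes, window_size: int = 48):
    """Yield content-defined chunks using one divisor and two thresholds."""
    start, pos, n = 0, 0, len(data)
    while pos < n:
        pos += 1
        length = pos - start
        if length < MIN_CHUNK:
            continue                      # below lower threshold: no cut
        if length >= MAX_CHUNK:
            yield data[start:pos]         # upper threshold: forced cut
            start = pos
            continue
        h = rolling_hash(data[max(start, pos - window_size):pos])
        if h % DIVISOR_D == DIVISOR_D - 1:
            yield data[start:pos]         # content-defined cut point
            start = pos
    if start < n:
        yield data[start:n]               # trailing chunk

def deduplicate(data: bytes, num_buckets: int = 1024):
    """Store unique chunks only; duplicates found via bucketed SHA-1 keys."""
    buckets = [dict() for _ in range(num_buckets)]   # bucket index
    unique, duplicates = [], 0
    for c in chunk(data):
        fp = hashlib.sha1(c).hexdigest()
        bucket = buckets[int(fp[:4], 16) % num_buckets]  # narrow the lookup
        if fp in bucket:
            duplicates += 1               # redundant chunk: reference only
        else:
            bucket[fp] = len(unique)
            unique.append(c)
    return unique, duplicates

if __name__ == "__main__":
    payload = (b"abcdefgh" * 4096) + (b"ijklmnop" * 4096) + (b"abcdefgh" * 4096)
    stored, dups = deduplicate(payload)
    print(f"unique chunks: {len(stored)}, duplicate chunks: {dups}")
```

The bucket index only narrows where a fingerprint lookup happens; the reduction in judgment time reported in the abstract comes from distributing fingerprints across many such buckets rather than searching one global index.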
