Efficient Data Deduplication for Big Data Storage Systems

For efficient chunking, we propose a Differential Evolution (DE) based approach, TTTD-P, which optimizes Two Thresholds Two Divisors (TTTD) Content-Defined Chunking (CDC) by replacing TTTD's multiple operations with a single dynamically optimized divisor D and an optimal threshold value, thereby reducing the number of computing operations. To reduce chunk-size variance, the original TTTD algorithm introduces an additional backup divisor D′ that has a higher probability of finding cut points; however, the extra divisor lowers chunking throughput. Asymmetric Extremum (AE) chunking, in contrast, significantly improves throughput by using the local extreme value in a variable-sized asymmetric window, overcoming the boundary-shift problem of Rabin and TTTD while achieving nearly the same deduplication ratio (DR). Building on these observations, the proposed DE-based TTTD-P chunking maximizes chunking throughput with an increased DR, and a scalable bucket-based indexing approach reduces the hash-value lookup time needed to identify and declare redundant chunks by about 16 times compared with Rabin CDC, 5 times compared with AE CDC, and 1.6 times compared with FastCDC on the Hadoop Distributed File System (HDFS).
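The sketch below illustrates the two ideas the abstract combines: content-defined chunking driven by a single divisor D with lower and upper thresholds, followed by duplicate detection through a bucketed fingerprint index. It is a minimal illustration, not the authors' implementation; the rolling hash, window size, threshold values, divisor value, and bucket count are assumed for demonstration and are not the DE-optimized parameters or the HDFS-based system described in the paper.

```python
# Minimal sketch: single-divisor content-defined chunking plus a bucketed
# SHA-1 index for duplicate-chunk detection. All parameters are illustrative
# assumptions, not the paper's DE-optimized values.

import hashlib

MIN_CHUNK = 2 * 1024    # lower threshold: never cut before this size (assumed)
MAX_CHUNK = 16 * 1024   # upper threshold: force a cut at this size (assumed)
DIVISOR_D = 4096        # single divisor D: cut when hash % D == D - 1 (assumed)

def rolling_hash(window: bytes) -> int:
    """Toy polynomial hash over a small byte window (stand-in for Rabin)."""
    h = 0
    for b in window:
        h = (h * 257 + b) & 0xFFFFFFFF
    return h

def chunk(data: bytes, window_size: int = 48):
    """Yield content-defined chunks using one divisor and two thresholds."""
    start, pos, n = 0, 0, len(data)
    while pos < n:
        pos += 1
        length = pos - start
        if length < MIN_CHUNK:
            continue                      # below lower threshold: no cut
        if length >= MAX_CHUNK:
            yield data[start:pos]         # upper threshold: forced cut
            start = pos
            continue
        h = rolling_hash(data[max(start, pos - window_size):pos])
        if h % DIVISOR_D == DIVISOR_D - 1:
            yield data[start:pos]         # content-defined cut point
            start = pos
    if start < n:
        yield data[start:n]               # trailing chunk

def deduplicate(data: bytes, num_buckets: int = 1024):
    """Store unique chunks only; duplicates found via bucketed SHA-1 keys."""
    buckets = [dict() for _ in range(num_buckets)]   # bucket index
    unique, duplicates = [], 0
    for c in chunk(data):
        fp = hashlib.sha1(c).hexdigest()
        bucket = buckets[int(fp[:4], 16) % num_buckets]  # narrow the lookup
        if fp in bucket:
            duplicates += 1               # redundant chunk: reference only
        else:
            bucket[fp] = len(unique)
            unique.append(c)
    return unique, duplicates

if __name__ == "__main__":
    payload = (b"abcdefgh" * 4096) + (b"ijklmnop" * 4096) + (b"abcdefgh" * 4096)
    stored, dups = deduplicate(payload)
    print(f"unique chunks: {len(stored)}, duplicate chunks: {dups}")
```

The bucket index only narrows where a fingerprint lookup happens; the reduction in judgment time reported in the abstract comes from distributing fingerprints across many such buckets rather than searching one global index.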
