MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization

In data backup systems, full backups impose heavy bandwidth and processing-time overhead when backups are synchronized with source data, so incremental backup has become a focus of academic and industrial research. The key problem for incremental backup, which remains poorly solved, is finding the incremental data between the backups and the source data. To find this incremental data during the backup process, we propose a novel content-defined chunking algorithm. The source data and the backup data are divided into small variable-length chunks in the same way. A chunk of source data is then judged to be incremental data if it differs from every chunk in the backup data. In experiments, we compare the proposed chunking algorithm with classical and state-of-the-art algorithms. The results show that the incremental data found by the proposed algorithm is reduced by 13%–34% compared to the others at the same chunking throughput.
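The general workflow described in the abstract (chunk both sides the same way, then flag source chunks that match no backup chunk) can be sketched as follows. This is a minimal illustration and not the MII algorithm, whose cut-point rule is not given in the abstract. The windowed-sum boundary test, the function names `cdc_chunks` and `incremental_chunks`, and all parameter values are assumptions chosen for brevity.

```python
import hashlib
import random


def cdc_chunks(data, window=16, mask=0x3F, min_size=32, max_size=512):
    """Split data into variable-length chunks at content-defined cut points.

    Assumed boundary rule (illustrative, not MII): cut when the sum of the
    last `window` bytes has all `mask` bits set, subject to size limits.
    """
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # keep the sum over the last `window` bytes
        length = i - start + 1
        at_cut = length >= min_size and (rolling & mask) == mask
        if at_cut or length == max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a cut point
    return chunks


def incremental_chunks(source, backup):
    """Return the source chunks that do not match any chunk of the backup."""
    backup_digests = {hashlib.sha256(c).digest() for c in cdc_chunks(backup)}
    return [c for c in cdc_chunks(source)
            if hashlib.sha256(c).digest() not in backup_digests]


# Synthetic data: the backup never contains byte 0xFF, the source inserts one.
rng = random.Random(0)
backup = bytes(rng.randrange(200) for _ in range(20000))
source = backup[:10000] + b"\xff" + backup[10000:]

delta = incremental_chunks(source, backup)
print(len(delta), "incremental chunk(s),", sum(len(c) for c in delta), "bytes")
```

Because cut points depend on local content rather than on byte offsets, chunk boundaries after the inserted byte tend to realign with those of the backup, so only chunks near the edit are reported. The amount of data reported around an edit is the quantity the abstract says the proposed algorithm reduces.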
