Function of Content Defined Chunking Algorithms in Incremental Synchronization

Data chunking algorithms divide data into many small chunks, so that an operation on the whole data becomes an operation on multiple small chunks. They have been widely used in duplicate data detection, parallel computing, and other fields, but are seldom used in incremental data synchronization. Targeting the characteristics of incremental data synchronization, this paper proposes a novel data chunking algorithm. The two data sets to be synchronized are divided into small chunks and the contents of these chunks are compared; the chunks that differ are the incremental data to be found. The new algorithm decides whether to set a cut-point based on the number of 1 bits in the binary representation of all bytes in an interval. It thereby improves resistance to the byte-shifting problem at the expense of chunk-size stability, which makes it more suitable for incremental data synchronization. In experiments comparing this algorithm with several classical and state-of-the-art algorithms, the incremental data found by this algorithm is 32% to 57% smaller than that found by the others for the same changes between two data sets. The experimental results based on real-world datasets show that PCI improves the calculation speed of the classic Rsync algorithm by up to 70%, with the drawback of increasing the transmission compression rate by up to 11.8%.
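
The idea of cutting on the number of 1 bits in an interval, then comparing chunks to find the incremental data, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the interval length `WINDOW`, the bit-count `THRESHOLD`, the rule that a chunk is at least one interval long, and the function names `chunk` and `incremental_chunks` are all assumptions made for illustration.

```python
import hashlib

# Hypothetical parameters chosen for illustration; the abstract gives no values.
WINDOW = 4       # interval length in bytes
THRESHOLD = 20   # cut when the interval holds at least this many 1 bits (max 32)

# Number of 1 bits for every possible byte value.
POPCOUNT = [bin(b).count("1") for b in range(256)]


def chunk(data: bytes, window: int = WINDOW, threshold: int = THRESHOLD) -> list:
    """Split data at cut-points decided by the 1-bit count of a sliding interval."""
    chunks = []
    start = 0   # index of the first byte of the current chunk
    ones = 0    # 1 bits in the last `window` bytes of the current chunk
    for i, byte in enumerate(data):
        ones += POPCOUNT[byte]
        if i - start >= window:
            # Drop the byte that just slid out of the interval.
            ones -= POPCOUNT[data[i - window]]
        # Assumed rule: cut only once a full interval lies inside the chunk.
        if i - start + 1 >= window and ones >= threshold:
            chunks.append(data[start:i + 1])
            start = i + 1
            ones = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing bytes with no cut-point
    return chunks


def incremental_chunks(old: bytes, new: bytes) -> list:
    """Return the chunks of `new` whose content does not occur among `old`'s chunks."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    return [c for c in chunk(new) if hashlib.sha256(c).digest() not in known]
```

With `old = (b"\x00" * 5 + b"\xff" * 3) * 3` and `new = b"\x00" + old`, the sketch splits `old` into three identical 8-byte chunks, and `incremental_chunks(old, new)` reports only the first, 9-byte chunk of `new`. Because cut-points depend on content, the chunks after the inserted byte realign with the old ones, whereas fixed-size 8-byte blocks would all differ after the shift.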

[1]  Lijun Zhang,et al.  UCDC: Unlimited Content-Defined Chunking, A File-Differing Method Apply to File-Synchronization among Multiple Hosts , 2016, 2016 12th International Conference on Semantics, Knowledge and Grids (SKG).

[2]  Hyotaek Lim,et al.  A new content-defined chunking algorithm for data deduplication in cloud storage , 2017, Future Gener. Comput. Syst..

[3]  David M. Bradley,et al.  On the Distribution of the Sum of n Non-Identically Distributed Uniform Random Variables , 2002, math/0411298.

[4]  Melody Moh,et al.  Compression of Wearable Body Sensor Network Data Using Improved Two-Threshold-Two-Divisor Data Chunking Algorithms , 2018, 2018 International Conference on High Performance Computing & Simulation (HPCS).

[5]  Kihong Kim,et al.  Differential logging: a commutative and associative logging scheme for highly parallel main memory database , 2001, Proceedings 17th International Conference on Data Engineering.

[6]  Youjip Won,et al.  MUCH: Multithreaded Content-Based File Chunking , 2015, IEEE Transactions on Computers.

[7]  Long Chen,et al.  Block-secure: Blockchain based scheme for secure P2P cloud storage , 2018, Inf. Sci..

[8]  Yuhui Deng,et al.  LDFS: A Low Latency In-Line Data Deduplication File System , 2018, IEEE Access.

[9]  Zhi Tang,et al.  Multithread Content Based File Chunking System in CPU-GPGPU Heterogeneous Architecture , 2011, 2011 First International Conference on Data Compression, Communications and Processing.

[10]  Hong Jiang,et al.  A Fast Asymmetric Extremum Content Defined Chunking Algorithm for Data Deduplication in Backup Storage Systems , 2017, IEEE Transactions on Computers.

[11]  Randal C. Burns,et al.  In-Place Rsync: File Synchronization for Mobile and Wireless Devices , 2003, USENIX Annual Technical Conference, FREENIX Track.

[12]  Song Jiang,et al.  SS-CDC: a two-stage parallel content-defined chunking for deduplicating backup storage , 2019, SYSTOR.

[13]  G. Wagner,et al.  Evolution of Evolvability in a Developmental Model , 2008, Evolution; international journal of organic evolution.

[14]  Ruixuan Li,et al.  Does the content defined chunking really solve the local boundary shift problem? , 2017, 2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC).

[15]  Wenhao Huang,et al.  MII: A Novel Content Defined Chunking Algorithm for Finding Incremental Data in Data Synchronization , 2019, IEEE Access.

[16]  Kyung-Hyune Rhee,et al.  Efficient Client-Side Deduplication of Encrypted Data With Public Auditing in Cloud Storage , 2018, IEEE Access.

[17]  Benoît Garbinato,et al.  Throughput: A Key Performance Measure of Content-Defined Chunking Algorithms , 2016, 2016 IEEE 36th International Conference on Distributed Computing Systems Workshops (ICDCSW).

[18]  Xiangdong Huang,et al.  A Novel Approach for Video Text Detection and Recognition Based on a Corner Response Feature Map and Transferred Deep Convolutional Neural Network , 2018, IEEE Access.

[19]  Puning Zhang,et al.  Improving Quality of Data: IoT Data Aggregation Using Device to Device Communications , 2018, IEEE Access.

[20]  Yucheng Zhang,et al.  A similarity-aware encrypted deduplication scheme with flexible access control in the cloud , 2017, Future Gener. Comput. Syst..

[21]  Maria Constantinou Tuning of rsync Algorithm for Optimum Cloud Storage Performance , 2013 .

[22]  Robert M. Haralick,et al.  A method for discovering knowledge in texts , 2019, Pattern Recognit. Lett..

[23]  Elisa Bertino,et al.  Trigger Inheritance and Overriding in an Active Object Database System , 2000, IEEE Trans. Knowl. Data Eng..

[24]  Fatos Xhafa Data Replication and Synchronization in P2P Collaborative Systems , 2012, 2012 IEEE 26th International Conference on Advanced Information Networking and Applications.

[25]  Andrew Tridgell,et al.  Efficient Algorithms for Sorting and Synchronization , 1999 .

[26]  Erez Zadok,et al.  Generating Realistic Datasets for Deduplication Analysis , 2012, USENIX Annual Technical Conference.

[27]  Nikolaj Bjørner,et al.  Content-dependent chunking for differential compression, the local maximum approach , 2010, J. Comput. Syst. Sci..

[28]  Yujuan Tan,et al.  Multi-Objective Metrics to Evaluate Deduplication Approaches , 2017, IEEE Access.

[29]  Choong Seon Hong,et al.  Online Caching and Cooperative Forwarding in Information Centric Networking , 2018, IEEE Access.

[30]  Suzhen Wu,et al.  PFP: Improving the Reliability of Deduplication-based Storage Systems with Per-File Parity , 2019, IEEE Transactions on Parallel and Distributed Systems.

[31]  Ning Wang,et al.  A distributed in-network caching scheme for P2P-like content chunk delivery , 2015, Comput. Networks.

[32]  Hwangjun Song,et al.  Progressive Caching System for Video Streaming Services Over Content Centric Network , 2019, IEEE Access.

[33]  Mark Lillibridge,et al.  Improving restore speed for backup systems that use inline chunk-based deduplication , 2013, FAST.

[34]  Takuro Sato,et al.  A Context-Aware Green Information-Centric Networking Model for Future Wireless Communications , 2018, IEEE Access.

[35]  Urs Niesen,et al.  An information-theoretic analysis of deduplication , 2017, 2017 IEEE International Symposium on Information Theory (ISIT).

[36]  Jun Sun,et al.  A novel text structure feature extractor for Chinese scene text detection and recognition , 2017, 2016 23rd International Conference on Pattern Recognition (ICPR).