A Two-Phase Differential Synchronization Algorithm for Remote Files

This paper presents tpsync, a two-phase synchronization algorithm that combines content-defined chunking (CDC) with fixed-size sliding-block duplicate data detection. tpsync first partitions the files to be synchronized into coarse-grained, variable-sized chunks using CDC, then locates the unmatched chunks between the two files with an edit-distance algorithm, and finally generates fine-grained delta data over those chunks using fixed-size sliding-block duplicate detection. In the first phase, tpsync quickly locates the changed chunks between two similar files by comparing their chunk fingerprints. Building on the first phase's results, the second phase applies small fixed-size sliding blocks to the corresponding unmatched chunks to produce a finer-grained delta. Extensive experiments on ASCII, binary, and database files demonstrate that tpsync outperforms rsync, the traditional fixed-size sliding-block method, in both synchronization time and total data transferred: with optimized parameters applied to both, tpsync reduces synchronization time by 12% and bandwidth by 18.9% on average. When signature-cached synchronization is adopted, tpsync yields better performance still.
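To make the two-phase structure concrete, the sketch below illustrates phase one in Python. It is a reconstruction under stated assumptions, not the authors' implementation: a gear-style rolling hash with an assumed 8 KiB average chunk size stands in for the paper's unspecified CDC method, SHA-1 digests serve as chunk fingerprints, and the function names (cdc_chunks, fingerprints, unmatched_regions) and all parameters are illustrative. A Levenshtein-style dynamic program over the fingerprint sequences marks the chunks that phase two would then refine.

    import hashlib
    import random

    # Illustrative parameters; the paper does not specify these values.
    AVG_BITS = 13                      # ~8 KiB average chunk size
    MASK = (1 << AVG_BITS) - 1
    MIN_CHUNK, MAX_CHUNK = 2048, 65536

    # 256-entry random table for a gear-style rolling hash (fixed seed so
    # both endpoints compute identical boundaries).
    _rng = random.Random(0)
    GEAR = [_rng.getrandbits(64) for _ in range(256)]

    def cdc_chunks(data: bytes) -> list[bytes]:
        """Split data into variable-sized chunks at content-defined boundaries."""
        chunks, start, h = [], 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])          # trailing chunk
        return chunks

    def fingerprints(chunks: list[bytes]) -> list[bytes]:
        """Strong per-chunk fingerprints used for coarse-grained matching."""
        return [hashlib.sha1(c).digest() for c in chunks]

    def unmatched_regions(old_fps: list[bytes], new_fps: list[bytes]) -> list[int]:
        """Edit-distance DP over chunk fingerprints; returns indices of chunks
        in the new file with no identical counterpart in the old file."""
        n, m = len(old_fps), len(new_fps)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if old_fps[i - 1] == new_fps[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                              d[i - 1][j - 1] + cost)
        # Backtrace: collect new-file chunks that were inserted or substituted;
        # these are the inputs to the fine-grained second phase.
        i, j, changed = n, m, []
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and old_fps[i - 1] == new_fps[j - 1]
                    and d[i][j] == d[i - 1][j - 1]):
                i, j = i - 1, j - 1                  # identical chunk
            elif j > 0 and d[i][j] == d[i][j - 1] + 1:
                changed.append(j - 1)                # inserted chunk
                j -= 1
            elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
                changed.append(j - 1)                # substituted chunk
                i, j = i - 1, j - 1
            else:
                i -= 1                               # chunk deleted from old file
        return sorted(changed)

    if __name__ == "__main__":
        old = b"A" * 20000 + b"B" * 20000
        new = b"A" * 20000 + b"X" * 500 + b"B" * 20000   # small insertion
        old_c, new_c = cdc_chunks(old), cdc_chunks(new)
        changed = unmatched_regions(fingerprints(old_c), fingerprints(new_c))
        print(f"{len(new_c)} chunks in new file, {len(changed)} need a fine delta")

Phase two is omitted here; per the abstract it would run an rsync-style comparison (weak rolling checksum plus strong hash over small fixed-size sliding blocks) only within the changed chunks reported above, rather than over the whole file as rsync does.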
