Improved single-round protocols for remote file synchronization

Given two versions of a file, a current version located on one machine and an outdated version known only to another machine, the remote file synchronization problem is how to update the outdated version over a network with a minimal amount of communication. In particular, when the versions are very similar, the total data transmitted should be significantly smaller than the file size. File synchronization problems arise in many application scenarios such as Web site mirroring, file system backup and replication, and Web access over slow links. An open source tool for this problem, called rsync and included in many Linux distributions, is widely used in such scenarios. rsync uses a single round of messages between the two machines. While recent research has shown that significant additional savings in bandwidth consumption are possible through the use of optimized multi-round protocols, there are many scenarios where multiple rounds are undesirable. In this paper, we study single-round protocols for file synchronization that offer significant improvements over rsync. Our main contribution is a new approach to file synchronization based on the use of erasure codes. Using this approach, we design a single-round protocol that is provably efficient with respect to common measures of file distance, and another optimized practical protocol that shows promising improvements over rsync on our data sets. In addition, we show how to obtain moderate improvements by engineering the rsync approach.
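To make the single-round setting concrete, the following is a minimal sketch of an rsync-style exchange. It is not the protocol proposed in this paper and it is not rsync's actual implementation. The machine holding the outdated file sends one hash pair per block, and the machine holding the current file replies with block references and literal bytes. The function names, the tiny block size, and the use of Adler-32 and MD5 are assumptions made for illustration. The weak checksum is also recomputed at every offset instead of being rolled.

```python
import hashlib
import zlib

BLOCK = 4  # illustrative only; practical tools use much larger blocks


def signatures(old: bytes, block: int = BLOCK) -> dict:
    """Outdated side: map weak checksum -> {strong hash: block index}."""
    sigs = {}
    for i in range(0, len(old), block):
        chunk = old[i:i + block]
        weak = zlib.adler32(chunk)
        strong = hashlib.md5(chunk).digest()
        sigs.setdefault(weak, {})[strong] = i // block
    return sigs


def make_delta(new: bytes, sigs: dict, block: int = BLOCK) -> list:
    """Current side: encode `new` as ("ref", index) and ("lit", bytes) items."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        candidates = sigs.get(zlib.adler32(chunk))
        idx = None
        if candidates is not None:
            # Confirm a weak-checksum hit with the strong hash.
            idx = candidates.get(hashlib.md5(chunk).digest())
        if idx is None:
            # No matching block here: emit one literal byte and slide forward.
            literal.append(new[pos])
            pos += 1
        else:
            if literal:
                delta.append(("lit", bytes(literal)))
                literal = bytearray()
            delta.append(("ref", idx))
            pos += len(chunk)
    if literal:
        delta.append(("lit", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block: int = BLOCK) -> bytes:
    """Outdated side: rebuild the current file from its own blocks and the delta."""
    out = bytearray()
    for kind, value in delta:
        if kind == "lit":
            out += value
        else:
            out += old[value * block:(value + 1) * block]
    return bytes(out)


old_version = b"the quick brown fox"
new_version = b"the quick red brown fox"
delta = make_delta(new_version, signatures(old_version))
rebuilt = apply_delta(old_version, delta)
```

In this sketch only the signatures travel in one direction and the delta in the other, so each side sends a single message. When the two versions are similar, most of the delta consists of short block references rather than literal data.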
