Improved file synchronization techniques for maintaining large replicated collections over slow networks

We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and Web caching networks, Web site mirroring, storage networks, and large-scale Web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost by exploiting the significant similarity between the versions? Although rsync, a popular open-source tool for this problem, is used in hundreds of thousands of installations, there have been very few attempts to improve upon it in practice. We propose a framework for remote file synchronization and describe several new techniques that result in significant bandwidth savings. Our focus is on applications where very large collections have to be maintained over slow connections. We show that a prototype implementation of our framework and techniques achieves significant improvements over rsync. As an example application, we focus on the efficient synchronization of very large Web page collections for the purpose of search, mining, and content distribution.
