Proceedings of the General Track: 2004 USENIX Annual Technical Conference

Storage systems frequently maintain identical copies of data. Identifying such data can assist in the design of solutions in which data storage, transmission, and management are optimised. In this paper we evaluate three methods for discovering identical portions of data: whole-file content hashing, fixed-size blocking, and a chunking strategy that uses Rabin fingerprints to delimit content-defined data chunks. We assess how effective each strategy is at finding identical sections of data. In our experiments, we analysed diverse data sets from a variety of storage systems, including a mirrored section of sunsite.org.uk, different data profiles in the file system infrastructure of the Cambridge University Computer Laboratory, source code distribution trees, compressed data, and packed files. We report our experimental results and present a comparative analysis of these techniques. This study also shows how levels of similarity differ between data sets and file types. Finally, we discuss the advantages and disadvantages of applying these methods in the light of our experimental results.
