Efficient Similarity Estimation for Systems Exploiting Data Redundancy

Many modern systems exploit data redundancy to improve efficiency. These systems split data into chunks, generate an identifier for each chunk, and compare these identifiers across data items to find duplicate chunks. As a result, chunk size becomes a critical parameter for the efficiency of these systems: smaller chunks can improve similarity detection, but they increase the overhead of representing more chunks. Unfortunately, the similarity between files increases unpredictably as chunk size decreases, even for data of the same type. Existing systems often pick one chunk size that is "good enough" for many cases because they lack efficient techniques to determine the benefits at other chunk sizes. This paper addresses this deficiency via two contributions: (1) we present multi-resolution (MR) handprinting, an application-independent technique that efficiently estimates similarity between data items at different chunk sizes using a compact, multi-size representation of the data; (2) we then evaluate the application of MR handprints to workloads from peer-to-peer, file transfer, and storage systems, demonstrating that the chunk size selection enabled by MR handprints can lead to real improvements over using a fixed chunk size in these systems.
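
The chunk-and-compare step described above can be illustrated with a minimal sketch. This is not the MR handprinting technique itself, which estimates similarity from a compact representation; it computes exact chunk-set similarity by brute force. It assumes fixed-size chunking, SHA-1 chunk identifiers, and Jaccard similarity over identifier sets, none of which the abstract specifies. The names `chunk_ids` and `similarity` are hypothetical.

```python
import hashlib


def chunk_ids(data: bytes, chunk_size: int) -> set[bytes]:
    """Split data into fixed-size chunks and return the set of chunk hashes.

    Assumption: fixed-size chunking and SHA-1 identifiers, chosen only
    to keep the sketch short.
    """
    return {
        hashlib.sha1(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int) -> float:
    """Jaccard similarity of the two chunk-identifier sets (an assumed metric)."""
    ids_a = chunk_ids(a, chunk_size)
    ids_b = chunk_ids(b, chunk_size)
    union = ids_a | ids_b
    if not union:
        return 1.0  # two empty inputs are treated as identical
    return len(ids_a & ids_b) / len(union)


# Two 256-byte items that differ in a single byte.
item_a = bytes(range(256))
modified = bytearray(item_a)
modified[100] = 0
item_b = bytes(modified)

# Smaller chunks localize the change, so more chunks remain shared,
# at the cost of more identifiers to store and compare.
for size in (64, 16):
    ids = len(chunk_ids(item_a, size))
    print(size, ids, similarity(item_a, item_b, size))
```

Computing similarity this way for every candidate chunk size requires re-chunking and re-hashing the full data each time, which is the cost an efficient estimator is meant to avoid.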