Exploiting Similarity for Multi-Source Downloads Using File Handprints

Many contemporary approaches for speeding up large file transfers attempt to download chunks of a data object from multiple sources. Systems such as BitTorrent quickly locate sources that have an exact copy of the desired object, but they are unable to use sources that serve similar but non-identical objects. Other systems automatically exploit cross-file similarity by identifying sources for each chunk of the object. These systems, however, require a number of lookups proportional to the number of chunks in the object, and they must store a mapping from each unique chunk in every identical and similar object to its corresponding sources. Thus, the number of lookups and mappings in such a system can be quite large, limiting its scalability. This paper presents a hybrid system that provides the best of both approaches, locating identical and similar sources for data objects using a constant number of lookups and inserting a constant number of mappings per object. We first demonstrate through extensive data analysis that similarity does exist among objects of popular file types, and that making use of it can sometimes substantially improve download times. Next, we describe handprinting, a technique that allows clients to locate similar sources using a constant number of lookups and mappings. Finally, we describe the design, implementation, and evaluation of Similarity-Enhanced Transfer (SET), a system that uses this technique to download objects. Our experimental evaluation shows that by using sources of similar objects, SET is able to significantly outperform an equivalently configured BitTorrent.
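
To make the "constant number of lookups and mappings" idea concrete, the following is a minimal sketch and not the SET implementation. It assumes fixed-size chunks, SHA-1 chunk hashes, and a handprint defined as the k smallest distinct chunk hashes of an object; the abstract does not specify any of these choices. The names `chunk_hashes`, `handprint`, and `SourceDirectory` are hypothetical, and the in-memory dictionary stands in for whatever global lookup service a real system would use.

```python
import hashlib

CHUNK_SIZE = 16 * 1024   # assumed chunk size, not taken from the paper
HANDPRINT_SIZE = 4       # assumed k, not taken from the paper


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk of the object (assumed chunking scheme)."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def handprint(data, k=HANDPRINT_SIZE, chunk_size=CHUNK_SIZE):
    """Return at most k chunk hashes: the k smallest distinct ones.

    Picking the smallest hashes is a deterministic sample, so two objects
    that share most chunks tend to pick overlapping hashes.
    """
    return sorted(set(chunk_hashes(data, chunk_size)))[:k]


class SourceDirectory:
    """Hypothetical stand-in for a lookup service mapping hash -> sources."""

    def __init__(self):
        self._table = {}

    def publish(self, source, data, k=HANDPRINT_SIZE, chunk_size=CHUNK_SIZE):
        # At most k mappings are inserted per object, whatever its size.
        for h in handprint(data, k, chunk_size):
            self._table.setdefault(h, set()).add(source)

    def find_sources(self, object_handprint):
        # At most k lookups are issued per object, whatever its size.
        found = set()
        for h in object_handprint:
            found |= self._table.get(h, set())
        return found
```

Under this selection rule, two objects that share at least one chunk and each contain fewer than k chunks missing from the other are guaranteed to share a handprint entry, so a client looking up its own handprint finds the similar source. Objects that differ more may still overlap, but only with some probability.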
