Approximate Matching for Peer-to-Peer Overlays with Cubit

Keyword search is a critical component of most content retrieval systems. Despite the emergence of efficient, completely decentralized peer-to-peer techniques for content distribution, there have been no similarly efficient, accurate, and decentralized mechanisms for content discovery based on approximate search keys. In this paper, we present Cubit, a scalable and efficient peer-to-peer system with a new search primitive that finds the k data items whose keys are most similar to a given search key. The system works by creating a keyword metric space that encompasses both the nodes and the objects in the system, where the distance between two points is a measure of the similarity between the strings that the points represent. Cubit provides a loosely structured overlay that can efficiently navigate this space. We evaluate Cubit through both a real deployment as a search plugin for a popular BitTorrent client and a large-scale simulation, and show that it provides an efficient, accurate, and robust method for handling imprecise string search in file-sharing applications.
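
To make the search primitive concrete, the following is a minimal sketch of a k-closest-keys lookup over a string metric. It assumes Levenshtein edit distance as the similarity measure, which the abstract does not specify, and it scans a local list of keys rather than routing through an overlay. The names `edit_distance` and `k_closest` are hypothetical and are not taken from Cubit.

```python
import heapq


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric), computed with a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def k_closest(query: str, keys: list[str], k: int) -> list[str]:
    """Return the k keys nearest to query; ties are broken alphabetically."""
    return heapq.nsmallest(k, keys, key=lambda key: (edit_distance(query, key), key))


if __name__ == "__main__":
    catalog = ["beetles", "beatles", "eagles", "metallica"]
    print(k_closest("beatles", catalog, 2))
```

A brute-force scan like this costs one distance computation per stored key. The point of an overlay that navigates the metric space is to answer the same query while contacting only a small number of nodes.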
