Text-Based Content Search and Retrieval in Ad-hoc P2P Communities

We consider the problem of content search and retrieval in peer-to-peer (P2P) communities. P2P computing is a potentially powerful model for information sharing among ad hoc groups of users because of its low cost of entry and its natural model for resource scaling. As P2P communities grow, however, locating information distributed across a large number of peers becomes problematic. We address this problem by adapting a state-of-the-art text-based document ranking algorithm, the vector-space model instantiated with the TFxIDF ranking rule, to the P2P environment. We make three contributions: (a) we show how to approximate TFxIDF using compact summaries of individual peers' inverted indexes rather than the inverted index of the entire communal store; (b) we develop a heuristic for adaptively determining the set of peers that should be contacted for a query; and (c) we show that our algorithm tracks TFxIDF's performance very closely, giving P2P communities a search and retrieval algorithm that performs about as well as one running on a centralized server.
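To make contributions (a) and (b) concrete, the following is a minimal Python sketch, not the algorithm evaluated in this work. It rests on several assumptions that the abstract does not state: each peer's compact summary is modeled as a plain set of terms, the global IDF weight is replaced by an "inverse peer frequency" computed from those summaries alone, and the adaptive heuristic stops after `patience` consecutive peers fail to change the current top-k. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

def summarize(docs):
    """Assumed compact summary: the set of terms appearing in a peer's documents."""
    return {t for terms in docs.values() for t in terms}

def inverse_peer_frequency(query, summaries):
    """Assumed IDF stand-in: weight a term by how few peers hold it."""
    n = len(summaries)
    ipf = {}
    for t in query:
        n_t = sum(t in s for s in summaries.values())
        if n_t:
            ipf[t] = math.log(1 + n / n_t)
    return ipf

def rank_peers(query, summaries, ipf):
    """Order peers by the summed weight of query terms found in their summary."""
    scored = []
    for peer, s in summaries.items():
        score = sum(ipf.get(t, 0.0) for t in query if t in s)
        if score > 0:
            scored.append((score, peer))
    return [peer for _, peer in sorted(scored, key=lambda x: (-x[0], x[1]))]

def score_local(query, docs, ipf):
    """TFxIDF-style score computed locally at one peer, normalized by length."""
    hits = []
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        s = sum((1 + math.log(tf[t])) * ipf[t] for t in query if tf[t] and t in ipf)
        if s > 0:
            hits.append((s / math.sqrt(len(terms)), doc_id))
    return hits

def search(query, peers, k=10, patience=2):
    """Contact peers in ranked order; stop once `patience` peers in a row
    contribute nothing new to the top-k (an assumed stopping rule)."""
    summaries = {p: summarize(docs) for p, docs in peers.items()}
    ipf = inverse_peer_frequency(query, summaries)
    top, contacted, misses = [], [], 0
    for peer in rank_peers(query, summaries, ipf):
        contacted.append(peer)
        hits = score_local(query, peers[peer], ipf)
        merged = sorted(top + hits, key=lambda x: (-x[0], x[1]))[:k]
        misses = misses + 1 if merged == top else 0
        top = merged
        if misses >= patience:
            break
    return [doc_id for _, doc_id in top], contacted

# Hypothetical three-peer community.
peers = {
    "A": {"a1": "gossip protocol for replicated database".split(),
          "a2": "bloom filter hash coding".split()},
    "B": {"b1": "vector space model for text ranking".split()},
    "C": {"c1": "text ranking with inverted index".split(),
          "c2": "cooking recipes".split()},
}
results, contacted = search(["text", "ranking"], peers, k=2)
print(results, contacted)  # ['c1', 'b1'] ['B', 'C']
```

In this toy community, peer A is never contacted because its summary contains none of the query terms. A real deployment would likely use a much more compact summary structure than a Python set.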
