On the Feasibility of Peer-to-Peer Web Indexing and Search

This paper discusses the feasibility of peer-to-peer full-text keyword search of the Web. Two classes of keyword search techniques are in use or have been proposed: flooding of queries over an overlay network (as in Gnutella), and intersection of index lists stored in a distributed hash table. We present a simple feasibility analysis based on the resource constraints and search workload. Our study suggests that the peer-to-peer network does not have enough capacity to make naive use of either of search techniques attractive for Web search. The paper presents a number of existing and novel optimizations for P2P search based on distributed hash tables, estimates their effects on performance, and concludes that in combination these optimizations would bring the problem to within an order of magnitude of feasibility. The paper suggests a number of compromises that might achieve the last order of magnitude.

[1]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[2]  R. Wilder,et al.  Wide-area Internet traffic patterns and characteristics , 1997, IEEE Netw..

[3]  Peter J. Haas,et al.  Interactive data Analysis: The Control Project , 1999, Computer.

[4]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[5]  Peter J. Haas,et al.  In-teractive data analysis with CONTROL , 1999 .

[6]  Ian H. Witten,et al.  Man-aging Gigabytes: Compressing and Indexing Docu-ments and Images , 1999 .

[7]  Ian H. Witten,et al.  Managing gigabytes (2nd ed.): compressing and indexing documents and images , 1999 .

[8]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[9]  B. Huberman,et al.  The Deep Web : Surfacing Hidden Value , 2000 .

[10]  Martin Bergman,et al.  The deep web:surfacing the hidden value , 2000 .

[11]  Erik D. Demaine,et al.  Adaptive set intersections, unions, and differences , 2000, SODA '00.

[12]  M. Naor,et al.  Optimal aggregation algorithms for middleware , 2001, PODS '01.

[13]  Ben Y. Zhao,et al.  An Infrastructure for Fault-tolerant Wide-area Location and Routing , 2001 .

[14]  J. Ritter Why Gnutella Can't Scale. No, Really , 2001 .

[15]  Mark Handley,et al.  A scalable content-addressable network , 2001, SIGCOMM '01.

[16]  Antony I. T. Rowstron,et al.  Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems , 2001, Middleware.

[17]  Robert Morris,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM 2001.

[18]  David R. Karger,et al.  Chord: A scalable peer-to-peer lookup service for internet applications , 2001, SIGCOMM '01.

[19]  Michael Mitzenmacher,et al.  Compressed bloom filters , 2001, PODC '01.

[20]  Scott Shenker,et al.  Complex Queries in Dht-based Peer-to-peer Networks , 2002 .

[21]  Omprakash D. Gnawali A Keyword-Set Search System for Peer-to-Peer Networks , 2002 .

[22]  Amin Vahdat,et al.  Efficient Peer-to-Peer Keyword Searching , 2003, Middleware.

[23]  Zhichen Xu,et al.  pSearch: information retrieval in structured overlays , 2003, CCRV.