Towards Self-Organizing Query Routing and Processing for Peer-to-Peer Web Search

The peer-to-peer computing paradigm is an intriguing alternative to Google-style search engines for querying and ranking Web content. In a network with many thousands or millions of peers, the storage and access load per peer is much lighter than for a centralized Google-like server farm; thus more powerful techniques from information retrieval, statistical learning, computational linguistics, and ontological reasoning can be employed on each peer’s local search engine to boost the quality of search results. In addition, peers can dynamically collaborate on advanced and particularly difficult queries. Moreover, a peer-to-peer setting is ideally suited to capturing local user behavior, such as query logs and click streams, and to disseminating and aggregating this information in the network, at the discretion of the corresponding user, in order to incorporate richer cognitive models. This paper gives an overview of ongoing work in the EU Integrated Project DELIS, which aims to develop foundations for a peer-to-peer search engine with Google-or-better scale, functionality, and quality, operating in a completely decentralized and self-organizing manner. The paper presents the architecture of such a system and the Minerva prototype testbed, and it discusses various core pieces of the approach: efficient execution of top-k ranking queries, strategies for query routing when a search request needs to be forwarded to other peers, maintenance of a self-organizing semantic overlay network, and exploiting and coping with user and community behavior.

∗ The authors are with the Max-Planck Institute for Computer Science in Saarbruecken, Telenor in Oslo, the University of Bologna, the Heinz-Nixdorf Institute in Paderborn, and the University of Patras. The work presented in this paper is partially supported by the EU within the 6th Framework Programme under contract 001907 “Dynamically Evolving, Large Scale Information Systems” (DELIS).
