Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices

Peer-to-Peer (P2P) search requires intelligent decisions for query routing: selecting the best peers to which a given query, initiated at some peer, should be forwarded for retrieving additional search results. These decisions are based on statistical summaries for each peer, which are usually organized on a per-keyword basis and managed in a distributed directory of routing indices. Such architectures disregard the possible correlations among keywords. Together with the coarse granularity of per-peer summaries, which is mandated for scalability, this limitation may lead to poor search result quality. This paper develops and evaluates two solutions to this problem: sk-STAT, based on single-key statistics only, and mk-STAT, based on additional multi-key statistics. In both cases, hash sketch synopses are used to compactly represent a peer's data items and are efficiently disseminated in the P2P network to form a decentralized directory. Experimental studies with Gnutella and Web data demonstrate the viability and the trade-offs of the approaches.
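
To make the role of hash sketch synopses more concrete, the following is a minimal sketch of how per-keyword synopses over a peer's document IDs could be merged and used to estimate keyword co-occurrence. It assumes Flajolet-Martin style hash sketches with stochastic averaging and an inclusion-exclusion estimate of the intersection. The names `HashSketch` and `estimate_cooccurrence`, the parameters, and the choice of SHA-1 are illustrative assumptions, not the paper's sk-STAT or mk-STAT implementation.

```python
import hashlib

PHI = 0.77351  # bias-correction constant of Flajolet-Martin probabilistic counting


class HashSketch:
    """Hypothetical hash sketch: a set of small bitmaps summarizing a set of items."""

    def __init__(self, num_bitmaps=64, bits=32):
        self.m = num_bitmaps
        self.bits = bits
        self.bitmaps = [0] * num_bitmaps

    def add(self, item):
        # Derive a 64-bit hash value from the item (assumption: SHA-1 is good enough).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        bucket = h % self.m   # which bitmap this item updates
        rest = h // self.m    # remaining hash bits
        # Index of the least-significant 1 bit, capped at the bitmap width.
        rho = (rest & -rest).bit_length() - 1 if rest else self.bits - 1
        rho = min(rho, self.bits - 1)
        self.bitmaps[bucket] |= 1 << rho

    def union(self, other):
        # The sketch of a set union is the bitwise OR of the two sketches.
        out = HashSketch(self.m, self.bits)
        out.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return out

    def estimate(self):
        # For each bitmap, find the index of the lowest zero bit, then average.
        total = 0
        for bm in self.bitmaps:
            r = 0
            while bm & (1 << r):
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


def estimate_cooccurrence(sketch_a, sketch_b):
    """Inclusion-exclusion estimate of |A ∩ B| from two single-key sketches."""
    union_est = sketch_a.union(sketch_b).estimate()
    return sketch_a.estimate() + sketch_b.estimate() - union_est


def build(doc_ids):
    s = HashSketch()
    for d in doc_ids:
        s.add(d)
    return s


# Documents on one peer containing keyword "a" and keyword "b" (made-up IDs).
docs_a = build(range(0, 6000))
docs_b = build(range(3000, 9000))
print(round(docs_a.union(docs_b).estimate()), round(estimate_cooccurrence(docs_a, docs_b)))
```

Because the intersection estimate is a difference of three noisy cardinality estimates, its relative error grows as the true overlap shrinks. This is one way to see why purely single-key statistics can misjudge correlated keywords.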
