Unified Framework for Flexible and Efficient Top-k Retrieval in Peer-to-Peer Networks

As more data from distributed sources becomes accessible, supporting queries over peer-to-peer networks of such sources becomes an increasingly compelling application scenario. The large volume of data accessible across multiple peers naturally calls for ranked retrieval, which focuses the search on the most relevant results, that is, the top-k. Although top-k retrieval has been studied actively in recent years, existing algorithms are too restrictive because of the assumptions they make about the predicates and scoring functions used. These assumptions limit the flexibility of individual users to issue personalized queries. In contrast, we present efficient algorithms that support top-k retrieval customized to the specific predicates and scoring functions each user desires. In addition, unlike existing approaches that consider only a single type of data partitioning, we generalize the scenario to peer-to-peer networks with a potentially large number of peers that may partition the data in various ways. Specifically, we develop a unified top-k query processing framework that covers the following types of data partitioning: (1) vertical partitioning, where each peer stores partial scores of an identical set of data objects; (2) horizontal partitioning, where each peer stores complete scores of a disjoint set of data objects; and (3) mixed partitioning, where each peer stores partial scores of a disjoint set of data objects. In particular, we customize user queries by transforming data synopses on a per-query basis. We also reduce bandwidth consumption by using heuristics to schedule the order in which predicates are evaluated. Our results validate the efficiency and effectiveness of the framework in terms of the bandwidth consumption, delay, and correctness of our algorithms.
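
The three partitioning types can be made concrete with a small sketch. The following Python is an illustration only, not the framework described above: it naively collects every partial score and ranks the objects in one place, with no synopses and no predicate scheduling. The predicate names (`price`, `rating`), the score values, the weighted-sum scoring function, and the particular layout used for mixed partitioning are all assumptions made for the example.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# One peer's data: object id -> {predicate name: partial score}.
PeerData = Dict[str, Dict[str, float]]

# Hypothetical scores for four objects under two hypothetical predicates.
FULL: PeerData = {
    "a": {"price": 0.9, "rating": 0.2},
    "b": {"price": 0.5, "rating": 0.8},
    "c": {"price": 0.3, "rating": 0.9},
    "d": {"price": 0.7, "rating": 0.6},
}


def project(objs: List[str], preds: List[str]) -> PeerData:
    """Build one peer's share: the given predicates for the given objects."""
    return {o: {p: FULL[o][p] for p in preds} for o in objs}


# (1) Vertical: every peer holds all objects, but only some predicates.
vertical = [project(["a", "b", "c", "d"], ["price"]),
            project(["a", "b", "c", "d"], ["rating"])]

# (2) Horizontal: every peer holds complete scores for a disjoint object set.
horizontal = [project(["a", "b"], ["price", "rating"]),
              project(["c", "d"], ["price", "rating"])]

# (3) Mixed (one possible layout, assumed here): objects are split into
# disjoint groups, and each group's predicates are spread over several peers.
mixed = [project(["a", "b"], ["price"]), project(["a", "b"], ["rating"]),
         project(["c", "d"], ["price"]), project(["c", "d"], ["rating"])]


def naive_top_k(peers: List[PeerData],
                score: Callable[[Dict[str, float]], float],
                k: int) -> List[Tuple[str, float]]:
    """Merge all partial scores, then rank with a user-supplied function."""
    merged: PeerData = {}
    for peer in peers:
        for obj, partial in peer.items():
            merged.setdefault(obj, {}).update(partial)
    best = heapq.nlargest(k, merged.items(), key=lambda item: score(item[1]))
    return [(obj, score(partial)) for obj, partial in best]


def weighted(partial: Dict[str, float]) -> float:
    """A user-specific scoring function (weights are arbitrary)."""
    return 0.3 * partial["price"] + 0.7 * partial["rating"]


for name, peers in [("vertical", vertical), ("horizontal", horizontal),
                    ("mixed", mixed)]:
    print(name, [obj for obj, _ in naive_top_k(peers, weighted, 2)])
```

All three layouts hold the same underlying scores, so the same query yields the same top-2 objects in each case; what differs is which peers must be contacted and how much data must be transferred, which is the cost a bandwidth-aware framework aims to reduce.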
