Distinct value estimation on peer-to-peer networks

Peer-to-peer (P2P) networks have become very popular on the Internet, with millions of peers all over the world sharing large volumes of data. In the assistive healthcare sector, P2P networks are likely to emerge that interconnect the patient databases of hospitals, clinics, and research laboratories and allow controlled sharing among them. However, the sheer scale of these networks makes it difficult to gather statistics that could be used to build new features. In this paper, we present a technique for estimating the number of distinct values that match a query on the network. We evaluate the technique experimentally and present results that demonstrate its effectiveness, as well as its flexibility in supporting a variety of queries and applications.
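
To make the estimation problem concrete, the following is a minimal sketch of one well-known sampling-based distinct-value estimator, the Guaranteed-Error Estimator (GEE), which scales the number of values seen exactly once in a sample by sqrt(N/n). It is not necessarily the technique presented in this paper. The function name `gee_estimate` and the assumptions that the sample is drawn uniformly at random from the matching tuples and that the total number of matching tuples N is known are ours, made only for illustration.

```python
import math
from collections import Counter


def gee_estimate(sample, population_size):
    """Estimate the number of distinct values in a population from a sample.

    Illustrative assumptions (not taken from the paper):
      - `sample` is a uniform random sample of the tuples matching the query
      - `population_size` (N) is the total number of matching tuples
    """
    n = len(sample)
    if n == 0:
        return 0.0

    # How many times each value occurs in the sample.
    value_counts = Counter(sample)
    # f[j] = number of distinct values that occur exactly j times in the sample.
    f = Counter(value_counts.values())

    singletons = f.get(1, 0)
    repeated = sum(count for j, count in f.items() if j >= 2)

    # Values seen once are scaled up to account for unseen values;
    # values seen more than once are counted as-is.
    return math.sqrt(population_size / n) * singletons + repeated


# Usage: a sample of 4 tuples drawn from 16 matching tuples on the network.
estimate = gee_estimate(["a", "a", "b", "c"], population_size=16)
print(estimate)
```

In a real P2P setting, obtaining a uniform sample is itself the hard part (for example, via random walks over the overlay), and this sketch leaves that step out entirely.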
