Distribution-free data density estimation in large-scale networks

Estimating the global data distribution in large-scale networks is an important problem that has yet to be well addressed. It can benefit many applications, especially in the cloud computing era, such as load balancing analysis, query processing, and data mining. Inspired by the inversion method for random variate (number) generation, this paper presents a novel model, called distribution-free data density estimation, for large ring-based networks. The model achieves high estimation accuracy at low estimation cost regardless of the distribution of the underlying data. It generates random samples for any arbitrary distribution by sampling the global cumulative distribution function and is free from sampling bias. With this estimation method, we can estimate data densities over both one-dimensional and multidimensional tuple sets, where the domain of each dimension may be either continuous or discrete. In large-scale networks, the key idea of distribution-free estimation is to sample a small subset of peers and use it to estimate the global data distribution over the data domain. We introduce algorithms for computing and sampling the global cumulative distribution function, on which the estimate of the global data distribution is based, together with a detailed theoretical analysis. Our extensive performance study confirms the effectiveness and efficiency of our methods in large ring-based networks.
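
To make the inversion idea concrete, the following is a minimal single-process sketch and not the algorithm of the paper. It assumes that each peer on the ring holds a sorted list of numeric items and that per-peer item counts are available, so that cumulative counts play the role of the global cumulative distribution function. The names `global_cdf_bounds`, `sample_by_inversion`, and `estimate_density` are hypothetical and chosen only for illustration.

```python
import bisect
import random
from itertools import accumulate


def global_cdf_bounds(peers):
    """Cumulative item counts per peer, walking the ring in key order.

    Assumption: in a real network these counts would have to be collected
    or approximated; here they are read directly from local lists.
    """
    return list(accumulate(len(p) for p in peers))


def sample_by_inversion(peers, k, rng):
    """Draw k items by inverting the (unnormalized) global CDF."""
    cum = global_cdf_bounds(peers)
    total = cum[-1] if cum else 0
    if total == 0:
        raise ValueError("the network holds no data items")
    samples = []
    for _ in range(k):
        u = rng.random()                       # u ~ Uniform[0, 1)
        rank = min(int(u * total), total - 1)  # global rank of the item
        i = bisect.bisect_right(cum, rank)     # peer whose range covers rank
        offset = rank - (cum[i - 1] if i > 0 else 0)
        samples.append(peers[i][offset])
    return samples


def estimate_density(samples, edges):
    """Fraction of samples in each half-open bin [edges[j], edges[j+1])."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        j = bisect.bisect_right(edges, x) - 1
        if 0 <= j < len(counts):
            counts[j] += 1
    return [c / len(samples) for c in counts]


# Hypothetical skewed network: one peer holds 90% of the items.
peers = [[i / 900 for i in range(900)], [1 + i / 100 for i in range(100)]]
samples = sample_by_inversion(peers, 2000, random.Random(7))
density = estimate_density(samples, [0, 1, 2])
```

Because every item is equally likely to be selected regardless of which peer stores it, a peer that holds more data is visited proportionally more often. Choosing peers uniformly at random instead would over-represent items stored on lightly loaded peers.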
