Multi-dimensional data density estimation in P2P networks

Estimating the global data distribution in Peer-to-Peer (P2P) networks is an important problem that has not yet been well addressed. A good estimate benefits many P2P applications, such as load balancing analysis, query processing, and data mining. In this paper, we propose a novel algorithm that uses compact multi-dimensional histogram information to achieve high estimation accuracy at low estimation cost. The data distribution is maintained in a multi-dimensional histogram that is partitioned among peers without overlap, and each part is further condensed into a set of discrete cosine transform (DCT) coefficients. By exchanging information, each peer can hierarchically accumulate this compact information into a summary of the entire histogram and thereby estimate the global data density accurately and efficiently. We present algorithms for hierarchically accumulating DCT coefficients and analyze the density estimation error, with detailed theoretical analysis and proofs. Our extensive performance study confirms the effectiveness and efficiency of our methods for density estimation in dynamic P2P networks.
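The core idea can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes a two-dimensional histogram split into equal square blocks, one per peer, and an orthonormal DCT-II in which each peer keeps only the `keep` × `keep` lowest-frequency coefficients. All function names, the block layout, and the sample counts are hypothetical, and the hierarchical exchange between peers is replaced by a plain dictionary of summaries.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(
            (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
            * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]

def transpose(m):
    return [list(col) for col in zip(*m)]

def apply_2d(block, f):
    """Apply a 1-D transform along rows, then along columns."""
    rows = [f(r) for r in block]
    return transpose([f(c) for c in transpose(rows)])

def compress(block, keep):
    """Summarise an n x n block by its keep x keep low-frequency coefficients."""
    coeffs = apply_2d(block, dct_1d)
    return [row[:keep] for row in coeffs[:keep]]

def decompress(coeffs, n):
    """Zero-pad the kept coefficients and invert the transform."""
    keep = len(coeffs)
    padded = [row + [0.0] * (n - keep) for row in coeffs]
    padded += [[0.0] * n for _ in range(n - keep)]
    return apply_2d(padded, idct_1d)

def estimate_density(summaries, n, grid):
    """Rebuild an approximate global histogram and normalise it.

    summaries maps a block's (row, col) offset to its kept coefficients.
    Truncation can produce small negative bucket values; they are left as-is.
    """
    hist = [[0.0] * grid for _ in range(grid)]
    for (r0, c0), coeffs in summaries.items():
        part = decompress(coeffs, n)
        for i in range(n):
            for j in range(n):
                hist[r0 + i][c0 + j] = part[i][j]
    total = sum(map(sum, hist))
    return [[v / total for v in row] for row in hist]

# Hypothetical setup: four peers, each owning a 4 x 4 block of an 8 x 8 grid.
N, GRID, KEEP = 4, 8, 2
peers = {
    (r0, c0): [[(r0 + i + 1) * (c0 + j + 2) for j in range(N)] for i in range(N)]
    for r0 in (0, N) for c0 in (0, N)
}
summaries = {off: compress(block, KEEP) for off, block in peers.items()}
density = estimate_density(summaries, N, GRID)
true_total = sum(sum(map(sum, b)) for b in peers.values())
# With an orthonormal 2-D DCT, the DC coefficient times n is the block's count.
dc_total = sum(s[0][0] * N for s in summaries.values())
```

In this sketch each peer ships `KEEP * KEEP` numbers instead of `N * N` bucket counts. The DC coefficient alone recovers each block's total count, so the global count survives truncation, while the accuracy of individual buckets depends on how many coefficients are kept.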
