Uniform Data Sampling from a Peer-to-Peer Network

A uniform random sample is often useful in analyzing data. Taking a uniform sample is usually straightforward when all of the data resides in one location. However, if the data is distributed across a peer-to-peer (P2P) network in which different peers hold different amounts of data, collecting a uniform sample becomes a challenging task. Random sampling can be performed using a random walk, but because peers differ both in their degree of connectivity and in the amount of data they own, a simple random walk yields a biased sample. In this paper, we propose a random walk-based sampling algorithm that can be used to sample data tuples uniformly from a large, unstructured P2P network. We model the random walk as a Markov chain and derive conditions that bound the length of the random walk necessary to achieve uniformity. A formal communication analysis shows that the communication cost of discovering a uniform data sample is logarithmic.
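
The bias correction described above can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper; it swaps in a standard Metropolis-Hastings acceptance rule so that the walk visits each peer with probability proportional to the number of tuples it holds, and then picks one local tuple uniformly. The function name `mh_walk_sample`, the dictionaries `neighbors` and `data`, and the fixed `walk_length` are all assumptions made for illustration. The sketch also assumes a connected, undirected overlay.

```python
import random
from collections import Counter


def mh_walk_sample(neighbors, data, start, walk_length, rng):
    """Return (peer, tuple) after a Metropolis-Hastings random walk.

    neighbors: dict mapping peer -> list of adjacent peers (undirected graph)
    data:      dict mapping peer -> list of locally stored tuples
    Target stationary distribution: pi(i) proportional to len(data[i]).
    """
    current = start
    for _ in range(walk_length):
        # Propose a neighbor uniformly at random: q(i -> j) = 1 / d_i.
        proposal = rng.choice(neighbors[current])
        n_i, n_j = len(data[current]), len(data[proposal])
        d_i, d_j = len(neighbors[current]), len(neighbors[proposal])
        if n_i == 0:
            # An empty peer can never be the final answer, so always leave it.
            accept = 1.0
        else:
            # MH ratio: pi(j) q(j -> i) / (pi(i) q(i -> j)) = n_j d_i / (n_i d_j).
            accept = min(1.0, (n_j * d_i) / (n_i * d_j))
        if rng.random() < accept:
            current = proposal
    # Uniform choice among the final peer's tuples.
    return current, rng.choice(data[current])


if __name__ == "__main__":
    # Hypothetical 4-peer overlay with unequal degrees and data sizes.
    neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    data = {0: ["a"], 1: ["b", "c", "d"], 2: ["e", "f"], 3: ["g", "h", "i", "j"]}
    rng = random.Random(0)
    counts = Counter(
        mh_walk_sample(neighbors, data, 0, 50, rng)[1] for _ in range(10000)
    )
    for item in sorted(counts):
        print(item, counts[item])
```

If the walk is long enough for the chain to mix, each tuple is returned with roughly equal frequency, even though the peers hold between one and four tuples and have between one and three neighbors. How long "long enough" is depends on the topology and the data distribution, which is the question the paper's Markov chain analysis addresses.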
