Sampling Content Distributed Over Graphs

Despite recent efforts to estimate the topological characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to developing a formal methodology for characterizing the vast amount of content distributed over these networks. Because of the large scale of these networks, exhaustive enumeration of this content is computationally prohibitive. In this paper, we show how one can obtain content properties by sampling only a small fraction of vertices. We first show that naively applied sampling can produce a large bias in content statistics (e.g., the average number of content duplications). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that one needs to sample most vertices in the graph to obtain accurate statistics with such a method. To address this challenge, we propose two efficient estimators, the special copy estimator (SCE) and the weighted copy estimator (WCE), which measure content characteristics using information available in the sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate from meta-information in sampled vertices. Our experiments show that WCE and SCE are cost-effective and "asymptotically unbiased". Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
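
The bias from naive sampling can be illustrated with a small simulation. The sketch below is not the paper's SCE or WCE; it is a generic inverse-weighting correction under assumptions that are ours: vertices are sampled uniformly at random, every sampled copy exposes the total number of copies of its content as meta-information, and the copy-count distribution, graph size, and function name `simulate` are hypothetical. Because content with many copies is more likely to appear in a sample, averaging the copy counts of sampled copies overestimates the true average, while weighting each sampled copy by the inverse of its copy count removes that size bias as the sample grows.

```python
import random


def simulate(num_vertices=2000, num_contents=5000, sample_fraction=0.1, seed=0):
    """Compare a naive and an inverse-weighted estimate of the mean copy count.

    All parameters and the copy-count distribution are illustrative assumptions.
    Returns (true_mean, naive_estimate, weighted_estimate).
    """
    rng = random.Random(seed)

    # Hypothetical skewed distribution of copies per content item.
    copy_counts = [rng.choice([1, 1, 1, 1, 2, 2, 5, 20]) for _ in range(num_contents)]

    # Place the copies of each content item on distinct random vertices.
    holdings = [[] for _ in range(num_vertices)]
    for content_id, k in enumerate(copy_counts):
        for v in rng.sample(range(num_vertices), k):
            holdings[v].append(content_id)

    true_mean = sum(copy_counts) / num_contents

    # Sample a small fraction of vertices uniformly at random.
    sample_size = int(sample_fraction * num_vertices)
    sampled_vertices = rng.sample(range(num_vertices), sample_size)

    # Assumed meta-information: each observed copy reports its content's copy count.
    observed = [copy_counts[c] for v in sampled_vertices for c in holdings[v]]

    # Naive estimate: average over observed copies (biased toward popular content).
    naive = sum(observed) / len(observed)

    # Inverse-weighted estimate: each observed copy counts with weight 1/k.
    weighted = len(observed) / sum(1.0 / k for k in observed)

    return true_mean, naive, weighted


if __name__ == "__main__":
    true_mean, naive, weighted = simulate()
    print(f"true mean copies:   {true_mean:.2f}")
    print(f"naive estimate:     {naive:.2f}")
    print(f"weighted estimate:  {weighted:.2f}")
```

In this setup the naive average converges to the ratio of the second moment to the first moment of the copy-count distribution rather than to the mean, which is why it is inflated; the weighted form is a harmonic-mean style correction for that size bias.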