Measuring bias in the mixing time of social graphs due to graph sampling

Sampling is used to make measurements on large social graphs feasible, and to crawl graphs from online social network services, where accessing an entire social graph at once is often impossible. Sampling algorithms aim to preserve certain properties of the original graph in the sampled (or crawled) one. Several sampling algorithms, including breadth-first search, the standard random walk, and the Metropolis-Hastings random walk, are widely used in the literature. Some of these algorithms are known to be biased, mainly towards high-degree nodes, while their bias with respect to other metrics is not well studied. In this paper we consider the bias that sampling algorithms introduce in the mixing time. We show quantitatively that some existing sampling algorithms, even those that are unbiased with respect to the degree distribution, always produce biased estimates of the mixing time of social graphs. We argue that the bias of sampling algorithms accepted in the literature is metric-dependent: a given sampling algorithm may be unbiased with respect to one property yet introduce a considerable amount of bias in others.
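To make the idea concrete, the following is a minimal sketch, not the procedure used in the paper, of how a sampled graph can report a different mixing time than the graph it was drawn from. Everything in it is an assumption made for illustration: the toy graph (two 4-cliques joined by one edge), the helper names `bfs_sample`, `lazy_step`, and `mixing_time`, the use of a lazy random walk (which avoids periodicity problems), and the threshold `eps` on the total variation distance.

```python
from collections import deque


def bfs_sample(adj, start, k):
    """Subgraph induced by the first k nodes that BFS reaches from start."""
    order, seen, queue = [start], {start}, deque([start])
    while queue and len(order) < k:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen and len(order) < k:
                seen.add(v)
                order.append(v)
                queue.append(v)
    return {u: [v for v in adj[u] if v in seen] for u in order}


def lazy_step(adj, dist):
    """Apply one step of the lazy random walk to a probability distribution."""
    nxt = dict.fromkeys(adj, 0.0)
    for u, p in dist.items():
        if not adj[u]:          # isolated node: all mass stays put
            nxt[u] += p
            continue
        nxt[u] += p / 2         # stay with probability 1/2
        share = p / 2 / len(adj[u])
        for v in adj[u]:        # otherwise move to a uniform neighbor
            nxt[v] += share
    return nxt


def mixing_time(adj, eps=0.1, max_steps=10_000):
    """Smallest t at which every start node is within eps (total variation)
    of the stationary distribution; None if not reached in max_steps.
    Assumes an undirected, connected graph with at least one edge."""
    total = sum(len(nbrs) for nbrs in adj.values())
    pi = {u: len(adj[u]) / total for u in adj}
    dists = [{v: 1.0 if v == u else 0.0 for v in adj} for u in adj]
    for t in range(1, max_steps + 1):
        dists = [lazy_step(adj, d) for d in dists]
        worst = max(0.5 * sum(abs(d[v] - pi[v]) for v in adj) for d in dists)
        if worst <= eps:
            return t
    return None


# Hypothetical toy graph: cliques {0,1,2,3} and {4,5,6,7} joined by edge 3-4.
adj = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}
sample = bfs_sample(adj, start=0, k=4)
print("full graph:", mixing_time(adj), "BFS sample:", mixing_time(sample))
```

On this toy graph the BFS sample is a single clique, so the sparse cut that slows the walk on the full graph never appears in the sample. This only illustrates the kind of metric-dependent bias discussed above; it says nothing about the magnitude of the effect on real social graphs.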
