Estimating clustering coefficients and size of social networks via random walk

Online social networks have become a major force in today's society and economy. The largest of today's social networks may have hundreds of millions to more than a billion users. Such networks are too large to be downloaded or stored locally, even if terms of use and privacy policies were to permit doing so. This limitation complicates even simple computational tasks. One such task is computing the clustering coefficient of a network. Another is computing the network size (the number of registered users) or the size of a subpopulation. The clustering coefficient, a classic measure of network connectivity, comes in two flavors: global and network average. In this work, we provide efficient algorithms for estimating these measures which (1) assume no prior knowledge about the network; and (2) access the network using only the publicly available interface. More precisely, this work provides three new estimation algorithms: (a) the first external-access algorithm for estimating the global clustering coefficient; (b) an external-access algorithm that improves on the accuracy of previous algorithms for estimating the network average clustering coefficient; and (c) an improved external-access algorithm for estimating network size. The main insight offered by this work is that our algorithms need only a relatively small number of public interface calls to achieve high-accuracy estimates. Our approach is to view a social network as an undirected graph and use the public interface to retrieve a random walk. To estimate the clustering coefficient, the connectivity of each node in the random walk sequence is tested in turn. We show that the error of this estimation drops exponentially in the number of random walk steps. Another insight of this work is that, although the proposed algorithms can be used to estimate the clustering coefficient of any undirected graph, they are particularly efficient on social-network-like graphs.
To improve on prior-art network size estimation algorithms, we count node collisions one step before they actually occur. We validate our algorithms experimentally on several publicly available social network datasets. The results support the theoretical claims and demonstrate the effectiveness of our algorithms.
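
The random-walk idea for the global clustering coefficient can be illustrated with a minimal Python sketch. It is not taken from the paper. It assumes that the connectivity test at step k asks whether the nodes visited just before and just after step k are themselves linked, and that each test is weighted by the degree of the current node to correct for the walk's bias toward high-degree nodes. The names `random_walk`, `estimate_global_clustering`, and the `neighbors` callback (standing in for a public "list friends" interface) are hypothetical.

```python
import random


def random_walk(neighbors, start, steps, rng):
    """Return the list of `steps` nodes visited by a simple random walk.

    `neighbors(v)` is a hypothetical stand-in for the public interface:
    it returns the list of nodes adjacent to v.
    """
    walk = [start]
    for _ in range(steps - 1):
        walk.append(rng.choice(neighbors(walk[-1])))
    return walk


def estimate_global_clustering(neighbors, walk):
    """Estimate the global clustering coefficient from a random walk.

    Assumed estimator (a sketch, not necessarily the paper's exact form):
    for each interior step k, phi_k = 1 if walk[k-1] and walk[k+1] are
    adjacent. The ratio
        sum(phi_k * d_k) / sum(d_k - 1)
    converges to 3 * triangles / wedges when the walk is stationary,
    where d_k is the degree of walk[k].
    """
    weighted_hits = 0.0  # sum of phi_k * d_k
    wedge_weight = 0.0   # sum of (d_k - 1)
    for k in range(1, len(walk) - 1):
        degree = len(neighbors(walk[k]))
        closed = walk[k + 1] in neighbors(walk[k - 1])
        if closed:
            weighted_hits += degree
        wedge_weight += degree - 1
    return weighted_hits / wedge_weight if wedge_weight else 0.0


if __name__ == "__main__":
    # Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
    toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    rng = random.Random(42)
    walk = random_walk(toy.__getitem__, 0, 5000, rng)
    print(estimate_global_clustering(toy.__getitem__, walk))
```

In this sketch each step costs at most two neighbor-list lookups, which is the sense in which the number of interface calls grows only with the length of the walk rather than with the size of the graph.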
