Estimating the Size of a Large Network and its Communities from a Random Sample

Most real-world networks are too large to be measured or studied directly, so there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices (nodes) in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V, E) drawn from the stochastic block model (SBM) with K communities (blocks). A sample is obtained by randomly choosing a subset W ⊆ V and letting G(W) be the subgraph of G induced by the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an extensive set of experiments to study the effects of sample size, K, and the SBM parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely used method, the network scale-up estimator, in a wide variety of scenarios.
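The observation model described above can be illustrated with a small simulation. The sketch below is not PULSE and is not taken from the paper. It assumes W is drawn uniformly at random without replacement, and all function names and parameter values are hypothetical. It draws an SBM graph, records what a sample would reveal (the induced subgraph, total degrees, and block labels), and computes a naive ratio estimate of |V|. That estimate uses only the fact that, under uniform sampling, each neighbor of a sampled vertex lies in W with probability (|W| - 1)/(|V| - 1).

```python
import random


def sample_sbm(block_sizes, p_matrix, rng):
    """Draw an undirected SBM graph; return (block label per vertex, adjacency sets)."""
    block_of = [k for k, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            # Edge probability depends only on the two block labels.
            if rng.random() < p_matrix[block_of[u]][block_of[v]]:
                adj[u].add(v)
                adj[v].add(u)
    return block_of, adj


def observe_sample(block_of, adj, sample_size, rng):
    """Return what the sample reveals: induced edges, total degrees, block labels."""
    sampled = rng.sample(range(len(adj)), sample_size)  # uniform, without replacement (assumption)
    in_sample = set(sampled)
    induced = {v: adj[v] & in_sample for v in sampled}   # G(W): neighbors inside W only
    total_degree = {v: len(adj[v]) for v in sampled}     # degree in the full graph G
    blocks = {v: block_of[v] for v in sampled}
    return induced, total_degree, blocks


def naive_size_estimate(induced, total_degree):
    """Naive ratio estimate of |V| (illustrative baseline, NOT the PULSE algorithm)."""
    n = len(induced)
    in_sample_degree = sum(len(nbrs) for nbrs in induced.values())
    if in_sample_degree == 0:
        return None  # no edges inside the sample, so the ratio is undefined
    return 1 + (n - 1) * sum(total_degree.values()) / in_sample_degree


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters: K = 2 blocks, 500 vertices in total.
    block_of, adj = sample_sbm([300, 200], [[0.10, 0.02], [0.02, 0.08]], rng)
    induced, total_degree, blocks = observe_sample(block_of, adj, 100, rng)
    print("naive estimate of |V|:", naive_size_estimate(induced, total_degree))
```

This baseline ignores the block labels entirely. The abstract's point is that using them, as PULSE does, also yields the size of each community.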
