Resampling-Based Gap Analysis for Detecting Nodes with High Centrality on Large Social Network

We address the problem of identifying nodes with high centrality values in a large social network using only approximations computed from nodes sampled from the network. More specifically, we detect gaps between nodes at a given confidence level. With the nodes sorted in descending order of their approximate centrality values, we say that a gap exists between two adjacent nodes if it divides the ordered list into two groups such that, at the given confidence level, every node in one group has a higher true centrality value than every node in the other. To this end, we incorporate confidence intervals of the true centrality values and apply the resampling-based framework to estimate these intervals as accurately as possible. Furthermore, we devise an algorithm that detects gaps efficiently in only two passes through the nodes. Using three real-world social networks, we show empirically that, at the same node coverage ratio, the proposed method detects more gaps than a method adopting a standard error estimation framework, and that the resulting gaps enable us to correctly identify a set of nodes with high centrality values.
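To make the gap definition concrete, the following is a minimal sketch of one way a two-pass gap detection could work. It is not taken from the paper. It assumes that each node already comes with a centrality estimate and a confidence interval (the `(estimate, lower, upper)` tuples and the function name `detect_gaps` are hypothetical), and it treats a gap as present when the smallest lower bound in the upper group exceeds the largest upper bound in the lower group.

```python
from typing import List, Sequence, Tuple

# Hypothetical input type: (estimate, lower bound, upper bound) per node.
Interval = Tuple[float, float, float]


def detect_gaps(intervals: Sequence[Interval]) -> List[int]:
    """Return positions k such that a gap lies between ranks k-1 and k.

    Ranks refer to the nodes sorted in descending order of their estimates.
    Assumption: a gap exists when every lower bound above the split is
    strictly greater than every upper bound below the split.
    """
    ranked = sorted(intervals, key=lambda t: t[0], reverse=True)
    n = len(ranked)
    if n < 2:
        return []

    # Pass 1 (backward): largest upper bound among ranks i..n-1.
    suffix_max_upper = [0.0] * n
    running_max = float("-inf")
    for i in range(n - 1, -1, -1):
        running_max = max(running_max, ranked[i][2])
        suffix_max_upper[i] = running_max

    # Pass 2 (forward): smallest lower bound among ranks 0..i,
    # compared against the largest upper bound among ranks i+1..n-1.
    gaps: List[int] = []
    running_min = float("inf")
    for i in range(n - 1):
        running_min = min(running_min, ranked[i][1])
        if running_min > suffix_max_upper[i + 1]:
            gaps.append(i + 1)
    return gaps


# Illustrative, made-up intervals for five nodes.
example = [
    (0.90, 0.85, 0.95),
    (0.80, 0.75, 0.88),
    (0.50, 0.40, 0.60),
    (0.45, 0.35, 0.55),
    (0.10, 0.05, 0.15),
]
print(detect_gaps(example))
```

In this made-up example the intervals of the first two nodes overlap, so no gap separates them, while the top two nodes are separated from the remaining three. Tracking a running minimum and a running maximum keeps the work linear in the number of nodes once they are sorted.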
