Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data

In this paper we show how to efficiently produce unbiased estimates of subgraph frequencies from a probability sample of egocentric networks (i.e., focal nodes, their neighbors, and the induced subgraphs of ties among their neighbors). The key feature that differentiates our method from prior methods is its use of egocentric data. As a result, our method is suitable for estimation in large unknown graphs, is easily parallelizable, handles privacy-sensitive network data (e.g., egonets with no neighbor labels), and supports counting of large subgraphs (e.g., a maximal clique of size 205 in Section 6) by building on top of existing exact subgraph counting algorithms that may not themselves support sampling. It handles a variety of sampling designs, including uniform or weighted independence sampling and random walk sampling. Our method can be used for subgraphs that are: (i) undirected or directed; (ii) induced or non-induced; (iii) maximal or non-maximal; and (iv) potentially annotated with attributes. We compare our estimators on a variety of real-world graphs and sampling methods and provide suggestions for their use. Simulations show that our method outperforms the state-of-the-art approach for relative subgraph frequencies by up to an order of magnitude for the same sample size. Finally, we apply our methodology to a rare sample of Facebook users across the social graph to estimate and interpret the clique size distribution and gender composition of cliques.
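
To make the idea of design-based estimation from egonets concrete, the following is a minimal sketch for the simplest case, triangle counts under independence sampling with replacement. It is not the estimator developed in the paper; it is a generic Hansen-Hurwitz style estimator, and the function names, the toy graph, and the draw probabilities are all assumptions made for illustration. Each triangle is visible from the egonets of its three member nodes, so the per-ego count is divided by the ego's draw probability and by three.

```python
from itertools import combinations


def ego_triangle_count(adj, ego):
    """Count triangles containing `ego`.

    Only information available in the egonet is used: the ego's neighbors
    and the ties among those neighbors.
    """
    nbrs = sorted(adj[ego])
    return sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])


def hansen_hurwitz_triangles(adj, sampled_egos, draw_prob):
    """Estimate the total number of triangles in the graph.

    sampled_egos: egos drawn independently, with replacement (assumed design).
    draw_prob:    per-draw selection probability of each node (assumed known).
    Every triangle is observable from 3 egos, hence the division by 3.
    """
    n = len(sampled_egos)
    total = sum(ego_triangle_count(adj, v) / draw_prob[v] for v in sampled_egos)
    return total / (3 * n)


# Hypothetical toy graph: one triangle (0, 1, 2) plus a pendant node 3.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

# Uniform design: every node has draw probability 1/N.
uniform = {v: 1 / len(toy_adj) for v in toy_adj}

# Weighted design with unequal (assumed) draw probabilities.
weighted = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}

# Expected value of the single-draw estimator under the weighted design.
expected_weighted = sum(
    weighted[v] * hansen_hurwitz_triangles(toy_adj, [v], weighted)
    for v in toy_adj
)
```

Under these assumptions the estimator is unbiased for any design with positive draw probabilities, since weighting each draw by the inverse of its probability cancels the selection bias in expectation. Random walk sampling would require the walk's stationary probabilities in place of `draw_prob`, and other subgraph types would need a different visibility count than three; both are left out of this sketch.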
