A General Framework for Estimating Graphlet Statistics via Random Walk

Graphlets are induced subgraph patterns that are frequently used to characterize the local topological structure of graphs in various domains, e.g., online social networks (OSNs) and biological networks. Discovering and computing graphlet statistics is highly challenging. First, the massive size of real-world graphs makes exact computation of graphlet statistics extremely expensive. Second, the graph topology may not be readily available, so one has to resort to web crawling through the available application programming interfaces (APIs). In this work, we propose a general and novel framework to estimate statistics for graphlets of any size. Our framework is based on collecting samples through consecutive steps of random walks. We derive an analytical bound on the sample size (via the Chernoff-Hoeffding technique) that guarantees the convergence of our unbiased estimator. To further improve accuracy, we introduce two novel optimization techniques that reduce the lower bound on the sample size. Experimental evaluations demonstrate that our methods outperform the state-of-the-art method by up to an order of magnitude in both accuracy and time cost.
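To make the idea of "collecting samples through consecutive steps of random walks" concrete, the following is a minimal Python sketch for the simplest case, 3-node graphlets. It is an illustration under stated assumptions, not the paper's implementation: the function name `estimate_triangle_concentration`, the adjacency-dictionary input, and the absence of a burn-in period are all choices made here. The sketch assumes an undirected, connected graph with no self-loops. Every window of three consecutive walk nodes (x1, x2, x3) with x1 != x3 induces either a wedge or a triangle. Under the stationary distribution of a simple random walk such a window appears with probability proportional to 1/deg(x2), so each sample is re-weighted by deg(x2) and divided by the number of windows that map to the same graphlet (2 for a wedge, 6 for a triangle).

```python
import random


def estimate_triangle_concentration(adj, steps, seed=None):
    """Estimate the fraction of connected induced 3-node subgraphs that are
    triangles, using windows of three consecutive random-walk nodes.

    adj   : dict mapping node -> list of neighbours (undirected, no self-loops,
            every node has at least one neighbour). Hypothetical input format.
    steps : number of random-walk steps (one window per step).
    """
    rng = random.Random(seed)
    neighbours = {v: set(ns) for v, ns in adj.items()}  # fast membership tests

    # Start anywhere; a real crawler would typically discard a burn-in prefix.
    x1 = rng.choice(sorted(adj))
    x2 = rng.choice(adj[x1])

    weight = {"wedge": 0.0, "triangle": 0.0}
    for _ in range(steps):
        x3 = rng.choice(adj[x2])
        if x3 != x1:  # the window covers three distinct nodes
            if x3 in neighbours[x1]:
                kind, windows_per_graphlet = "triangle", 6
            else:
                kind, windows_per_graphlet = "wedge", 2
            # Undo the 1/deg(x2) sampling bias and the multiple-window overcount.
            weight[kind] += len(adj[x2]) / windows_per_graphlet
        x1, x2 = x2, x3  # slide the window by one step

    total = weight["wedge"] + weight["triangle"]
    return weight["triangle"] / total if total > 0 else 0.0


if __name__ == "__main__":
    # Triangle {0, 1, 2} with a pendant node 3 attached to node 0:
    # one triangle and two induced wedges, so the true concentration is 1/3.
    graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    print(estimate_triangle_concentration(graph, steps=200_000, seed=1))
```

Only neighbour lists of visited nodes are needed, which is why this style of estimator suits graphs that can be explored only through a crawling API. Larger graphlets need longer windows and a different set of correction factors, which this sketch does not attempt.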

[8]  John C. S. Lui,et al.  A general framework for estimating graphlet statistics via random walk , 2016, VLDB 2016.