Waddling Random Walk: Fast and Accurate Mining of Motif Statistics in Large Graphs

Algorithms that mine very large graphs, such as those representing online social networks, to discover the relative frequencies of small subgraphs within them are of high interest to sociologists, computer scientists, and marketers alike. However, computing these network motif statistics by naive enumeration is often infeasible, either because of its prohibitive computational cost or because of access restrictions on the full graph data. Random-walk-based methods that estimate motif statistics by sampling only a small fraction of the subgraphs in the large graph address both of these challenges. In this paper, we present a new algorithm, called the Waddling Random Walk (WRW), which estimates the concentration of motifs of any size. It derives its name from the fact that it sways a little to the left and to the right, thus also sampling nodes not directly on the path of the random walk. The WRW algorithm achieves its computational efficiency by not attempting to enumerate the subgraphs around the random walk; instead, it uses a randomized protocol to sample subgraphs in the neighborhood of the nodes visited by the walk. In addition, WRW achieves significantly higher accuracy (measured by the closeness of its estimate to the correct value) and higher precision (measured by the variance of its estimates) than the current state-of-the-art algorithms for mining subgraph statistics. We illustrate these advantages in speed, accuracy, and precision using simulations on well-known and widely used graph datasets representing real networks.
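To make the general idea concrete, the following is a minimal sketch of random-walk-based motif estimation for the simplest case: the concentration of triangles among connected 3-node induced subgraphs. It is not the WRW protocol itself, whose details are not given in this abstract. The sketch assumes an undirected, connected graph with no isolated nodes, and every function name in it is hypothetical. At each visited node the walk "sways" to two randomly chosen neighbors and records whether the resulting 3-node subgraph is a triangle or an open path. Because a simple random walk visits a node with probability proportional to its degree d, each wedge centered at that node is sampled with probability proportional to 1/(d - 1), so each sample is reweighted by d - 1.

```python
import random
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def exact_triangle_concentration(adj):
    """Brute-force reference value: triangles / (triangles + open 3-node paths).

    Only practical for tiny graphs; this is the naive enumeration that
    sampling is meant to avoid.
    """
    triangles = open_paths = 0
    for a, b, c in combinations(sorted(adj), 3):
        edge_count = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if edge_count == 3:
            triangles += 1
        elif edge_count == 2:
            open_paths += 1
    return triangles / (triangles + open_paths)


def estimate_triangle_concentration(adj, steps, seed=0):
    """Estimate triangle concentration with a simple random walk (illustrative only).

    Assumes the graph is connected and has no isolated nodes.
    """
    rng = random.Random(seed)
    nbrs = {v: sorted(ns) for v, ns in adj.items()}  # lists for reproducible sampling
    node = min(nbrs)
    closed = opened = 0.0
    for _ in range(steps):
        node = rng.choice(nbrs[node])      # one step of the simple random walk
        d = len(nbrs[node])
        if d < 2:
            continue                       # a degree-1 node is the center of no wedge
        u, w = rng.sample(nbrs[node], 2)   # sway to two distinct neighbors
        weight = d - 1                     # undo the 1/(d - 1) sampling bias
        if w in adj[u]:
            closed += weight               # wedge lies inside a triangle
        else:
            opened += weight               # wedge is an induced open path
    triangles = closed / 3.0               # each triangle contains three wedges
    total = triangles + opened
    return triangles / total if total > 0 else 0.0


# Usage: a triangle (0, 1, 2) with a two-edge tail attached at node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
graph = build_adjacency(edges)
print("exact:   ", exact_triangle_concentration(graph))
print("estimate:", estimate_triangle_concentration(graph, steps=50000, seed=7))
```

The estimator reads only the neighbor lists of visited nodes, which is why approaches of this kind remain usable when the full graph cannot be downloaded or enumerated.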
