Inferring Higher-Order Structure Statistics of Large Networks from Sampled Edges

Exploring locally connected subgraphs (also known as motifs or graphlets) of complex networks has recently attracted a lot of attention. Previous work made the strong assumption that the graph topology of interest is known in advance. In practice, researchers sometimes have to deal with graphs whose topology is unknown because collecting and storing all topological information is expensive. Hence, what is typically available to researchers is only a snapshot of the graph, i.e., a subgraph of it. Crawling methods such as breadth-first sampling can be used to generate the snapshot. However, these methods cannot sample a streaming graph that is represented as a high-speed stream of edges. Therefore, graph mining applications such as network traffic monitoring usually use random edge sampling (i.e., sampling each edge with a fixed probability) to collect edges and generate a sampled graph, which we call a “RESampled graph”. Clearly, a RESampled graph's motif statistics may be quite different from those of the original graph. To resolve this, we propose a framework, Minfer, which takes the given RESampled graph and accurately infers the underlying graph's motif statistics. Experiments on large-scale datasets show the accuracy and efficiency of our method.
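
To make the sampling model concrete, the following is a minimal Python sketch of random edge sampling together with a naive correction for one motif, the triangle. It is not the Minfer framework: it assumes an undirected graph given as a list of integer node pairs, and it relies only on the fact that a triangle survives independent edge sampling with probability p^3. The function names (`random_edge_sample`, `count_triangles`, `estimate_triangles`) are illustrative and do not come from the paper.

```python
import random


def random_edge_sample(edges, p, seed=None):
    """Keep each edge independently with probability p (random edge sampling)."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # Normalize edges so (u, v) and (v, u) are the same edge; drop self-loops.
    unique = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    adj = {}
    for u, v in unique:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is found once per edge, i.e. three times in total.
    total = sum(len(adj[u] & adj[v]) for u, v in unique)
    return total // 3


def estimate_triangles(sampled_edges, p):
    """Scale the sampled triangle count by 1 / p^3 (assumed simple estimator)."""
    return count_triangles(sampled_edges) / p ** 3


if __name__ == "__main__":
    # Complete graph on 30 nodes as a toy "original" graph.
    n = 30
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sampled = random_edge_sample(edges, p=0.5, seed=1)
    print("true triangles:     ", count_triangles(edges))
    print("triangles in sample:", count_triangles(sampled))
    print("scaled estimate:    ", estimate_triangles(sampled, 0.5))
```

The raw count on the sampled graph is far below the true count, which is the mismatch the abstract describes. Simple rescaling handles triangle counts in expectation, but it does not by itself recover the relative frequencies of different motif types, since one motif in the original graph can appear as a different motif after edges are dropped.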
