Set-based approximate approach for lossless graph summarization

Graph summarization is a valuable approach for analyzing various real-life phenomena, such as communities, influential nodes, and information flow, in a big graph. To summarize a graph, nodes with similar neighbors are merged into super nodes, and their corresponding edges are compressed into super edges. Existing methods find similar nodes either by node ordering or by pairwise similarity computations. Compression-by-node-ordering approaches are scalable but provide less compression, whereas pairwise approaches compress better but scale poorly because of their exhaustive similarity computations. In this paper, we propose a novel set-based summarization approach that directly summarizes naturally occurring sets of similar nodes in a graph. Our approach is scalable because it avoids explicit similarity computations with non-similar nodes and merges sets of nodes in each iteration. It also provides a good compression ratio because each set consists of highly similar nodes. To locate sets of similar nodes, we first find candidate sets by using locality-sensitive hashing. However, the member nodes of a candidate set have varying similarities with each other. Therefore, we propose a heuristic based on the similarity among the degrees of candidate nodes, and a parameter-free pruning technique, to effectively identify the subset of highly similar nodes among the candidates. In experiments on real-world graphs, our approach requires less execution time than pairwise graph summarization, by a margin of an order of magnitude on graphs containing nodes with highly diverse neighborhoods, and produces summaries of similar accuracy. We also observe scalability comparable to that of the compression-by-node-ordering method, while providing a better compression ratio.
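
As a rough illustration of the candidate-generation step described above, the following Python sketch groups nodes whose neighbor sets share a MinHash signature, one common locality-sensitive hashing scheme for Jaccard similarity. The function names, the number of hash functions, the hash family, and the toy graph are all assumptions made for illustration; the abstract does not state which hashing scheme or parameters the approach uses, and the degree-based heuristic and pruning step are not reproduced here.

```python
import random
from collections import defaultdict

# Assumption: node ids are non-negative integers smaller than PRIME.
PRIME = 2_147_483_647


def make_hash_funcs(k, seed=0):
    """Draw k hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(neighbors, hash_funcs):
    """Return one min-hash value per hash function for a neighbor set."""
    return tuple(min((a * n + b) % PRIME for n in neighbors) for a, b in hash_funcs)


def candidate_sets(adj, num_hashes=2, seed=0):
    """Bucket nodes by signature; buckets with 2+ nodes are candidate sets."""
    hash_funcs = make_hash_funcs(num_hashes, seed)
    buckets = defaultdict(set)
    for node, nbrs in adj.items():
        if nbrs:  # isolated nodes have no neighborhood to hash
            buckets[minhash_signature(nbrs, hash_funcs)].add(node)
    return [members for members in buckets.values() if len(members) > 1]


def jaccard(a, b):
    """Jaccard similarity of two neighbor sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy graph: nodes 1, 2, 3 share the same neighbors.
adj = {
    1: {10, 11, 12}, 2: {10, 11, 12}, 3: {10, 11, 12}, 4: {20, 21},
    10: {1, 2, 3}, 11: {1, 2, 3}, 12: {1, 2, 3}, 20: {4}, 21: {4},
}

for members in candidate_sets(adj):
    print(sorted(members))
```

Nodes with identical neighbor sets always receive the same signature, so they are guaranteed to land in one candidate set. Nodes with only partially overlapping neighborhoods collide with a probability that grows with their Jaccard similarity, which is why the members of a candidate set can differ in how similar they are to one another and why a further selection step is needed.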
