SWeG: Lossless and Lossy Summarization of Web-Scale Graphs

Given a terabyte-scale graph distributed across multiple machines, how can we summarize it with far fewer nodes and edges, so that the original graph can be restored exactly or within error bounds? Large-scale graphs are ubiquitous, ranging from web graphs to online social networks, so representing them compactly is important for storing and processing them efficiently. Given a graph, graph summarization aims to find a compact representation consisting of (a) a summary graph, whose nodes are disjoint sets of nodes in the input graph and whose edges each indicate the edges between all pairs of nodes in the two sets; and (b) edge corrections for restoring the input graph from the summary graph exactly or within error bounds. Although graph summarization is a widely used graph-compression technique that is readily combined with other techniques, existing summarization algorithms are unsatisfactory in terms of speed or compactness of their outputs. More importantly, they assume that the input graph is small enough to fit in main memory. In this work, we propose SWeG, a fast parallel algorithm for summarizing graphs with compact representations. SWeG is designed not only for shared-memory settings but also for MapReduce settings, so that it can summarize graphs that are too large to fit in main memory. We demonstrate that SWeG is (a) Fast: SWeG is up to 5400× faster than competitors that give similarly compact representations, (b) Scalable: SWeG scales to graphs with tens of billions of edges, and (c) Compact: combined with state-of-the-art compression methods, SWeG achieves up to 3.4× better compression than those methods achieve on their own.
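
To make the representation described above concrete, the following minimal Python sketch restores an undirected graph from a summary graph plus edge corrections in the lossless case. It is an illustration of the output format only, not the SWeG algorithm itself. The function name `restore_edges`, the argument names, and the split of corrections into `additions` and `deletions` are assumptions introduced here, as is the convention that a superedge from a supernode to itself stands for all pairs of distinct nodes inside that supernode.

```python
from itertools import combinations, product


def restore_edges(supernodes, superedges, additions, deletions):
    """Rebuild the undirected edge set from a summary graph and corrections.

    supernodes: dict mapping a supernode id to the set of original nodes in it
    superedges: iterable of (supernode id, supernode id) pairs
    additions:  original edges missing from the expanded summary (to add)
    deletions:  edges implied by the summary but absent in the original (to remove)
    All names here are hypothetical and chosen for illustration.
    """
    edges = set()
    for a, b in superedges:
        if a == b:
            # Assumed convention: a self-loop covers all pairs inside the set.
            pairs = combinations(sorted(supernodes[a]), 2)
        else:
            # A superedge covers every pair across the two disjoint sets.
            pairs = product(supernodes[a], supernodes[b])
        for u, v in pairs:
            edges.add(frozenset((u, v)))
    # Apply the corrections: remove spurious edges, then add missing ones.
    edges -= {frozenset(e) for e in deletions}
    edges |= {frozenset(e) for e in additions}
    return edges


# Toy example: six original edges stored as one superedge and two corrections.
supernodes = {"A": {1, 2, 3}, "B": {4, 5}}
superedges = [("A", "B")]
additions = [(1, 2)]
deletions = [(3, 5)]
restored = restore_edges(supernodes, superedges, additions, deletions)
```

In this toy case the superedge between `A` and `B` expands to six node pairs; removing the pair (3, 5) and adding the pair (1, 2) yields the six edges of the original graph, so one superedge and two corrections stand in for six stored edges.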
