Universal Graph Compression: Stochastic Block Models

Large-scale graph data such as social networks, web graphs, and biological networks are widely processed and mined in data science, and storing and transmitting such data incurs high I/O and communication costs. Motivated by this, the paper investigates lossless compression of data that takes the form of a labeled graph. It proposes a universal graph compression scheme, which does not depend on the underlying statistics or distribution of the graph model. For graphs generated by a stochastic block model, a widely used random graph model that captures the clustering effects in social networks, the proposed scheme achieves the optimal theoretical limit of lossless compression without knowing the edge probabilities, the community labels, or the number of communities. The key ideas in establishing universality for stochastic block models are 1) a block decomposition of the adjacency matrix of the graph and 2) a generalization of the Krichevsky-Trofimov probability assignment, which was initially designed for i.i.d. random processes. On four benchmark graph datasets (protein-protein interaction, LiveJournal friendship, Flickr, and YouTube), the compressed files produced by competing algorithms (CSR, Ligra+, a PNG image compressor, and a Lempel-Ziv compressor for two-dimensional data) take 2.4 to 27 times the space needed by the proposed scheme.
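The two key ideas can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `kt_code_length` and `upper_blocks`, the choice to use only blocks strictly above the block diagonal, and the assumption that the block size `k` divides the number of nodes are all illustrative. Each k-by-k block of the adjacency matrix is treated as one symbol from an alphabet of size 2^(k^2), and the sequence of blocks is scored with the standard add-1/2 Krichevsky-Trofimov sequential assignment, which gives the ideal code length an arithmetic coder would approach.

```python
import math
from collections import Counter


def kt_code_length(symbols, alphabet_size):
    """Ideal code length (bits) under the add-1/2 KT sequential assignment.

    After t symbols, symbol a is assigned probability
    (count[a] + 1/2) / (t + alphabet_size / 2).
    """
    counts = Counter()
    bits = 0.0
    for t, s in enumerate(symbols):
        p = (counts[s] + 0.5) / (t + alphabet_size / 2.0)
        bits -= math.log2(p)
        counts[s] += 1
    return bits


def upper_blocks(adj, k):
    """Yield the k-by-k blocks strictly above the block diagonal.

    Each block is flattened row by row into a tuple of 0/1 entries.
    Assumption for this sketch: k divides the number of nodes, and
    diagonal blocks are ignored.
    """
    n = len(adj)
    nb = n // k
    for bi in range(nb):
        for bj in range(bi + 1, nb):
            yield tuple(
                adj[bi * k + r][bj * k + c]
                for r in range(k)
                for c in range(k)
            )


if __name__ == "__main__":
    # Hypothetical 4-node symmetric adjacency matrix, block size k = 2.
    adj = [
        [0, 1, 1, 0],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [0, 1, 1, 0],
    ]
    k = 2
    blocks = list(upper_blocks(adj, k))
    print(blocks)
    print(kt_code_length(blocks, alphabet_size=2 ** (k * k)))
```

The estimator never sees edge probabilities or community labels; it only counts how often each block pattern has occurred so far, which is the sense in which such a scheme is universal.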
