Universal Graph Compression: Stochastic Block Models

Motivated by the many data science applications that process large-scale graph data, such as social networks, web graphs, and biological networks, and by the high I/O and communication costs of storing and transmitting such data, this paper investigates universal compression of data in the form of a labeled graph. In particular, we consider a widely used random graph model, the stochastic block model (SBM), which captures the clustering effects seen in social networks. We propose a universal graph compressor that achieves the optimal compression rate for a wide family of SBMs with edge probabilities ranging from $O(1)$ to $\Omega(1/n^{2-\epsilon})$ for any $0 < \epsilon < 1$. Existing universal compression techniques were developed mostly for stationary ergodic one-dimensional sequences whose entropy is linear in the number of variables. The adjacency matrix of an SBM, however, has complex two-dimensional correlations, and its entropy is sublinear in the sparse regime. We address these challenges with a carefully designed transform that converts the two-dimensional correlated data into almost i.i.d. blocks. The blocks are then compressed by a Krichevsky-Trofimov compressor, whose length analysis we generalize to arbitrarily correlated processes with identical marginals.
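To make the block-then-compress idea concrete, the following is a minimal sketch and not the paper's actual construction. It assumes the transform can be approximated by cutting the adjacency matrix into non-overlapping $k \times k$ submatrices and treating each one as a symbol from an alphabet of size $2^{k^2}$. It then evaluates the ideal code length assigned by the standard Krichevsky-Trofimov sequential estimator, $P(a) = (n_a + 1/2)/(t + m/2)$. The function names `blocks` and `kt_code_length` are hypothetical, and the sketch ignores the symmetry of undirected graphs and the arithmetic coder that would turn these probabilities into bits.

```python
import math
from collections import Counter


def blocks(adj, k):
    """Cut an n x n 0/1 matrix (n divisible by k) into k x k blocks.

    Each block is flattened row by row into a tuple so it can act as one symbol.
    This tiling is an illustrative assumption, not the paper's exact transform.
    """
    n = len(adj)
    out = []
    for i in range(0, n, k):
        for j in range(0, n, k):
            out.append(tuple(adj[r][c]
                             for r in range(i, i + k)
                             for c in range(j, j + k)))
    return out


def kt_code_length(symbols, alphabet_size):
    """Ideal code length in bits under the Krichevsky-Trofimov estimator.

    At step t (0-based), symbol a with n_a prior occurrences gets probability
    (n_a + 1/2) / (t + alphabet_size / 2).
    """
    counts = Counter()
    bits = 0.0
    for t, s in enumerate(symbols):
        p = (counts[s] + 0.5) / (t + alphabet_size / 2)
        bits -= math.log2(p)
        counts[s] += 1
    return bits


if __name__ == "__main__":
    # Hypothetical 4 x 4 adjacency matrix with two small clusters.
    adj = [
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0],
    ]
    k = 2
    syms = blocks(adj, k)
    print(len(syms), "blocks,", kt_code_length(syms, 2 ** (k * k)), "bits")
```

The block size `k` controls a trade-off in this sketch: larger blocks capture more of the two-dimensional correlation inside each symbol, but the alphabet grows as $2^{k^2}$, so the estimator needs more symbols before its counts become informative.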

[1]  Tatsuya Akutsu,et al.  Comparing biological networks via graph compression , 2010, BMC Systems Biology.

[2]  J. Ian Munro,et al.  Succinct encoding of arbitrary graphs , 2013, Theor. Comput. Sci..

[3]  Wojciech Szpankowski,et al.  Compression of Graphical Structures: Fundamental Limits, Algorithms, and Experiments , 2012, IEEE Transactions on Information Theory.

[4]  Venkat Anantharam,et al.  Universal lossless compression of graphical data , 2017, 2017 IEEE International Symposium on Information Theory (ISIT).

[5]  A. Frieze,et al.  Introduction to Random Graphs , 2016 .

[6]  Guy E. Blelloch,et al.  Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+ , 2015, 2015 Data Compression Conference.

[7]  Emmanuel Abbe,et al.  Exact Recovery in the Stochastic Block Model , 2014, IEEE Transactions on Information Theory.

[8]  M. Narasimha Murty,et al.  Structural Neighborhood Based Classification of Nodes in a Network , 2016, KDD.

[9]  Huan Liu,et al.  Relational learning via latent social dimensions , 2009, KDD.

[10]  Moni Naor Succinct representation of general unlabeled graphs , 1990, Discret. Appl. Math..

[11]  Gonzalo Navarro,et al.  Compressing Web Graphs like Texts ∗ , 2007 .

[12]  Thomas C. Conway,et al.  Succinct data structures for assembling large genomes , 2010, Bioinform..

[13]  F. Alajaji,et al.  Lectures Notes in Information Theory , 2000 .

[14]  Abraham Lempel,et al.  Compression of individual sequences via variable-rate coding , 1978, IEEE Trans. Inf. Theory.

[15]  Gonzalo Navarro,et al.  k2-Trees for Compact Web Graph Representation , 2009, SPIRE.

[16]  Emmanuel Abbe,et al.  Community Detection in General Stochastic Block models: Fundamental Limits and Efficient Algorithms for Recovery , 2015, 2015 IEEE 56th Annual Symposium on Foundations of Computer Science.

[17]  György Turán,et al.  On the succinct representation of graphs , 1984, Discret. Appl. Math..

[18]  Emmanuel Abbe,et al.  Graph compression: The effect of clusters , 2016, 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[19]  Abraham Lempel,et al.  Compression of two-dimensional data , 1986, IEEE Trans. Inf. Theory.

[20]  Jure Leskovec,et al.  node2vec: Scalable Feature Learning for Networks , 2016, KDD.

[21]  Silvio Lattanzi,et al.  On compressing social networks , 2009, KDD.

[22]  Sebastiano Vigna,et al.  The webgraph framework I: compression techniques , 2004, WWW '04.

[23]  Kunihiko Sadakane,et al.  New text indexing functionalities of the compressed suffix arrays , 2003, J. Algorithms.

[24]  Michelle Effros,et al.  Universal lossless source coding with the Burrows Wheeler transform , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[25]  Andrew R. Barron,et al.  Asymptotic minimax regret for data compression, gambling, and prediction , 1997, IEEE Trans. Inf. Theory.

[26]  Andrew R. Barron,et al.  Minimax redundancy for the class of memoryless sources , 1997, IEEE Trans. Inf. Theory.

[27]  Elchanan Mossel,et al.  Reconstruction and estimation in the planted partition model , 2012, Probability Theory and Related Fields.

[28]  Abraham Lempel,et al.  A universal algorithm for sequential data compression , 1977, IEEE Trans. Inf. Theory.

[29]  Frans M. J. Willems,et al.  The context-tree weighting method: basic properties , 1995, IEEE Trans. Inf. Theory.

[30]  Torsten Hoefler,et al.  Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations , 2018, ArXiv.

[31]  Ryan A. Rossi,et al.  GraphZIP: a clique-based sparse graph compression method , 2018, Journal of Big Data.

[32]  Sergio Verdú,et al.  Compressing data on graphs with clusters , 2017, 2017 IEEE International Symposium on Information Theory (ISIT).

[33]  Heiko Schwarz,et al.  Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard , 2003, IEEE Trans. Circuits Syst. Video Technol..

[34]  A. Rinaldo,et al.  Random networks, graphical models and exchangeability , 2017, 1701.08420.

[35]  Christos Faloutsos,et al.  SlashBurn: Graph Compression and Mining beyond Caveman Communities , 2014, IEEE Transactions on Knowledge and Data Engineering.

[36]  Guy E. Blelloch,et al.  Ligra: a lightweight graph processing framework for shared memory , 2013, PPoPP '13.