Distributed Enumeration of Four Node Graphlets at Quadrillion-Scale

Graphlet enumeration is a basic task in graph analysis with many applications, so it is important to be able to perform it within a reasonable amount of time. This is challenging when the input graph is very large, with millions of nodes and edges, and known solutions are limited in scalability. Distributed computing is often proposed as a way to improve scalability, but it must be applied carefully to keep the overhead low enough that the distributed solution actually pays off. We study the enumeration of four-node graphlets in undirected graphs on a distributed platform. We propose an efficient distributed solution that significantly outperforms existing solutions. With this method we are able to process large graphs that had not been processed before and to enumerate quadrillions of graphlets using a modest cluster of machines. We demonstrate the scalability of our solution through experimental results. Finally, we extend our algorithm to enumerate graphlets in probabilistic graphs and demonstrate its suitability for this case.
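To make the task concrete, the following is a minimal single-machine sketch of what counting four-node graphlets in an undirected graph means. It is a brute-force illustration only, not the distributed algorithm proposed in the paper; the function name `count_four_node_graphlets`, the graphlet labels, and the degree-sequence lookup table are assumptions introduced for this example. It relies on the fact that the six connected four-node graphs are distinguished by the sorted degree sequence of their induced subgraph.

```python
from collections import Counter
from itertools import combinations

# Sorted degree sequence of the induced subgraph -> graphlet label.
# Labels are illustrative; the six connected 4-node graphs each have a
# distinct degree sequence, and disconnected ones match none of these.
GRAPHLET_BY_DEGREES = {
    (1, 1, 2, 2): "3-path",
    (1, 1, 1, 3): "3-star",
    (2, 2, 2, 2): "4-cycle",
    (1, 2, 2, 3): "tailed-triangle",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "4-clique",
}


def count_four_node_graphlets(edges):
    """Count induced connected 4-node subgraphs by type (brute force)."""
    # Build an undirected adjacency structure, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    counts = Counter()
    # Examine every set of four nodes: O(n^4), fine only for tiny graphs.
    for quad in combinations(sorted(adj), 4):
        degrees = tuple(
            sorted(sum(1 for w in quad if w in adj[u]) for u in quad)
        )
        label = GRAPHLET_BY_DEGREES.get(degrees)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # A 4-clique on nodes 0..3 with one extra node 4 attached to node 0.
    k4_with_tail = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
    print(dict(count_four_node_graphlets(k4_with_tail)))
```

Examining every four-node subset costs time on the order of n^4, which is workable only for toy inputs; that gap is what motivates the scalable distributed approach the abstract describes.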
