Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling

Extracting dense subgraphs from large graphs is a key primitive in a variety of graph mining applications, ranging from mining social networks and the Web graph to bioinformatics [41]. In this paper we focus on a family of polynomial-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57]. When k=2, the problem becomes the well-known densest subgraph problem (DSP) [22, 31, 33, 39]. Our main contribution is a sampling scheme that yields a densest subgraph sparsifier, giving a randomized algorithm that produces high-quality approximations while providing significant speedups and improved space complexity. We also extend this family of formulations to bipartite graphs by introducing the (p,q)-biclique densest subgraph problem ((p,q)-Biclique-DSP), and devise an exact algorithm that treats both clique and biclique densities in a unified way. As an example of performance, our sparsifying algorithm extracts the 5-clique densest subgraph, which is a large near-clique on 62 vertices, from a large collaboration network. It achieves 100% accuracy over five runs, with an average speedup factor of over 10,000. Specifically, we reduce the running time from ∼2,107 seconds to an average of 0.15 seconds. We also use our methods to study how the k-clique densest subgraphs change over time in time-evolving networks for various small values of k. We observe significant deviations between the experimental findings on real-world networks and those on stochastic Kronecker graphs, a random graph model that mimics real-world networks in certain aspects. We believe that our work is a significant advance in routines with rigorous theoretical guarantees for scalable extraction of large near-cliques from networks.
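The abstract does not spell out the sampling scheme, so the following is only a minimal sketch of the general idea under stated assumptions: each k-clique is kept independently with probability p, the k-clique density of a vertex set S is taken to be (number of k-cliques inside S) / |S|, and a simple greedy peeling heuristic stands in for the exact algorithm mentioned above. All function names (`make_graph`, `k_cliques`, `sample_cliques`, `clique_density`, `greedy_peel`) are hypothetical and chosen for illustration.

```python
import random
from itertools import combinations


def make_graph(edges):
    """Build an undirected adjacency map {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_cliques(adj, k):
    """Brute-force k-clique listing; suitable for toy graphs only."""
    return [c for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def sample_cliques(cliques, p, rng):
    """Assumed sparsifier: keep each k-clique independently with probability p."""
    return [c for c in cliques if rng.random() < p]


def clique_density(cliques, vertices):
    """Number of listed cliques fully inside `vertices`, divided by |vertices|."""
    s = set(vertices)
    if not s:
        return 0.0
    return sum(1 for c in cliques if s.issuperset(c)) / len(s)


def greedy_peel(cliques, vertices):
    """Heuristic stand-in for an exact solver: repeatedly drop the vertex that
    lies in the fewest remaining cliques and remember the densest set seen."""
    alive = set(vertices)
    remaining = [c for c in cliques if alive.issuperset(c)]
    best_set, best_density = set(alive), clique_density(remaining, alive)
    while len(alive) > 1:
        count = {v: 0 for v in alive}
        for c in remaining:
            for v in c:
                count[v] += 1
        victim = min(alive, key=lambda v: (count[v], v))  # ties broken by label
        alive.remove(victim)
        remaining = [c for c in remaining if victim not in c]
        density = clique_density(remaining, alive)
        if density > best_density:
            best_set, best_density = set(alive), density
    return best_set, best_density


if __name__ == "__main__":
    # Toy input: a 5-clique on vertices 0..4 with a path 4-5-6-7 attached.
    edges = list(combinations(range(5), 2)) + [(4, 5), (5, 6), (6, 7)]
    adj = make_graph(edges)
    triangles = k_cliques(adj, 3)

    p = 0.5  # illustrative sampling probability, not a value from the paper
    sampled = sample_cliques(triangles, p, random.Random(1))
    subset, sampled_density = greedy_peel(sampled, adj)
    # Dividing by p rescales the sampled density to an estimate of the true one.
    print(sorted(subset), sampled_density / p)
```

In this sketch the expensive optimization runs only on the sampled cliques, which is where a speedup would come from; the set it returns can then be re-scored against the full clique list with `clique_density` to check the quality of the approximation.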

[1]  Eli Upfal,et al.  Probability and Computing: Randomized Algorithms and Probabilistic Analysis , 2005 .

[2]  Ankur Moitra,et al.  Approximation Algorithms for Multicommodity-Type Problems with Guarantees Independent of the Graph Size , 2009, 2009 50th Annual IEEE Symposium on Foundations of Computer Science.

[3]  Andrew V. Goldberg,et al.  Finding a Maximum Density Subgraph , 1984 .

[4]  Christos Faloutsos,et al.  Kronecker Graphs: An Approach to Modeling Networks , 2008, J. Mach. Learn. Res..

[5]  Irene Finocchi,et al.  Counting small cliques in MapReduce , 2014, ArXiv.

[6]  Frank Thomson Leighton,et al.  Extensions and limits to vertex sparsification , 2010, STOC '10.

[7]  Gregory Buehrer,et al.  A scalable pattern mining approach to web graph compression with communities , 2008, WSDM '08.

[8]  Francesco Bonchi,et al.  Finding Subgraphs with Maximum Total Density and Limited Overlap , 2015, WSDM.

[9]  Ümit V. Çatalyürek,et al.  Finding Hierarchical and Overlapping Dense Subgraphs using Nucleus Decompositions , 2014 .

[10]  Alexandr Andoni,et al.  Towards (1 + ∊)-Approximate Flow Sparsifiers , 2013, SODA.

[11]  Ümit V. Çatalyürek,et al.  Finding the Hierarchy of Dense Subgraphs using Nucleus Decompositions , 2014, WWW.

[12]  Kian-Lee Tan,et al.  On triangulation-based dense neighborhood graph discovery , 2010, VLDB 2010.

[13]  Ravi Kumar,et al.  Discovering Large Dense Subgraphs in Massive Graphs , 2005, VLDB.

[14]  Johan Håstad,et al.  Clique is hard to approximate within n^(1−ε) , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[15]  Yang Xiang,et al.  3-HOP: a high-compression indexing scheme for reachability query , 2009, SIGMOD Conference.

[16]  Norishige Chiba,et al.  Arboricity and Subgraph Listing Algorithms , 1985, SIAM J. Comput..

[17]  J. Håstad Clique is hard to approximate within n^(1−ε) , 1999 .

[18]  Charalampos E. Tsourakakis Mathematical and Algorithmic Analysis of Network and Biological Data , 2014, ArXiv.

[19]  Divesh Srivastava,et al.  Dense subgraph maintenance under streaming edge weight updates for real-time story identification , 2013, The VLDB Journal.

[20]  Arild Stubhaug Acta Mathematica , 1886, Nature.

[21]  C. Bron,et al.  Algorithm 457: finding all cliques of an undirected graph , 1973 .

[22]  Serafim Batzoglou,et al.  MotifCut: regulatory motifs finding with maximum density subgraphs , 2006, ISMB.

[23]  David R. Karger,et al.  Approximating s-t minimum cuts in Õ(n^2) time , 1996, STOC '96.

[24]  Charalampos E. Tsourakakis,et al.  Denser than the densest subgraph: extracting optimal quasi-cliques with quality guarantees , 2013, KDD.

[25]  Andrew V. Goldberg,et al.  A new approach to the maximum flow problem , 1986, STOC '86.

[26]  Moses Charikar,et al.  Greedy approximation algorithms for finding dense components in a graph , 2000, APPROX.

[27]  Yin Tat Lee,et al.  Path Finding Methods for Linear Programming: Solving Linear Programs in Õ(√rank) Iterations and Faster Algorithms for Maximum Flow , 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[28]  Samir Khuller,et al.  On Finding Dense Subgraphs , 2009, ICALP.

[29]  James B. Orlin,et al.  A faster strongly polynomial time algorithm for submodular function minimization , 2007, Math. Program..

[30]  Francesco Bonchi,et al.  Core decomposition of uncertain graphs , 2014, KDD.

[31]  David Eppstein,et al.  Listing All Maximal Cliques in Sparse Graphs in Near-optimal Time , 2010, Exact Complexity of NP-hard Problems.

[32]  Yousef Saad,et al.  Dense Subgraph Extraction with Application to Community Detection , 2012, IEEE Transactions on Knowledge and Data Engineering.

[33]  Mikkel Thorup,et al.  Approximate distance oracles , 2001, JACM.

[34]  Andrew V. Goldberg,et al.  On Implementing the Push—Relabel Method for the Maximum Flow Problem , 1997, Algorithmica.

[35]  David Eppstein,et al.  Arboricity and Bipartite Subgraph Listing Algorithms , 1994, Inf. Process. Lett..

[36]  Robert E. Tarjan,et al.  A Fast Parametric Maximum Flow Algorithm and Applications , 1989, SIAM J. Comput..

[37]  Kumar Chellapilla,et al.  Finding Dense Subgraphs with Size Bounds , 2009, WAW.

[38]  J. Jeffry Howbert,et al.  The Maximum Clique Problem , 2007 .

[39]  Christos Faloutsos,et al.  Graphs over time: densification laws, shrinking diameters and possible explanations , 2005, KDD '05.

[40]  Vladimir Batagelj,et al.  An O(m) Algorithm for Cores Decomposition of Networks , 2003, ArXiv.

[41]  Hisao Tamaki,et al.  Greedily Finding a Dense Subgraph , 2000, J. Algorithms.

[42]  Kazuhisa Makino,et al.  New Algorithms for Enumerating All Maximal Cliques , 2004, SWAT.

[43]  Charalampos E. Tsourakakis The K-clique Densest Subgraph Problem , 2015, WWW.

[44]  M. Trick,et al.  Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, Workshop, October 11-13, 1993 , 1996 .

[45]  David S. Johnson,et al.  Cliques, Coloring, and Satisfiability , 1996 .

[46]  Takeaki Uno,et al.  An Efficient Algorithm for Solving Pseudo Clique Enumeration Problem , 2008, Algorithmica.

[47]  J. Håstad Clique is hard to approximate within n^(1−ε) , 1996 .

[48]  Uriel Feige,et al.  The Dense k -Subgraph Problem , 2001, Algorithmica.

[49]  Sandra Sudarsky,et al.  Massive Quasi-Clique Detection , 2002, LATIN.

[50]  Sergei Vassilvitskii,et al.  Densest Subgraph in Streaming and MapReduce , 2012, Proc. VLDB Endow..

[51]  Charu C. Aggarwal,et al.  A Survey of Algorithms for Dense Subgraph Discovery , 2010, Managing and Mining Graph Data.

[52]  Gary D. Bader,et al.  An automated method for finding molecular complexes in large protein interaction networks , 2003, BMC Bioinformatics.

[53]  Mikkel Thorup,et al.  Spanners and emulators with sublinear distance errors , 2006, SODA '06.

[54]  Shang-Hua Teng,et al.  Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems , 2003, STOC '04.

[55]  Mihail N. Kolountzakis,et al.  Triangle Sparsifiers , 2011, J. Graph Algorithms Appl..

[56]  Charalampos E. Tsourakakis,et al.  Colorful triangle counting and a MapReduce implementation , 2011, Inf. Process. Lett..

[57]  Aditya Bhaskara,et al.  Detecting high log-densities: an O(n¼) approximation for densest k-subgraph , 2010, STOC '10.

[58]  Charalampos E. Tsourakakis,et al.  Space- and Time-Efficient Algorithm for Maintaining Dense Subgraphs on One-Pass Dynamic Streams , 2015, STOC.