pm-SCAN: an I/O Efficient Structural Clustering Algorithm for Large-scale Graphs

Most existing graph clustering algorithms, including SCAN, are not designed to handle graphs too large to fit in main memory. When memory is insufficient, these algorithms thrash, incurring huge I/O costs. We propose pm-SCAN, an I/O-efficient algorithm for structural clustering. The main idea of our scheme is to partition a large graph into several subgraphs, each of which fits into main memory. We first find clusters in each subgraph and then merge them to produce the final clustering of the input graph. Experimental results show that, whereas existing algorithms do not scale with graph size, the proposed method scales well under limited memory.
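The partition-then-merge idea can be illustrated with a minimal sketch. This is not the pm-SCAN implementation: the function names (`similarity`, `cluster_partitioned`), the union-find merge, and the use of an in-memory dictionary `adj` as a stand-in for the on-disk graph are all assumptions made for illustration. It uses the standard SCAN definitions (structural similarity over closed neighborhoods, with thresholds `eps` and `mu`), detects cores and links them within each partition first, and defers edges that cross partitions to a final merge step.

```python
from math import sqrt

def similarity(adj, u, v):
    """SCAN structural similarity over closed neighborhoods."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def cluster_partitioned(adj, parts, eps, mu):
    """Hypothetical sketch: cluster each partition, then merge across partitions.

    adj   : dict vertex -> set of neighbors (stand-in for the graph on disk)
    parts : list of vertex lists, each assumed small enough for main memory
    """
    part_of = {v: i for i, part in enumerate(parts) for v in part}
    eps_nbrs, cores, parent, cut_edges = {}, set(), {}, []

    # Phase 1: process one partition at a time.
    for i, part in enumerate(parts):
        for u in part:
            eps_nbrs[u] = {w for w in adj[u] if similarity(adj, u, w) >= eps}
            # SCAN counts the vertex itself in its eps-neighborhood.
            if len(eps_nbrs[u]) + 1 >= mu:
                cores.add(u)
                parent[u] = u
        for u in part:
            if u not in cores:
                continue
            for w in eps_nbrs[u]:
                if part_of[w] != i:
                    cut_edges.append((u, w))   # defer to the merge phase
                elif w in cores:
                    union(parent, u, w)        # local cluster growth

    # Phase 2: merge local clusters through edges that cross partitions.
    for u, w in cut_edges:
        if w in cores:
            union(parent, u, w)

    # Collect clusters; non-core eps-neighbors of cores join as border vertices.
    # (Unlike SCAN, a border vertex may appear in more than one cluster here.)
    clusters = {}
    for v in cores:
        clusters.setdefault(find(parent, v), set()).add(v)
    for v in cores:
        clusters[find(parent, v)] |= eps_nbrs[v]
    return sorted(sorted(c) for c in clusters.values())

# Two triangles joined by a bridge edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
result = cluster_partitioned(adj, [[0, 1, 3], [2, 4, 5]], eps=0.7, mu=3)
```

In this toy graph the partitioning deliberately splits both triangles, so the two clusters are recovered only after the merge phase. A real external-memory implementation would also have to bound the I/O needed to read neighbor lists that live in other partitions, which this sketch ignores.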
