Parallel Index-Based Structural Graph Clustering and Its Approximation

SCAN (Structural Clustering Algorithm for Networks) is a well-studied, widely used graph clustering algorithm. For large graphs, however, sequential SCAN variants are prohibitively slow, and parallel SCAN variants do not effectively share work among queries with different SCAN parameter settings. Since users of SCAN often explore many parameter settings to find good clusterings, it is worthwhile to precompute an index that speeds up queries. This paper presents a practical and provably efficient parallel index-based SCAN algorithm based on GS*-Index, a recent sequential algorithm. Our parallel algorithm improves upon the asymptotic work of the sequential algorithm by using integer sorting. It is also highly parallel: it achieves logarithmic span for both index construction and clustering queries. Furthermore, we apply locality-sensitive hashing (LSH) to design a novel approximate SCAN algorithm and prove guarantees for its clustering quality. We present an experimental evaluation of our parallel algorithms on large real-world graphs. On a 48-core machine with two-way hyper-threading, our parallel index construction achieves 50–151× speedup over the construction of GS*-Index. Even on a single thread, our index construction algorithm is faster than GS*-Index. Our parallel index query implementation achieves 5–32× speedup over GS*-Index queries across a range of SCAN parameter values, and our implementation is always faster than ppSCAN, a state-of-the-art parallel SCAN algorithm. Moreover, our experiments show that applying LSH results in much faster index construction on denser graphs without large sacrifices in clustering quality.
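
To make the "SCAN parameter settings" concrete, here is a minimal sequential sketch of the two quantities a SCAN query depends on, assuming the standard definitions from the original SCAN formulation (the abstract itself does not restate them): the structural similarity of adjacent vertices is the cosine similarity of their closed neighborhoods, and a vertex is a core for parameters (eps, mu) if at least mu vertices in its closed neighborhood have similarity at least eps to it. The function names and the example graph are hypothetical and only for illustration; this is not the paper's parallel or index-based algorithm.

```python
from math import sqrt


def closed_neighborhoods(edges):
    """Map each vertex to its closed neighborhood (its neighbors plus itself)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)
    return nbrs


def structural_similarity(nbrs, u, v):
    """Cosine similarity between the closed neighborhoods of u and v."""
    nu, nv = nbrs[u], nbrs[v]
    return len(nu & nv) / sqrt(len(nu) * len(nv))


def core_vertices(nbrs, eps, mu):
    """Vertices with at least mu eps-similar vertices in their closed neighborhood."""
    cores = set()
    for u, nu in nbrs.items():
        # The closed neighborhood contains u itself, whose similarity to u is 1.
        similar = sum(1 for v in nu if structural_similarity(nbrs, u, v) >= eps)
        if similar >= mu:
            cores.add(u)
    return cores


# Hypothetical example graph: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
nbrs = closed_neighborhoods(edges)
cores_mu3 = core_vertices(nbrs, eps=0.7, mu=3)
cores_mu4 = core_vertices(nbrs, eps=0.7, mu=4)
```

The similarities depend only on the graph, not on (eps, mu), so every query in this sketch repeats the same similarity computations. That repeated work is what a precomputed index is meant to avoid when many parameter settings are explored.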
