SparkSCAN: A Structure Similarity Clustering Algorithm on Spark

Existing clustering algorithms for directed graphs suffer from high latency, resource exhaustion, and poor performance on iterative data processing. To address these problems, this paper proposes SparkSCAN, a distributed parallel structure similarity clustering algorithm on Spark. First, taking the interactions between nodes in the network into account, nodes with similar structure are clustered together. Second, to handle the large scale of directed graphs, a data structure suited to distributed graph computing is designed, and a distributed parallel clustering algorithm is built on the Spark framework, which improves processing performance while preserving the accuracy of the clustering results. Experimental results show that SparkSCAN performs well and can effectively cluster large-scale directed graphs.
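
To make the notion of structure similarity concrete, the following is a minimal single-machine sketch. It is not the SparkSCAN implementation. It assumes the similarity measure of the classic SCAN algorithm, namely the number of shared closed neighbors divided by the geometric mean of the two neighborhood sizes, and it ignores edge direction when building neighborhoods; the abstract does not state how SparkSCAN defines neighborhoods on directed graphs. The function names and the threshold parameter eps are illustrative.

```python
from collections import defaultdict
from math import sqrt


def closed_neighborhoods(edges):
    """Map each node to itself plus all adjacent nodes.

    Assumption: edge direction is ignored here; the paper may treat
    in-neighbors and out-neighbors differently.
    """
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].update((u, v))
        nbrs[v].update((u, v))
    return nbrs


def structural_similarity(nbrs, u, v):
    """Classic SCAN similarity: |N(u) & N(v)| / sqrt(|N(u)| * |N(v)|).

    Both u and v must be nodes that appear in the edge list.
    """
    shared = len(nbrs[u] & nbrs[v])
    return shared / sqrt(len(nbrs[u]) * len(nbrs[v]))


def similar_pairs(edges, eps):
    """Return the edges whose endpoints have similarity of at least eps."""
    nbrs = closed_neighborhoods(edges)
    result = {}
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        s = structural_similarity(nbrs, u, v)
        if s >= eps:
            result[(u, v)] = s
    return result


if __name__ == "__main__":
    # Toy directed graph: a triangle 1-2-3 with a pendant node 4.
    toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
    for pair, score in sorted(similar_pairs(toy_edges, eps=0.8).items()):
        print(pair, round(score, 3))
```

On this toy graph the three triangle edges pass a threshold of 0.8 while the pendant edge (3, 4) does not, which is the kind of grouping that structure similarity clustering builds on. In a distributed setting, the neighborhood map would be partitioned across workers rather than held in one dictionary.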