k-Nearest Neighbor based Clustering with Shape Alternation Adaptivity

Existing clustering algorithms aim to identify clusters in a single dataset. However, many applications generate a series of datasets. For example, scientists repeat an experiment many times to ensure reproducibility, and sensors collect information day after day. In such scenarios, clusters must be identified separately in a large number of datasets, each of which can contain an unknown number of clusters with various densities and shapes.

Density-based clustering algorithms are commonly used to identify arbitrarily shaped clusters when the number of clusters is unknown. Most density-based clustering algorithms are "DBSCAN-like": clusters are formed by connecting consecutive high-density regions, so points are grouped into one cluster as long as they are densely connected. When the shape of the point distribution changes across datasets, parameters must be tuned on each dataset to obtain proper results, which is time-consuming.

In this work, we developed a new kNN density-based clustering algorithm that does not adopt the DBSCAN paradigm. Instead, we identify clusters by maximizing the intra-cluster similarities, which are estimated using: 1) the probability that two points belong to the same cluster; 2) the probability that a point is a cluster center. The kNN concept and a minimum spanning tree are used to compute both probabilities. Our approach is capable of extracting clusters of arbitrary shape using the single parameter k, and can handle a series of datasets with less parameter tuning effort. Experiments on both synthetic and real-world datasets show that our approach outperforms other recent kNN clustering algorithms.
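The abstract names two building blocks, the kNN concept and a minimum spanning tree, without giving formulas for the two probabilities. The following is only a minimal sketch of those building blocks, not the paper's algorithm: the density estimate (inverse mean distance to the k nearest neighbors), the use of Prim's algorithm on the complete Euclidean graph, and the function names `knn_density` and `minimum_spanning_tree` are all assumptions made for illustration.

```python
import math


def knn_density(points, k):
    """Assumed density estimate: 1 / mean distance to the k nearest neighbors.

    Returns (densities, neighbors), where neighbors[i] lists the indices
    of the k points closest to points[i].
    """
    densities, neighbors = [], []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        nearest = dists[:k]
        mean_d = sum(d for d, _ in nearest) / len(nearest)
        densities.append(1.0 / mean_d if mean_d > 0 else math.inf)
        neighbors.append([j for _, j in nearest])
    return densities, neighbors


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of edges (i, j, weight). Runs in O(n^2) time.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the not-yet-connected point that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, best[u]))
        # Relax the attachment cost of every remaining point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges
```

For two well-separated groups of points, the heaviest edge of the resulting tree is the one that bridges the groups, while a point surrounded by close neighbors receives a high kNN density. These are the kinds of signals that a pairwise same-cluster probability and a cluster-center probability could be derived from, although the abstract does not specify how the authors do so.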
