KdMutual: A novel clustering algorithm combining mutual neighboring and hierarchical approaches using a new selection criterion

Abstract New clustering algorithms are expected to handle complex data, i.e., clusters of various shapes and densities, while remaining user-friendly. This work addresses that challenge with a new clustering algorithm, KdMutual, driven by the number of clusters. The algorithm rests on the assumption that working with cluster cores, rather than with cluster frontiers, makes the clustering process easier. KdMutual proceeds in three steps. The first identifies the potential cluster cores: it relies on mutual neighborhood and includes specific mechanisms to detect and preserve them. The second is a constrained hierarchical process that deals with noise. In the last step, the potential clusters are ranked using a dedicated criterion and the final partition is built. KdMutual combines the best characteristics of density-peaks and connectivity-based approaches, and it is able to detect the absence of natural clusters. Experiments compared the proposal with 14 other clustering algorithms. On 2-dimensional benchmark datasets of various shapes and densities, KdMutual proved highly effective at matching the ground-truth partition. It also proved efficient in high dimensions when clusters are well separated. Moreover, it can identify clusters of varying densities, partially overlapping and contaminated by a large amount of noise, in spaces of moderate dimension.
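To make the mutual-neighborhood notion underlying the first step concrete, here is a minimal, illustrative sketch in Python: two points are mutual k-nearest neighbors when each belongs to the other's k-nearest-neighbor set. This is only the generic concept the abstract refers to, not the authors' actual implementation; the function name `mutual_knn_pairs` and the brute-force distance computation are assumptions for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def mutual_knn_pairs(points, k):
    """Return pairs (i, j), i < j, that are mutual k-nearest neighbors.

    Illustrative sketch only: KdMutual's core-identification step adds
    further mechanisms on top of this basic reciprocity test.
    """
    n = len(points)
    # For each point, the index set of its k nearest neighbors (itself excluded).
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist(points[i], points[j]))
        knn.append(set(order[:k]))
    # Keep only pairs whose neighbor relation is reciprocal.
    return [(i, j) for i in range(n) for j in sorted(knn[i])
            if j > i and i in knn[j]]

# Two compact groups: {0, 1, 2} near the origin and {3, 4} near (5, 5).
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(mutual_knn_pairs(pts, 2))  # → [(0, 1), (0, 2), (1, 2), (3, 4)]
```

Mutual pairs only form inside each compact group, never across the gap, which is why reciprocity is a convenient building block for isolating cluster cores before any frontier points are assigned.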
