A Distance and Density-Based Clustering Algorithm Using Automatic Peak Detection

Distance-based and density-based clustering algorithms are widely used on large spatial data sets and on data sets of arbitrary shape. However, several well-known clustering algorithms struggle when the distribution of objects in the data set varies, which can lead to poor clustering results. The degradation is more pronounced on high-dimensional data sets. Recently, Rodriguez and Laio proposed an efficient clustering algorithm based on two indicators, density and distance, which are used to find the cluster centers and therefore play a central role in the clustering process. However, that algorithm does not work well on high-dimensional data sets, because the threshold for selecting cluster centers is defined ambiguously and must be chosen manually by visual inspection. In this paper, an alternative definition of the indicators is introduced, and the threshold for cluster centers is determined automatically using an improved Canopy algorithm. Once the centers are fixed (each representing one cluster), every remaining data object is assigned to a cluster in a single step. The performance of the algorithm is analyzed on several benchmarks. The experimental results show that (1) on some high-dimensional data sets, e.g., intrusion detection data, the clustering performance is better; and (2) on low-dimensional data sets, the performance is as good as that of traditional clustering algorithms.
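To make the roles of the two indicators concrete, the following is a minimal Python sketch of the baseline density-peaks procedure of Rodriguez and Laio that the abstract builds on. It is not the algorithm proposed in this paper: the alternative indicator definitions and the improved Canopy thresholding are not specified in the abstract, so the sketch assumes a Gaussian-kernel density with a hypothetical cutoff `d_c` and simply takes the `n_centers` points with the largest product of density and distance as centers. The function name `density_peaks` and all parameter names are illustrative.

```python
import math


def density_peaks(points, d_c, n_centers):
    """Baseline density-peaks clustering (illustrative sketch only).

    points    : list of equal-length coordinate tuples
    d_c       : assumed cutoff distance for the Gaussian density kernel
    n_centers : assumed number of centers, standing in for an automatic threshold
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Density indicator: Gaussian-weighted count of neighbours.
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # Indices sorted from densest to sparsest.
    order = sorted(range(n), key=lambda i: -rho[i])

    # Distance indicator: distance to the nearest point of higher density.
    delta = [0.0] * n
    parent = [-1] * n  # nearest higher-density neighbour of each point
    delta[order[0]] = max(dist[order[0]])  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers combine high density with large distance. The densest point
    # is always kept as a center so every assignment chain ends at a center.
    gamma = [rho[i] * delta[i] for i in range(n)]
    gamma[order[0]] = math.inf
    centers = sorted(range(n), key=lambda i: -gamma[i])[:n_centers]

    # Single-pass assignment: each remaining point takes the label of its
    # nearest higher-density neighbour, which is already labelled.
    labels = [-1] * n
    for c, i in enumerate(centers):
        labels[i] = c
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers
```

A call such as `density_peaks(points, d_c=1.0, n_centers=2)` returns one label per point together with the indices of the chosen centers. The fixed `n_centers` argument is exactly the manual choice that the paper aims to replace with an automatically determined threshold.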
