Unique Neighborhood Set Parameter Independent Density-Based Clustering With Outlier Detection

Machine learning algorithms for clustering, classification, and regression typically require the user to supply a set of parameters before they can perform well. In this paper, we present parameter independent density-based clustering (PIDC) algorithms built on two novel neighborhood concepts, which we term the unique closest neighbor and the unique neighborhood set. We discuss two variants of the proposed algorithms, PIDC-WO and PIDC-O. PIDC-WO is designed for data sets that do not contain explicit outliers, whereas PIDC-O performs well even on data sets that contain outliers. PIDC-O uses two-stage processing: the first stage identifies and removes outliers, and the second stage performs density-based clustering on the remaining records. The PIDC algorithms are extensively evaluated and compared with other well-known clustering algorithms on several data sets using three cluster evaluation criteria from the literature (F-measure, entropy, and purity), and are shown to be effective for both clustering and outlier detection.
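As an illustration of two of the evaluation criteria named above, the following minimal sketch computes purity and entropy for a clustering against ground-truth class labels. It assumes the common textbook definitions (purity as the fraction of records that belong to the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy in bits); the abstract does not state the exact formulas used in the paper, and the function names here are hypothetical. F-measure is omitted because its clustering variant has several definitions in the literature.

```python
from collections import Counter
from math import log2


def class_counts_per_cluster(labels_true, labels_pred):
    """Map each predicted cluster id to a Counter of true class labels."""
    clusters = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        clusters.setdefault(cluster_id, Counter())[true_label] += 1
    return clusters


def purity(labels_true, labels_pred):
    """Fraction of records in the majority class of their cluster (assumed definition)."""
    clusters = class_counts_per_cluster(labels_true, labels_pred)
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted mean of per-cluster class entropy in bits (assumed definition)."""
    clusters = class_counts_per_cluster(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for counts in clusters.values():
        size = sum(counts.values())
        cluster_entropy = -sum((k / size) * log2(k / size) for k in counts.values())
        total += (size / n) * cluster_entropy
    return total


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    truth = ["a", "a", "a", "b", "b", "b"]
    predicted = [0, 0, 1, 1, 1, 1]
    print(purity(truth, predicted))
    print(entropy(truth, predicted))
```

Higher purity and lower entropy both indicate clusters that align more closely with the known classes; a clustering that reproduces the classes exactly has purity 1 and entropy 0.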
