Outlier factor based partitional clustering analysis with constraints discovery and representative objects generation

As a classical partitional clustering algorithm, k-means is sensitive to its initial centroids and may perform poorly on datasets whose clusters differ in scale and density. To improve the effectiveness of k-means, this paper presents an outlier factor based partitional clustering analysis method. An outlier factor usually indicates the degree to which an object is abnormal in the dataset. In the proposed method it is used instead to find the core objects, and Must-link constraints are then generated to place neighboring core objects in the same cluster. First, a similar-density-array-based outlier factor is proposed to find the core objects in the dataset. The neighboring core objects are then grouped into the same sub-cluster. The sub-clusters are treated as representative objects, which are clustered following the process of the traditional k-means algorithm. Finally, each non-core object is assigned to its nearest cluster. Experiments are performed on four datasets from the UCI Machine Learning Repository and on a field dataset from a ball mill pulverizing system. The experimental results demonstrate the effectiveness of the proposed algorithm.

Highlights:
- A new method combines an outlier factor with the k-means algorithm so that local information is used effectively.
- The proposed method is compared with two classical constrained k-means algorithms.
- Besides UCI datasets, a field dataset from a ball mill pulverizing system is used in the experiments.
- An outlier factor is proposed in the paper and is compared with LOF and COF in the experiments.
- The impact of the outlier factors' parameter is tested and discussed in detail in the experiments.
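The four stages described in the abstract (score objects, link neighboring core objects, cluster the representatives, assign the remaining objects) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not define the similar-density-array-based outlier factor, so the sketch substitutes a plain mean k-nearest-neighbor distance as a stand-in score. The function names, the `core_fraction` and `radius` parameters, the use of sub-cluster means as representatives, and the nearest-centroid rule for non-core objects are all assumptions made for illustration.

```python
import numpy as np


def knn_outlier_score(X, k):
    """Stand-in outlier factor (NOT the paper's): mean distance to the k nearest neighbours."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # ignore each object's distance to itself
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    return score, dist


def link_core_objects(dist, core_idx, radius):
    """Must-link core objects closer than `radius`; return one sub-cluster id per core object."""
    parent = {i: i for i in core_idx}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(core_idx):
        for b in core_idx[pos + 1:]:
            if dist[a, b] <= radius:
                parent[find(a)] = find(b)  # merge the two groups

    roots = sorted({find(i) for i in core_idx})
    root_id = {r: n for n, r in enumerate(roots)}
    return np.array([root_id[find(i)] for i in core_idx])


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns the final centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids


def outlier_guided_kmeans(X, n_clusters, k=5, core_fraction=0.8, radius=1.0, seed=0):
    """Hypothetical end-to-end sketch of the pipeline outlined in the abstract."""
    X = np.asarray(X, dtype=float)
    score, dist = knn_outlier_score(X, k)

    # Core objects: the objects with the lowest (least abnormal) scores.
    n_core = max(n_clusters, int(core_fraction * len(X)))
    core_idx = np.argsort(score)[:n_core].tolist()

    # Neighbouring core objects are linked into sub-clusters.
    sub = link_core_objects(dist, core_idx, radius)
    core_points = X[core_idx]
    reps = np.array([core_points[sub == s].mean(axis=0) for s in range(sub.max() + 1)])
    if len(reps) < n_clusters:
        raise ValueError("fewer sub-clusters than requested clusters; lower `radius`")

    # Representatives are clustered with ordinary k-means.
    centroids = kmeans(reps, n_clusters, seed=seed)
    rep_label = np.linalg.norm(reps[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)

    # Non-core objects go to the nearest centroid; core objects follow their representative.
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    labels[core_idx] = rep_label[sub]
    return labels
```

Calling `outlier_guided_kmeans(X, n_clusters=2, radius=1.5)` on an `(n, d)` array returns one integer label per row. Because linked core objects always share the label of their representative, the must-link grouping is preserved even when the k-means step runs on only a handful of representatives.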
