A novel data clustering algorithm using heuristic rules based on k-nearest neighbors chain

Abstract In practice, clustering algorithms often struggle with the complex structure of datasets, including their distribution and dimensionality. Moreover, the number of clusters, which is required as an input, is usually unknown in advance. In this paper, we propose a novel data clustering algorithm that uses heuristic rules based on a k-nearest neighbors chain and does not require the number of clusters as an input parameter. Inspired by the PageRank algorithm, we first use a random walk model to measure the importance of data points. Then, based on the important data points, we build a K-Nearest Neighbors Chain (KNNC), which orders the k nearest neighbors by distance, and propose two heuristic rules to find the proper number of clusters and the initial clusters. The first rule is the gap of the KNNC, which reflects the degree of separation between convex-shaped clusters; the second is the nearest-neighbor gap of the KNNC, which reflects the inner compactness of a cluster. Comprehensive comparisons on synthetic and real datasets indicate that the proposed algorithm finds the proper number of clusters and achieves performance comparable to or better than popular clustering algorithms.
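To make the two building blocks named in the abstract more concrete, the following is a minimal Python sketch, assuming a standard PageRank-style random walk over a directed k-NN graph and a simple "gap" defined as the difference between consecutive sorted neighbor distances. The function names, the damping factor, and these gap definitions are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of the two building blocks described in the abstract:
# (1) PageRank-style importance of points on a k-NN graph, and
# (2) distance gaps along a k-nearest-neighbor chain (KNNC).
# The exact definitions in the paper may differ; this is illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k):
    """Return (distances, indices) of the k nearest neighbors of each point."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)
    return dist[:, 1:], idx[:, 1:]          # drop the self-neighbor in column 0

def pagerank_importance(idx, damping=0.85, iters=100):
    """Random-walk (PageRank-like) importance over the directed k-NN graph."""
    n, k = idx.shape
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        new_rank = np.full(n, (1.0 - damping) / n)
        for i in range(n):
            share = damping * rank[i] / k    # spread rank equally among neighbors
            new_rank[idx[i]] += share
        rank = new_rank
    return rank

def knnc_gaps(dist):
    """Gaps between consecutive neighbor distances along each point's chain."""
    return np.diff(np.sort(dist, axis=1), axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
    dist, idx = knn_graph(X, k=5)
    importance = pagerank_importance(idx)
    gaps = knnc_gaps(dist)
    print("most important point:", int(importance.argmax()))
    print("largest neighbor gap:", float(gaps.max()))
```

In such a sketch, a large gap between consecutive neighbor distances would hint at a boundary between separated clusters, which is the intuition the abstract attributes to the two heuristic rules.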
