Automatic Clustering via Outward Statistical Testing on Density Metrics

Clustering is one of the research hotspots in the field of data mining and has extensive applications in practice. Recently, Rodriguez and Laio [1] published a clustering algorithm in Science that identifies clustering centers in an intuitive way and clusters objects efficiently and effectively. However, the algorithm is sensitive to a preassigned parameter and struggles to identify the "ideal" number of clusters. To overcome these shortcomings, this paper proposes a new clustering algorithm that detects the clustering centers automatically via statistical testing. Specifically, the proposed algorithm first defines a new density metric for each object that is more robust to the preassigned parameter, and then derives a metric to evaluate the centrality of each object. Afterwards, it identifies the objects with extremely large centrality metrics as the clustering centers via an outward statistical testing method. Finally, it assigns each remaining object to the cluster of its nearest neighbor with higher density. Extensive experiments are conducted on different kinds of clustering data sets to evaluate the performance of the proposed algorithm and to compare it with the algorithm published in Science. The results show the effectiveness and robustness of the proposed algorithm.
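To make the pipeline described above concrete, the following is a minimal Python sketch in the style of the density-peaks family of algorithms the abstract builds on. It is not the paper's method: it substitutes a Gaussian-kernel density for the paper's more parameter-robust density metric, and it replaces the outward statistical test with a fixed number of centers chosen by the largest centrality scores. The names density_peaks_sketch, dc, and n_centers are illustrative assumptions.

```python
import numpy as np

def density_peaks_sketch(X, dc=0.5, n_centers=3):
    """Density-peaks-style clustering sketch.

    X         : (n, d) array of points.
    dc        : kernel bandwidth (stand-in for the preassigned parameter).
    n_centers : number of centers kept; the paper instead selects centers
                automatically via an outward statistical test.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    # Density of each object (Gaussian kernel, self-contribution removed).
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0

    # delta: distance to the nearest object with higher density;
    # for the densest object, use its largest distance to any object.
    order = np.argsort(-rho)              # indices sorted by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    delta[order[0]] = dist[order[0]].max()
    for i, idx in enumerate(order[1:], start=1):
        higher = order[:i]                # all objects with higher density
        j = higher[np.argmin(dist[idx, higher])]
        delta[idx] = dist[idx, j]
        nearest_higher[idx] = j

    # Centrality score; centers are the objects with the largest scores.
    gamma = rho * delta
    centers = np.argsort(-gamma)[:n_centers]

    # Assign every remaining object, in decreasing order of density,
    # to the cluster of its nearest neighbor with higher density.
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for idx in order:
        if labels[idx] == -1:
            labels[idx] = labels[nearest_higher[idx]]
    return labels, centers

# Example usage on two synthetic blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centers = density_peaks_sketch(X, dc=1.0, n_centers=2)
```

The assignment loop works because objects are visited in decreasing density order, so each object's nearest higher-density neighbor is already labeled when the object is reached; the paper's contribution lies in the density metric and in replacing the manual choice of centers with an outward statistical test on the centrality scores.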

[1]  Spiros Mancoridis,et al.  Automatic clustering of software systems using a genetic algorithm , 1999, STEP '99. Proceedings Ninth International Workshop Software Technology and Engineering Practice.

[2]  Bevan K. Youse,et al.  Introduction to real analysis , 1972 .

[3]  Paul S. Bradley,et al.  Clustering via Concave Minimization , 1996, NIPS.

[4]  Pavel Berkhin,et al.  A Survey of Clustering Data Mining Techniques , 2006, Grouping Multidimensional Data.

[5]  T. Warren Liao,et al.  Clustering of time series data - a survey , 2005, Pattern Recognit..

[6]  Andrea Vattani,et al.  k-means Requires Exponentially Many Iterations Even in the Plane , 2008, SCG '09.

[7]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[8]  Vincent S. Tseng,et al.  A novel two-level clustering method for time series data analysis , 2010, Expert Syst. Appl..

[9]  D.M. Mount,et al.  An Efficient k-Means Clustering Algorithm: Analysis and Implementation , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[11]  James Bailey,et al.  Information theoretic measures for clusterings comparison: is a correction for chance necessary? , 2009, ICML '09.

[12]  O. Sourina,et al.  Free-parameters clustering of spatial data with non-uniform density , 2004, IEEE Conference on Cybernetics and Intelligent Systems, 2004..

[13]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[14]  Limin Fu,et al.  FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data , 2007, BMC Bioinformatics.

[15]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[16]  Brendan J. Frey,et al.  Non-metric affinity propagation for unsupervised image categorization , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[17]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[18]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[19]  Pietro Perona,et al.  Self-Tuning Spectral Clustering , 2004, NIPS.

[20]  A. Hoffman,et al.  Lower bounds for the partitioning of graphs , 1973 .

[21]  S. Foss,et al.  An Introduction to Heavy-Tailed and Subexponential Distributions , 2011 .

[22]  Mark Trede,et al.  Identifying multiple outliers in heavy-tailed distributions with an application to market crashes , 2008 .

[23]  Benjamin King Step-Wise Clustering Procedures , 1967 .

[24]  Wojciech A. Trybulec,  Pigeon Hole Principle , 1990 .

[25]  Pasi Fränti,et al.  Fast Agglomerative Clustering Using a k-Nearest Neighbor Graph , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Boris G. Mirkin,et al.  Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads , 2010, J. Classif..

[27]  Santanu Phadikar,et al.  Automatic Color Image Segmentation Using Spatial Constraint Based Clustering , 2014 .

[28]  K. Karteeka Pavan,et al.  An Automatic Clustering Technique for Optimal Clusters , 2011, ArXiv.

[29]  Ming Xie,et al.  Color clustering and learning for image segmentation based on neural networks , 2005, IEEE Trans. Neural Networks.

[30]  Aristides Gionis,et al.  Clustering aggregation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[31]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[32]  W. Rogers,et al.  Understanding some long-tailed symmetrical distributions , 1972 .

[33]  Shashi Shekhar,et al.  Clustering and Information Retrieval , 2011, Network Theory and Applications.

[34]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[35]  Catherine A. Sugar,et al.  Finding the number of clusters in a data set: An information theoretic approach , 2003 .

[36]  Sriparna Saha,et al.  A generalized automatic clustering algorithm in a multiobjective framework , 2013, Appl. Soft Comput..

[37]  Charles T. Zahn,et al.  Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters , 1971, IEEE Transactions on Computers.

[38]  Hong He,et al.  A two-stage genetic algorithm for automatic clustering , 2012, Neurocomputing.

[39]  Pasi Fränti,et al.  A Dynamic local search algorithm for the clustering problem , 2002 .

[40]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[41]  Swagatam Das,et al.  Automatic Clustering Using an Improved Differential Evolution Algorithm , 2007 .

[42]  Weihong Cui,et al.  A Novel Spatial Clustering Algorithm Based on Delaunay Triangulation , 2010, J. Softw. Eng. Appl..

[43]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[44]  Boris G. Mirkin,et al.  Choosing the number of clusters , 2011, Wiley Interdiscip. Rev. Data Min. Knowl. Discov..

[45]  Catherine A. Sugar,et al.  Finding the Number of Clusters in a Dataset , 2003 .

[46]  P. Sneath,et al.  Numerical Taxonomy , 1962, Nature.

[47]  Pasi Fränti,et al.  Iterative shrinking method for clustering problems , 2006, Pattern Recognit..

[48]  Rui Xu,et al.  Clustering Algorithms in Biomedical Research: A Review , 2010, IEEE Reviews in Biomedical Engineering.

[49]  Meirav Galun,et al.  Fundamental Limitations of Spectral Clustering , 2006, NIPS.

[50]  Cor J. Veenman,et al.  A Maximum Variance Cluster Algorithm , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[51]  Alexander Kolesnikov,et al.  Estimating the number of clusters in a numerical data set via quantization error modeling , 2015, Pattern Recognit..

[52]  Andrew B. Kahng,et al.  New spectral methods for ratio cut partitioning and clustering , 1991, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst..

[53]  Ramandeep Kaur,et al.  A Survey of Clustering Techniques , 2010 .

[54]  V. Estivill-Castro,et al.  Argument free clustering for large spatial point-data sets via boundary extraction from Delaunay Diagram , 2002 .

[55]  Sean Hughes,et al.  Clustering by Fast Search and Find of Density Peaks , 2016 .