Data clustering by minimizing disconnectivity

Identifying clusters of arbitrary shape remains a challenge in data clustering. We propose a new measure of cluster quality based on minimizing the penalty incurred when objects that should ideally be clustered together are disconnected. This disconnectivity is computed from an analysis of nearest neighbors, following the principle that an object should belong to the same cluster as its nearest neighbors. We present an algorithm, MinDisconnect, that heuristically minimizes disconnectivity, and report numerical results indicating that it effectively and robustly identifies clusters of complex, arbitrary shapes.
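
The nearest-neighbor principle can be illustrated with a small sketch. The abstract does not give the exact definition of disconnectivity, so the penalty below is an assumption made for illustration only: it counts each (object, neighbor) pair that a labeling places in different clusters. The function names `k_nearest` and `disconnectivity` and the parameter `k` are hypothetical, and the code does not implement MinDisconnect itself.

```python
from math import dist

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: dist(points[i], points[j]))
    return others[:k]

def disconnectivity(points, labels, k=2):
    """Assumed penalty: number of (object, neighbor) pairs split across clusters."""
    penalty = 0
    for i in range(len(points)):
        for j in k_nearest(points, i, k):
            if labels[i] != labels[j]:
                penalty += 1
    return penalty

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]   # labeling that follows the two groups
bad = [0, 1, 0, 1, 0, 1]    # labeling that cuts across both groups

print(disconnectivity(points, good))  # 0
print(disconnectivity(points, bad))   # 8
```

Under this assumed penalty, the labeling that keeps every object with its nearest neighbors scores zero, while the labeling that separates neighbors is penalized. Note that this toy count is also zero when all objects share a single cluster, so it is only meaningful when comparing labelings with a fixed number of clusters or when combined with further terms.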