Minimum spanning tree based split-and-merge: A hierarchical clustering method

Most clustering algorithms become ineffective when given unsuitable parameters or applied to datasets composed of clusters with diverse shapes, sizes, and densities. To alleviate these deficiencies, we propose a novel split-and-merge hierarchical clustering method in which a minimum spanning tree (MST) and an MST-based graph guide the splitting and merging processes. In the splitting process, vertices with high degrees in the MST-based graph are selected as initial prototypes, and K-means is used to split the dataset. In the merging process, subgroup pairs are filtered so that only neighboring pairs are considered for merging. The proposed method requires no parameters other than the number of clusters. Experimental results demonstrate its effectiveness on both synthetic and real datasets.
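The splitting step and the neighbor filter described above can be illustrated with a minimal sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: a single Euclidean MST stands in for the MST-based graph, the number of prototypes `m` is picked by hand, two subgroups count as "neighboring" when an MST edge joins them, and all function names are hypothetical. The merge criterion itself is not specified in the abstract and is not implemented.

```python
import math

def prim_mst(points):
    """Edges (i, j) of a Euclidean MST via Prim's algorithm, O(n^2)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n          # cheapest known distance to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def high_degree_prototypes(points, edges, m):
    """Pick the m vertices with the highest degree in the graph (assumed rule)."""
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    order = sorted(range(len(points)), key=lambda i: -degree[i])
    return [points[i] for i in order[:m]]

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def kmeans(points, centers, iters=20):
    """Plain Lloyd iterations started from the given prototypes."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:            # keep the old center if a subgroup is empty
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign(points, centers), centers

def neighboring_pairs(edges, labels):
    """Subgroup pairs joined by at least one graph edge (assumed neighbor test)."""
    pairs = set()
    for i, j in edges:
        a, b = labels[i], labels[j]
        if a != b:
            pairs.add((min(a, b), max(a, b)))
    return pairs

# Toy data: two well-separated squares.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
edges = prim_mst(points)
prototypes = high_degree_prototypes(points, edges, m=3)
labels, centers = kmeans(points, prototypes)
candidates = neighboring_pairs(edges, labels)   # only these pairs would be tested for merging
```

Restricting merge candidates to pairs that share a graph edge keeps the number of pairs examined well below all possible subgroup pairs, which is the point of the filtering step mentioned in the abstract.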
