Mining Arbitrary Shaped Clusters and Outputting a High Quality Dendrogram

Hierarchical clustering (HC) outputs a dendrogram that offers more topological information than flat clustering (e.g., k-means). However, existing HC algorithms focus on either the quality of the dendrogram or the ability to mine arbitrary shaped clusters, not both. To address these two aspects simultaneously, we present HICMEN, which adopts (1) the classic agglomerative clustering framework, which can generate a complete dendrogram, and (2) a novel similarity measure based on mutual k-nearest neighbors, which captures the connectivity of data points and helps merge each arbitrary shaped cluster properly, piece by piece. More importantly, we prove that the similarity measure has a property called weak monotonicity, which guarantees the quality of the dendrogram generated by HICMEN. Extensive experimental results show that HICMEN mines arbitrary shaped clusters effectively and simultaneously outputs a high quality dendrogram.
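The abstract does not define the similarity measure itself, so the following is only a minimal sketch of the underlying mutual k-nearest-neighbor relation: two points are linked when each lies in the other's k-nearest-neighbor set. The function names `knn_sets` and `mutual_knn_pairs`, the Euclidean distance, and the toy data are illustrative assumptions, not part of HICMEN.

```python
from math import dist


def knn_sets(points, k):
    """For each point index, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort the other points by Euclidean distance; break ties by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def mutual_knn_pairs(points, k):
    """Return pairs (i, j), i < j, where i and j are in each other's k-NN sets."""
    nn = knn_sets(points, k)
    return {
        (i, j)
        for i in range(len(points))
        for j in nn[i]
        if i < j and i in nn[j]
    }


# Toy data (assumed): three nearby points on a line and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
print(mutual_knn_pairs(points, k=2))
```

In this toy example the distant point counts two of the nearby points among its nearest neighbors, but none of them counts it in return, so it ends up with no mutual link. This asymmetry check is what lets a mutual k-NN relation reflect connectivity rather than raw distance alone.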
