A hierarchical clustering algorithm and an improvement of the single linkage criterion to deal with noise

Abstract Hierarchical clustering is widely used in data mining. The single linkage criterion is powerful, as it can handle clusters of various shapes and densities, but it is sensitive to noise [1]. Two improvements are proposed in this work to deal with noise. First, the single linkage criterion is modified to take local density into account, ensuring that the inter-group distance is computed between core points of each group. Second, once representative clusters, i.e. those larger than a minimum size, have been identified, the hierarchical algorithm forbids them from merging with one another. The experiments include a sensitivity analysis of the parameters and a comparison of the available criteria on datasets known in the literature. This comparison showed that local criteria yield better results than global ones. The three single linkage criteria were then compared in more challenging situations, which highlighted the complementarity between the two levels of improvement: the criterion and the clustering algorithm.
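The two improvements described above can be illustrated with a minimal sketch. The following Python code is not the paper's actual method: the density estimate (k-nearest-neighbour radius compared to its median), the quantile threshold, and the `min_size` freeze rule are illustrative assumptions standing in for the paper's definitions of core points and representative clusters.

```python
import numpy as np

def pairwise(X):
    """Full Euclidean distance matrix (fine for small illustrative datasets)."""
    return np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))

def core_mask(X, k=2, q=0.5):
    """Mark points as 'core' when their k-NN radius is below the q-quantile.

    This is an assumed density proxy, not the paper's definition."""
    d = pairwise(X)
    r = np.sort(d, axis=1)[:, k]          # distance to the k-th neighbour (row includes self at 0)
    return r <= np.quantile(r, q)

def linkage_dist(d, A, B, core):
    """Single-linkage distance restricted to core points of each group,
    falling back to all points when a group has no core point."""
    a = [i for i in A if core[i]] or A
    b = [j for j in B if core[j]] or B
    return min(d[i, j] for i in a for j in b)

def cluster(X, n_clusters=2, min_size=4, k=2):
    """Naive agglomerative loop with the two improvements sketched in."""
    d = pairwise(X)
    core = core_mask(X, k)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Freeze rule: two representative clusters may not merge.
                if len(clusters[i]) >= min_size and len(clusters[j]) >= min_size:
                    continue
                dist = linkage_dist(d, clusters[i], clusters[j], core)
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        if best is None:                   # every remaining merge is forbidden
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On two compact groups plus one isolated point, the isolated point is not a core point, so it cannot pull the single-linkage distance between groups; and once both groups exceed `min_size`, the freeze rule prevents them from merging, so the noise point ends up absorbed by the nearest group instead of bridging the two.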

[1] Fionn Murtagh et al., Algorithms for hierarchical clustering: an overview, 2012, WIREs Data Mining Knowl. Discov.

[2] John W. Tukey et al., Exploratory Data Analysis, 1979.

[3] Hui Xiong et al., Understanding of Internal Clustering Validation Measures, 2010, 2010 IEEE International Conference on Data Mining.

[4] Ralph Weischedel et al., Performance Measures for Information Extraction, 2007.

[5] Serge Guillaume et al., ProTraS: A probabilistic traversing sampling algorithm, 2018, Expert Syst. Appl.

[6] David G. Stork et al., Pattern Classification, 1973.

[7] Limin Fu et al., FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data, 2007, BMC Bioinformatics.

[8] Cor J. Veenman et al., A Maximum Variance Cluster Algorithm, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[9] Marcílio Carlos Pereira de Souto et al., Impact of Base Partitions on Multi-objective and Traditional Ensemble Clustering Algorithms, 2015, ICONIP.

[10] Charu C. Aggarwal et al., On the Surprising Behavior of Distance Metrics in High Dimensional Spaces, 2001, ICDT.

[11] Serge Guillaume et al., DENDIS: A new density-based sampling for clustering algorithm, 2016, Expert Syst. Appl.

[12] William M. Rand et al., Objective Criteria for the Evaluation of Clustering Methods, 1971.

[13] Harry Joe et al., Separation index and partial membership for clustering, 2006, Comput. Stat. Data Anal.

[14] Thomas M. Cover et al., Elements of Information Theory, 2005.

[15] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, 1987.

[16] Jong-Seok Lee et al., Data clustering by minimizing disconnectivity, 2011, Inf. Sci.

[17] Vipin Kumar et al., Chameleon: Hierarchical Clustering Using Dynamic Modeling, 1999, Computer.

[18] Roberto Bellotti, Hausdorff Clustering, 2008, Phys. Rev. E.

[19] Harry Joe et al., Generation of Random Clusters with Specified Degree of Separation, 2006, J. Classif.

[20] Serge Guillaume et al., DIDES: a fast and effective sampling for clustering algorithm, 2017, Knowledge and Information Systems.