Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering

Grouping the objects based on their similarities is an important common task in machine learning applications. Many clustering methods have been developed, among them k-means based clustering methods have been broadly used and several extensions have been developed to improve the original k-means clustering method such as k-means ++ and kernel k-means. K-means is a linear clustering method; that is, it divides the objects into linearly separable groups, while kernel k-means is a non-linear technique. Kernel k-means projects the elements to a higher dimensional feature space using a kernel function, and then groups them. Different kernel functions may not perform similarly in clustering of a data set and, in turn, choosing the right kernel for an application could be challenging. In our previous work, we introduced a weighted majority voting method for clustering based on normalized mutual information (NMI). NMI is a supervised method where the true labels for a training set are required to calculate NMI. In this study, we extend our previous work of aggregating the clustering results to develop an unsupervised weighting function where a training set is not available. The proposed weighting function here is based on Silhouette index, as an unsupervised criterion. As a result, a training set is not required to calculate Silhouette index. This makes our new method more sensible in terms of clustering concept.

[1]  Rich Caruana,et al.  Consensus Clusterings , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[2]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .

[3]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[4]  Ka Yee Yeung,et al.  Validating clustering for gene expression data , 2001, Bioinform..

[5]  James C. Bezdek,et al.  Validity-guided (re)clustering with applications to image segmentation , 1996, IEEE Trans. Fuzzy Syst..

[6]  Inderjit S. Dhillon,et al.  Kernel k-means: spectral clustering and normalized cuts , 2004, KDD.

[7]  José Luis Rojo-Álvarez,et al.  Kernel Methods in Bioengineering, Signal And Image Processing , 2007 .

[8]  Piotr A. Kowalski,et al.  Complete Gradient Clustering Algorithm for Features Analysis of X-Ray Images , 2010 .

[9]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[10]  M. Cugmas,et al.  On comparing partitions , 2015 .

[11]  Junjie Wu,et al.  Advances in K-means clustering: a data mining thinking , 2012 .

[12]  Jill P. Mesirov,et al.  Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data , 2003, Machine Learning.

[13]  Nezamoddin N. Kachouie,et al.  Weighted Mutual Information for Aggregated Kernel Clustering , 2020, Entropy.

[14]  José Luis Rojo-Álvarez,et al.  An Introduction to Kernel Methods , 2009, Encyclopedia of Data Warehousing and Mining.

[15]  James M. Keller,et al.  Fuzzy Models and Algorithms for Pattern Recognition and Image Processing , 1999 .