Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

Clustering validation is one of the most important and challenging parts of cluster analysis, as no ground truth is available against which the results can be compared. To date, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, and to compare results across clustering schemes. In practice, it is also often important to build a model on a large amount of training data and then apply it repeatedly to smaller batches of new data, which amounts to assigning new data points to clusters constructed on the training set. However, very little practical guidance is available on measuring how well the constructed model predicts cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of a clustering model in predicting cluster membership for new data points. The performance score is measured as the root mean squared error (RMSE) computed from the multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. Tested on three benchmark multi-label datasets, the clustering model showed promising results, with an overall RMSE below 0.075 and a MAPE below 12.5% on all three datasets.
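The evaluation procedure described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact protocol: the synthetic data, the number of clusters and principal components, and the use of each cluster's mean training-label profile as the prediction for held-out points are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for a multi-label dataset: feature matrix X
# and a binary label matrix Y (one column per label).
X = rng.normal(size=(200, 10))
Y = (rng.random((200, 3)) < 0.4).astype(float)

# Split into a training fold and a held-out (testing) fold.
X_tr, X_te = X[:150], X[150:]
Y_tr, Y_te = Y[:150], Y[150:]

# Build the clustering model on the training fold only:
# PCA for dimensionality reduction, then k-means on the reduced space.
n_clusters = 4
pca = PCA(n_components=5).fit(X_tr)
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(
    pca.transform(X_tr)
)

# Summarize each cluster by the mean label vector of its training members
# (an assumed choice of per-cluster label profile for this sketch).
profiles = np.vstack(
    [Y_tr[km.labels_ == c].mean(axis=0) for c in range(n_clusters)]
)

# Assign held-out points to the nearest trained centroid, use the
# cluster's label profile as the prediction, and score with RMSE.
pred = profiles[km.predict(pca.transform(X_te))]
rmse = np.sqrt(np.mean((pred - Y_te) ** 2))
```

On real benchmark data the same loop would be repeated over cross-validation folds and the RMSE (and, where the labels permit, MAPE) averaged across folds.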
