A Meta-learning approach for recommending the number of clusters for clustering algorithms

Abstract: One of the main challenges in clustering analysis is choosing the optimal number of clusters. A typical methodology is to evaluate a validity index over the data and optimize it as a function of the number of clusters. However, this process can have a high computational cost, since the data must be clustered once for every candidate number of clusters. In this work, we introduce a new approach that uses meta-learning to recommend the number of clusters for a particular dataset. Because the predictive performance of the meta-models induced by meta-learning depends on how datasets are described by meta-features, we propose a new set of meta-features that improves the predictive performance of meta-models used for recommending the number of clusters. Experimental results show that the proposed approach provides a good recommendation of the number of clusters. Additionally, the proposed meta-features obtain better results than the meta-features for clustering tasks found in the literature.
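
The baseline methodology mentioned in the abstract (evaluate a validity index for each candidate number of clusters and keep the best) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy dataset, the choice of the silhouette index as the validity index, the simple k-means with a deterministic farthest-point initialisation, and the helper names `kmeans` and `silhouette`. It also shows where the cost comes from: the data are clustered once per candidate k, and the index itself needs all pairwise distances.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point start (illustrative only)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1 or len(clusters) < 2:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in c) / len(c)
                for m, c in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (an assumption): three well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 0), (10, 1), (11, 0), (11, 1),
        (5, 9), (5, 10), (6, 9), (6, 10)]

# One full clustering run per candidate k: this loop is the expensive part.
scores = {k: silhouette(data, kmeans(data, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

On this toy dataset the silhouette index peaks at the true number of groups. A meta-learning recommender, as proposed in the abstract, would instead predict the number of clusters from meta-features of the dataset, avoiding the loop over candidate values.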
