Model Selection for Semi-Supervised Clustering

Although a large and growing literature tackles the semi-supervised clustering problem (i.e., clustering guided by some labeled objects or by constraints such as "must-link" or "cannot-link"), the evaluation of semi-supervised clustering approaches has rarely been discussed. The application of cross-validation techniques, for example, is far from straightforward in the semi-supervised setting, and the associated evaluation problems have not yet been addressed. Here we summarize these problems and provide a solution. Furthermore, to demonstrate the practical applicability of semi-supervised clustering methods, we provide a method for model selection in semi-supervised clustering based on this sound evaluation procedure. Our method allows the user to select, based on the available information (labels or constraints), the most appropriate clustering model (e.g., number of clusters, density parameters) for a given problem.
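As a rough illustration of what label-based model selection can look like, the following minimal sketch cross-validates over the labeled objects and picks the parameter value whose clusterings best respect the must-link and cannot-link relations among the held-out objects. It is not the procedure of the paper: the toy 1-D clusterer `gap_clustering`, its parameter `eps`, the scoring function `pairwise_agreement`, and the fold layout are all assumptions made for this example, and the stand-in clusterer ignores the training labels that a real semi-supervised method would use.

```python
from itertools import combinations


def gap_clustering(points, eps, train_labels=None):
    """Hypothetical stand-in clusterer for 1-D data.

    Starts a new cluster whenever the gap between consecutive sorted points
    exceeds eps. A real semi-supervised method would use train_labels to
    guide the clustering; this toy version ignores them.
    """
    order = sorted(range(len(points)), key=lambda i: points[i])
    assignment = [0] * len(points)
    cluster = 0
    for prev, cur in zip(order, order[1:]):
        if points[cur] - points[prev] > eps:
            cluster += 1
        assignment[cur] = cluster
    return assignment


def pairwise_agreement(assignment, labels, held_out):
    """Fraction of held-out pairs whose must-link / cannot-link status
    (derived from the labels) is respected by the clustering."""
    pairs = list(combinations(held_out, 2))
    if not pairs:
        return 0.0
    respected = sum(
        (labels[i] == labels[j]) == (assignment[i] == assignment[j])
        for i, j in pairs
    )
    return respected / len(pairs)


def select_eps(points, labels, candidates, n_folds=2):
    """Return (best_eps, mean_score) over folds of the labeled objects."""
    labeled = sorted(labels)
    folds = [labeled[f::n_folds] for f in range(n_folds)]
    best_eps, best_score = None, -1.0
    for eps in candidates:
        scores = []
        for fold in folds:
            # Only labels outside the test fold are shown to the clusterer,
            # and test constraints involve held-out objects only (assumption
            # of this sketch, meant to avoid leaking test information).
            train = {i: labels[i] for i in labeled if i not in fold}
            assignment = gap_clustering(points, eps, train)
            scores.append(pairwise_agreement(assignment, labels, fold))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_eps, best_score = eps, mean_score
    return best_eps, best_score


# Toy data: three well-separated groups on a line, all objects labeled.
points = [0.0, 0.2, 0.4, 5.0, 5.1, 5.3, 10.0, 10.2]
labels = {i: c for i, c in enumerate("aaabbbcc")}
best_eps, best_score = select_eps(points, labels, [0.05, 1.0, 6.0])
print(best_eps, best_score)
```

On this toy data the smallest candidate splits every object into its own cluster and the largest merges everything, so the middle value scores best. The same loop could in principle range over the number of clusters instead of a density parameter.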
