Dealing with Overlapping Clustering: A Constraint-based Approach to Algorithm Selection

When confronted with a clustering problem, one has to choose which algorithm to run. Building a system that automatically chooses an algorithm for a given task is known as the algorithm selection problem. Unlike the well-studied task of classification, clustering algorithm selection cannot rely on labels to decide which algorithm to use. In the context of constraint-based clustering, however, we argue that constraints can help in the algorithm selection process. We introduce the CBO value, a measure based on must-link and cannot-link constraints that quantifies the amount of overlap in a dataset. We demonstrate its usefulness by using it to choose between two clustering algorithms, EM and spectral clustering. This simple method yields an average performance increase, demonstrating the potential of using constraints in clustering algorithm selection.
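
Below is a minimal sketch of how such constraint-based selection could look in practice, assuming scikit-learn. The abstract does not spell out the CBO formula, so the `overlap_score` proxy, the `cutoff` threshold, and the function names are illustrative assumptions rather than the paper's actual method: the proxy treats a cannot-link pair that is as close as a typical must-link pair as evidence of overlap, and a high score steers the choice toward the soft, model-based EM clustering rather than spectral clustering.

```python
# Hypothetical sketch: pick between EM (Gaussian mixture) and spectral
# clustering using a constraint-based overlap proxy. This is NOT the
# paper's CBO definition, only an illustrative stand-in.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import SpectralClustering


def overlap_score(X, must_links, cannot_links):
    """Fraction of cannot-link pairs closer than the median must-link distance."""
    ml_dists = [np.linalg.norm(X[i] - X[j]) for i, j in must_links]
    cl_dists = [np.linalg.norm(X[i] - X[j]) for i, j in cannot_links]
    threshold = np.median(ml_dists)
    # Cannot-link pairs as close as typical must-link pairs hint at overlapping clusters.
    return float(np.mean([d <= threshold for d in cl_dists]))


def select_and_cluster(X, must_links, cannot_links, n_clusters, cutoff=0.2):
    """Cluster X after choosing an algorithm from the constraint-based score."""
    if overlap_score(X, must_links, cannot_links) > cutoff:
        # High overlap: a soft, model-based method such as EM on a Gaussian mixture.
        return GaussianMixture(n_components=n_clusters).fit_predict(X)
    # Low overlap: spectral clustering on the well-separated data.
    return SpectralClustering(n_clusters=n_clusters).fit_predict(X)
```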
