HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis

Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of clustering as an unsupervised learning task and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and the indices used for internal cluster validation. Consequently, there is no consensus on best practice for rigorous benchmarking, nor on whether such benchmarking is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this requires benchmarks that appropriately cover the diverse set of properties affecting clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms can play in supporting the flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
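As a rough illustration of the first use case, the following sketch evolves the means and a shared standard deviation of a few 2-D Gaussian clusters so that the mean silhouette width of the sampled dataset approaches a target value. It is a toy (1+λ) evolution strategy and not the HAWKS implementation: the choice of silhouette width as the target property, the genome encoding, and every function name and parameter value below are assumptions made for illustration only.

```python
import math
import random


def sample_dataset(means, std, n_per_cluster, seed):
    """Draw points from isotropic 2-D Gaussians; returns (points, labels)."""
    rng = random.Random(seed)  # fixed seed: the same genome yields the same dataset
    points, labels = [], []
    for label, (mx, my) in enumerate(means):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(mx, std), rng.gauss(my, std)))
            labels.append(label)
    return points, labels


def silhouette_width(points, labels):
    """Mean silhouette width of a labelled dataset (O(n^2), fine for small n)."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, p in enumerate(points):
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for j, q in enumerate(points):
            if i != j:  # exclude the point itself
                sums[labels[j]] += math.dist(p, q)
                counts[labels[j]] += 1
        a = sums[labels[i]] / counts[labels[i]]  # mean intra-cluster distance
        b = min(sums[c] / counts[c] for c in clusters if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(points)


def fitness(genome, target, n_per_cluster=15):
    """Absolute gap between the dataset's silhouette width and the target."""
    means, std = genome
    points, labels = sample_dataset(means, std, n_per_cluster, seed=0)
    return abs(silhouette_width(points, labels) - target)


def evolve(target, n_clusters=3, generations=30, offspring=8, seed=1):
    """Toy (1+lambda) strategy: keep the parent unless a child is at least as good."""
    rng = random.Random(seed)
    parent = ([(rng.uniform(-5, 5), rng.uniform(-5, 5))
               for _ in range(n_clusters)], 1.0)
    parent_fit = fitness(parent, target)
    initial_fit = parent_fit
    for _ in range(generations):
        children = []
        for _ in range(offspring):
            # Gaussian mutation of every cluster mean and of the shared std
            means = [(mx + rng.gauss(0, 0.5), my + rng.gauss(0, 0.5))
                     for mx, my in parent[0]]
            std = max(0.05, parent[1] + rng.gauss(0, 0.1))
            child = (means, std)
            children.append((fitness(child, target), child))
        best_fit, best = min(children, key=lambda fc: fc[0])
        if best_fit <= parent_fit:
            parent, parent_fit = best, best_fit
    return parent, parent_fit, initial_fit


if __name__ == "__main__":
    genome, final_fit, initial_fit = evolve(target=0.5)
    print(f"gap to target: {initial_fit:.3f} -> {final_fit:.3f}")
```

Because the parent is only replaced by a child with an equal or smaller gap, the gap to the target never increases across generations. The second use case could be sketched in the same way by swapping the fitness for the difference in performance between two clustering algorithms on the sampled dataset.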
