Evolving controllably difficult datasets for clustering

Synthetic datasets play an important role in evaluating clustering algorithms, as they can help shed light on consistent biases, strengths, and weaknesses of particular techniques, thereby supporting sound conclusions. Despite this, surprisingly few established clustering benchmark datasets exist, and many of these are handcrafted. Even then, their difficulty is typically not quantified or considered, which limits the ability to interpret algorithmic performance on them. Here, we introduce HAWKS, a new data generator that uses an evolutionary algorithm to evolve the cluster structure of a synthetic dataset. We demonstrate how such an approach can be used to produce datasets of a pre-specified difficulty and to trade off different aspects of problem difficulty, and we show how these interventions translate directly into changes in the clustering performance of established algorithms.
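The idea of evolving a dataset towards a target difficulty can be illustrated with a small sketch. The abstract does not specify the representation, variation operators, or difficulty measure used by HAWKS, so everything below is an assumption made for illustration only: each individual encodes isotropic 2-D Gaussian clusters (centre and standard deviation), difficulty is approximated by the mean silhouette width, and a simple elitist (mu + lambda) loop with Gaussian mutation minimises the distance to the target value. The names `silhouette_width`, `evolve`, and `decode` are hypothetical and are not part of any HAWKS API.

```python
import numpy as np


def silhouette_width(X, labels):
    """Mean silhouette width of a labelled dataset, computed with NumPy only."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: silhouette is 0 by convention
        a = D[i, own].sum() / (n_own - 1)  # mean distance within own cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


def evolve(target=0.5, k=3, n_per_cluster=30, pop_size=10, generations=30, seed=0):
    """Evolve cluster parameters so the silhouette width approaches `target`.

    Illustrative assumption: a genome is a (k, 3) array holding, per cluster,
    a 2-D centre and one standard deviation. This is not the HAWKS encoding.
    """
    rng = np.random.default_rng(seed)
    # Fixed noise makes the genome -> dataset mapping deterministic,
    # so the fitness of an individual does not change between evaluations.
    noise = rng.standard_normal((k, n_per_cluster, 2))
    labels = np.repeat(np.arange(k), n_per_cluster)

    def decode(genome):
        centres = genome[:, :2]
        stds = np.abs(genome[:, 2]) + 1e-3  # keep standard deviations positive
        X = centres[:, None, :] + stds[:, None, None] * noise
        return X.reshape(-1, 2)

    def fitness(genome):
        # Distance between achieved and requested silhouette width (lower is better).
        return abs(silhouette_width(decode(genome), labels) - target)

    pop = [rng.uniform(0.0, 5.0, size=(k, 3)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Gaussian mutation produces one child per parent.
        children = [p + rng.normal(0.0, 0.3, size=p.shape) for p in pop]
        # Elitist (mu + lambda) survivor selection.
        pop = sorted(pop + children, key=fitness)[:pop_size]
        history.append(fitness(pop[0]))
    return decode(pop[0]), labels, history


if __name__ == "__main__":
    X, labels, history = evolve(target=0.5)
    print("distance to target per generation:", np.round(history, 3))
```

Because selection is elitist and the fitness is deterministic, the recorded distance to the target cannot increase from one generation to the next. Lowering `target` in this sketch would favour more overlapping clusters, which is one simple way to picture a "pre-specified difficulty".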
