MDCGen: Multidimensional Dataset Generator for Clustering

We present a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to underlying distributions for the creation of multidimensional clusters. As a novelty, normal and non-normal distributions can be combined for either independently defining values feature by feature (i.e., multivariate distributions) or establishing overall intra-cluster distances. Being highly flexible, parameterizable, and randomizable, MDCGen also implements classic pursued features: (a) customization of cluster-separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) importing arbitrary distributions for the value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers an improved range of potential datasets to perform a more comprehensive testing of clustering algorithms.

[1]  Joydeep Ghosh,et al.  Model-based overlapping clustering , 2005, KDD '05.

[2]  Yaling Pei,et al.  A Synthetic Data Generator for Clustering and Outlier Analysis , 2006 .

[3]  N. Higham COMPUTING A NEAREST SYMMETRIC POSITIVE SEMIDEFINITE MATRIX , 1988 .

[4]  A. Zimek,et al.  On Using Class-Labels in Evaluation of Clusterings , 2010 .

[5]  Harry Joe,et al.  Generation of Random Clusters with Specified Degree of Separation , 2006, J. Classif..

[6]  J. Korzeniewski Empirical Evaluation of OCLUS and GenRandomClust Algorithms of Generating Cluster Structures , 2013 .

[7]  G. W. Milligan,et al.  A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. , 1986, Multivariate behavioral research.

[8]  Arthur Zimek,et al.  A Framework for Clustering Uncertain Data , 2015, Proc. VLDB Endow..

[9]  Daniele Mortari,et al.  On the Rigid Rotation Concept in n-Dimensional Spaces: Part II , 2001 .

[10]  Joshua D. Knowles,et al.  Multiobjective clustering around medoids , 2005, 2005 IEEE Congress on Evolutionary Computation.

[11]  Robert Henson,et al.  OCLUS: An Analytic Method for Generating Clusters with Known Overlap , 2005, J. Classif..

[12]  Michel Verleysen,et al.  The Concentration of Fractional Distances , 2007, IEEE Transactions on Knowledge and Data Engineering.

[13]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.

[14]  P. Rousseeuw Silhouettes: a graphical aid to the interpretation and validation of cluster analysis , 1987 .