MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms

The R package MixSim is a new tool that allows simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as a sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datasets can then be used for systematic performance investigation of clustering and finite mixture modeling algorithms. Among other capabilities of MixSim, there are computing the exact overlap for Gaussian mixtures, simulating Gaussian and non-Gaussian data, simulating outliers and noise variables, calculating various measures of agreement between two partitionings, and constructing parallel distribution plots for the graphical display of finite mixture models. All features of the package are illustrated in great detail. The utility of the package is highlighted through a small comparison study of several popular clustering algorithms.

[1]  R. Davies The distribution of a linear combination of 2 random variables , 1980 .

[2]  Adrian E. Raftery,et al.  MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering † , 2007 .

[3]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[4]  Harry Joe,et al.  Generation of Random Clusters with Specified Degree of Separation , 2006, J. Classif..

[5]  R. Fisher THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS , 1936 .

[6]  Robert Henson,et al.  OCLUS: An Analytic Method for Generating Clusters with Known Overlap , 2005, J. Classif..

[7]  C. Mallows,et al.  A Method for Comparing Two Hierarchical Clusterings , 1983 .

[8]  T. Sørensen,et al.  A method of establishing group of equal amplitude in plant sociobiology based on similarity of species content and its application to analyses of the vegetation on Danish commons , 1948 .

[9]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[10]  Michael A. Malcolm,et al.  Computer methods for mathematical computations , 1977 .

[11]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[12]  David L. Wallace,et al.  A Method for Comparing Two Hierarchical Clusterings: Comment , 1983 .

[13]  Harry Joe,et al.  Separation index and partial membership for clustering , 2006, Comput. Stat. Data Anal..

[14]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[15]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[16]  Sanjoy Dasgupta,et al.  Learning mixtures of Gaussians , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[17]  Ranjan Maitra,et al.  CARP: Software for Fishing Out Good Clustering Algorithms , 2011, J. Mach. Learn. Res..

[18]  Boris Mirkin,et al.  Mathematical Classification and Clustering , 1996 .

[19]  Volodymyr Melnykov,et al.  Finite mixture models and model-based clustering , 2010 .

[20]  Kurt Hornik,et al.  Misc Functions of the Department of Statistics (e1071), TU Wien , 2014 .

[21]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[22]  Ranjan Maitra,et al.  Simulating Data to Study Performance of Finite Mixture Modeling and Clustering Algorithms , 2010 .

[23]  P. Sneath The application of computers to taxonomy. , 1957, Journal of general microbiology.