Studying Complexity of Model-based Clustering

Cluster analysis is a popular technique in statistics and computer science that is widely used across many areas of research. In this article, we investigate factors that influence clustering performance in the model-based clustering framework. We consider four factors: the level of overlap between clusters, the number of clusters, the number of dimensions, and the sample size. Their effects are examined through a comprehensive simulation study covering a range of settings. To measure clustering performance, we employ three popular classification indices that reflect the degree of agreement between two partitioning vectors, which makes it possible to compare the true and estimated classification vectors. In addition to studying clustering complexity, we evaluate the performance of the three classification measures themselves.
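To illustrate what an agreement index between two partitioning vectors computes, the following minimal sketch implements the pairwise Rand index in plain Python. The abstract does not name the three indices used in the study, so the choice of the Rand index here is an assumption made purely for illustration, and the function name `rand_index` and the toy label vectors are hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, est_labels):
    """Fraction of observation pairs on which two partitions agree.

    A pair counts as an agreement when both partitions place the two
    observations in the same cluster, or both place them in different
    clusters. Label values themselves do not matter, only the grouping.
    """
    if len(true_labels) != len(est_labels):
        raise ValueError("label vectors must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("at least two observations are required")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_true = true_labels[i] == true_labels[j]
        same_in_est = est_labels[i] == est_labels[j]
        if same_in_true == same_in_est:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical toy example: the estimated partition recovers the true
# grouping exactly, only with the cluster labels swapped.
true_ids = [0, 0, 1, 1]
est_ids = [1, 1, 0, 0]
print(rand_index(true_ids, est_ids))
```

Because the index depends only on which observations are grouped together, relabeling the clusters does not change its value, which is the property that makes comparing a true classification vector with an estimated one meaningful.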
