Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters

Cluster analysis poses two notoriously hard problems: estimating the number of clusters, and checking whether the population to be clustered is in fact homogeneous. Given a dataset, a clustering method, and a cluster validation index, this paper proposes setting up null models that capture structural features of the data that cannot be interpreted as indicating clustering. Artificial datasets are then sampled from the null model, with parameters estimated from the original dataset. This parametric bootstrap can be used to test the null hypothesis of a homogeneous population against a clustering alternative. It can also be used to calibrate the validation index for estimating the number of clusters, by taking into account the expected distribution of the index under the null model for any given number of clusters. The approach is illustrated by three examples involving different clustering techniques (partitioning around medoids, hierarchical methods, a Gaussian mixture model), validation indexes (average silhouette width, prediction strength, and BIC), and issues such as mixed-type data and temporal and spatial autocorrelation.
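
The testing procedure described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and not taken from the paper: the data are one-dimensional, the null model is a single Gaussian with mean and standard deviation estimated from the data, the clustering method is a basic 2-means, and the validation index is the average silhouette width. The function names (`two_means_1d`, `avg_silhouette`, `bootstrap_test`) are hypothetical.

```python
import random
import statistics


def mean(values):
    return sum(values) / len(values)


def two_means_1d(xs, iters=50):
    """Basic 1-D k-means with k=2 (illustrative stand-in for any clustering method)."""
    c0, c1 = min(xs), max(xs)  # initialise centres at the extremes
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, l in zip(xs, labels) if l == 0]
        g1 = [x for x, l in zip(xs, labels) if l == 1]
        if not g0 or not g1:
            break
        n0, n1 = mean(g0), mean(g1)
        if (n0, n1) == (c0, c1):  # converged
            break
        c0, c1 = n0, n1
    return labels


def avg_silhouette(xs, labels):
    """Average silhouette width; points in singleton clusters contribute 0."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        same = [abs(xs[i] - xs[j]) for j in range(n)
                if j != i and labels[j] == labels[i]]
        other = [abs(xs[i] - xs[j]) for j in range(n) if labels[j] != labels[i]]
        if not same or not other:
            continue
        a, b = mean(same), mean(other)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


def bootstrap_test(xs, n_boot=99, seed=0):
    """Parametric bootstrap test of homogeneity under an assumed Gaussian null."""
    rng = random.Random(seed)
    mu, sd = mean(xs), statistics.stdev(xs)  # null parameters estimated from data
    observed = avg_silhouette(xs, two_means_1d(xs))
    null_vals = []
    for _ in range(n_boot):
        sim = [rng.gauss(mu, sd) for _ in xs]  # artificial dataset from the null
        null_vals.append(avg_silhouette(sim, two_means_1d(sim)))
    # Monte Carlo p-value: large index values count as evidence for clustering.
    p_value = (1 + sum(v >= observed for v in null_vals)) / (n_boot + 1)
    return observed, null_vals, p_value


# Usage: two well-separated groups, so the homogeneity null should be rejected.
rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(30)] + [rng.gauss(10, 1) for _ in range(30)]
observed, null_vals, p_value = bootstrap_test(data)
print(f"observed index = {observed:.3f}, p-value = {p_value:.3f}")
```

Under the same assumptions, repeating the simulation for each candidate number of clusters and comparing the observed index with the null distribution for that number would give a calibrated version of the index, in the spirit of the second use mentioned in the abstract.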
