How many data clusters are in the Galaxy data set?

In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287–304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach due to the fact that the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin’s concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixtures of finite mixture model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.

[1]  M. Aitkin Likelihood and Bayesian analysis of mixtures , 2001 .

[2]  S. Frühwirth-Schnatter,et al.  Spying on the prior of the number of data clusters and the partition distribution in Bayesian cluster analysis , 2020, Australian & New Zealand Journal of Statistics.

[3]  Murray Aitkin,et al.  Statistical Modelling of Data on Teaching Styles , 1981 .

[4]  Agostino Nobile,et al.  On the posterior distribution of the number of components in a finite mixture , 2004, math/0503673.

[5]  L. Wasserman,et al.  Practical Bayesian Density Estimation Using Mixtures of Normals , 1997 .

[6]  Sylvia Frühwirth-Schnatter,et al.  Finite Mixture and Markov Switching Models , 2006 .

[7]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .

[8]  M. Escobar,et al.  Bayesian Density Estimation and Inference Using Mixtures , 1995 .

[9]  P. McCullagh,et al.  How many clusters , 2008 .

[10]  Gertraud Malsiner-Walli,et al.  Dynamic mixtures of finite mixtures and telescoping sampling , 2020 .

[11]  G. McLachlan On Bootstrapping the Likelihood Ratio Test Statistic for the Number of Components in a Normal Mixture , 1987 .

[12]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[13]  Ulric J. Lund A Handbook of Statistical Analyses Using R , 2007 .

[14]  B. Carlin,et al.  Bayesian Model Choice Via Markov Chain Monte Carlo Methods , 1995 .

[15]  Jeffrey W. Miller,et al.  Mixture Models With a Prior on the Number of Components , 2015, Journal of the American Statistical Association.

[16]  P. Green,et al.  Corrigendum: On Bayesian analysis of mixtures with an unknown number of components , 1997 .

[17]  C. Hennig,et al.  How to find an appropriate clustering for mixed‐type variables with application to socio‐economic stratification , 2013 .

[18]  Sylvia Frühwirth-Schnatter,et al.  Generalized Mixtures of Finite Mixtures and Telescoping Sampling , 2020, Bayesian Analysis.

[19]  Walter R. Gilks,et al.  Bayesian model comparison via jump diffusions , 1995 .

[20]  Gertraud Malsiner-Walli,et al.  Model-based clustering based on sparse finite Gaussian mixtures , 2014, Statistics and Computing.

[21]  David J. Lunn,et al.  The BUGS Book: A Practical Introduction to Bayesian Analysis , 2013 .

[22]  K. Roeder Density estimation with confidence sets exemplified by superclusters and voids in the galaxies , 1990 .

[23]  M. Postman,et al.  Probes of large-scale structure in the Corona Borealis region. , 1986 .

[24]  Paul D. McNicholas,et al.  Model-Based Clustering , 2016, Journal of Classification.

[25]  M. Pierre,et al.  Probes for the large-scale structure , 1990 .

[26]  Luca Scrucca,et al.  mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models , 2016, R J.

[27]  M. Degroot,et al.  Modeling lake-chemistry distributions: approximate Bayesian methods for estimating a finite-mixture model , 1992 .

[28]  R. Redner,et al.  Mixture densities, maximum likelihood, and the EM algorithm , 1984 .