Improved criteria for clustering based on the posterior similarity matrix

In this paper we address the problem of obtaining a single clustering estimate $\hat{c}$ based on an MCMC sample of clusterings $c^{(1)}, c^{(2)}, \ldots, c^{(M)}$ from the posterior distribution of a Bayesian cluster model. Methods to derive $\hat{c}$ when the number of groups K varies between the clusterings are reviewed and discussed. These include the maximum a posteriori (MAP) estimate and methods based on the posterior similarity matrix, a matrix containing the posterior probabilities that the observations i and j are in the same cluster. The posterior similarity matrix is related to a commonly used loss function by Binder (1978). Minimizing this loss is shown to be equivalent to maximizing the Rand index between the estimated and the true clustering. We propose new criteria for estimating a clustering, which are based on the posterior expected adjusted Rand index. The criteria are shown to possess a shrinkage property and outperform Binder's loss in a simulation study and in an application to gene expression data. They also perform favorably compared to other clustering procedures.
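The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy label matrix, the equal-cost form of Binder's loss, and the restriction of candidate estimates to the sampled clusterings are all assumptions made for illustration. The posterior expected adjusted Rand index is approximated here by a plain Monte Carlo average over the draws, and at least two observations are assumed.

```python
import numpy as np

def posterior_similarity_matrix(samples):
    """Estimate P(i and j share a cluster) from an (M, n) array of label draws."""
    samples = np.asarray(samples)
    # same[m, i, j] is True when observations i and j share a cluster in draw m
    same = samples[:, :, None] == samples[:, None, :]
    return same.mean(axis=0)

def binder_loss(labels, psm):
    """Posterior expected Binder loss with equal costs (an assumption), over pairs i < j."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    upper = np.triu_indices(len(labels), k=1)
    return np.abs(same[upper] - psm[upper]).sum()

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors of equal length (n >= 2)."""
    a = np.unique(np.asarray(a), return_inverse=True)[1]
    b = np.unique(np.asarray(b), return_inverse=True)[1]
    table = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(table, (a, b), 1)  # contingency table of the two clusterings

    def pairs(x):
        return x * (x - 1) / 2.0  # number of unordered pairs among x items

    index = pairs(table).sum()
    sum_a = pairs(table.sum(axis=1)).sum()
    sum_b = pairs(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / pairs(len(a))
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:  # degenerate case, e.g. both clusterings are trivial
        return 1.0
    return (index - expected) / (max_index - expected)

# Toy MCMC output (hypothetical): M = 4 draws for n = 4 observations.
# Draw 3 is draw 1 with switched labels; the number of groups is 2 in every draw.
samples = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

psm = posterior_similarity_matrix(samples)

# Candidate estimates are restricted to the sampled clusterings (an assumption).
binder_scores = np.array([binder_loss(c, psm) for c in samples])
pear_scores = np.array([
    np.mean([adjusted_rand_index(c, d) for d in samples]) for c in samples
])

best_binder = samples[binder_scores.argmin()]
best_pear = samples[pear_scores.argmax()]
```

Because both the similarity matrix and the adjusted Rand index depend only on which observations are grouped together, the label switching in the third draw does not affect either criterion. In this toy example both criteria select the same partition; the abstract's claim that the adjusted-Rand criteria outperform Binder's loss concerns settings where they disagree.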

[1]  William M. Rand,et al.  Objective Criteria for the Evaluation of Clustering Methods , 1971 .

[2]  D. Blackwell,et al.  Ferguson Distributions Via Polya Urn Schemes , 1973 .

[3]  T. Ferguson A Bayesian Analysis of Some Nonparametric Problems , 1973 .

[4]  C. Antoniak Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems , 1974 .

[5]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[6]  D. Binder Bayesian cluster analysis , 1978 .

[7]  G. W. Milligan,et al.  A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. , 1986, Multivariate behavioral research.

[8]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[9]  R. Berk,et al.  Continuous Univariate Distributions, Volume 2 , 1995 .

[10]  M. Escobar,et al.  Bayesian Density Estimation and Inference Using Mixtures , 1995 .

[11]  Adrian E. Raftery,et al.  Inference in model-based cluster analysis , 1997, Stat. Comput..

[12]  J. Pitman,et al.  The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator , 1997 .

[13]  P. Green,et al.  On Bayesian Analysis of Mixtures with an Unknown Number of Components (with discussion) , 1997 .

[14]  M. Stephens Dealing with label switching in mixture models , 2000 .

[15]  Lancelot F. James,et al.  Gibbs Sampling Methods for Stick-Breaking Priors , 2001 .

[16]  Roger E Bumgarner,et al.  Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. , 2001, Science.

[17]  Adrian E. Raftery,et al.  Model-Based Clustering, Discriminant Analysis, and Density Estimation , 2002 .

[18]  F. Quintana,et al.  Bayesian clustering and product partition models , 2003 .

[19]  C. Robert,et al.  Estimating Mixtures of Regressions , 2003 .

[20]  Ka Yee Yeung,et al.  Bayesian mixture model based clustering of replicated microarray data , 2004, Bioinform..

[21]  M. Vannucci,et al.  Bayesian Variable Selection in Clustering High-Dimensional Data , 2005 .

[22]  J. E. Griffin,et al.  Order-Based Dependent Dirichlet Processes , 2006 .

[23]  Shane T. Jensen,et al.  Bayesian Clustering of Transcription Factor Binding Motifs , 2006, math/0610655.

[24]  D. B. Dahl Bayesian Inference for Gene Expression and Proteomics: Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model , 2006 .

[25]  Marina Vannucci,et al.  Variable selection in clustering via Dirichlet process mixture models , 2006 .

[26]  Zhaohui S. Qin,et al.  Clustering microarray gene expression data using weighted Chinese restaurant process , 2006, Bioinform..

[27]  Ramsés H. Mena,et al.  Controlling the reinforcement in Bayesian non‐parametric mixture models , 2007 .

[28]  M. Newton,et al.  Multiple Hypothesis Testing by Clustering Treatment Effects , 2007 .

[29]  P. Green,et al.  Bayesian Model-Based Clustering Procedures , 2007 .

[30]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[31]  David B. Dunson,et al.  Bayesian Nonparametrics: Nonparametric Bayes applications to biostatistics , 2010 .