Bayesian Clustering with Variable and Transformation Selections

SUMMARY The clustering problem has attracted much attention from both statisticians and computer scientists in the past fifty years. Methods such as hierarchical clustering and K-means are convenient, competitive, off-the-shelf first choices for the scientist. Gaussian mixture modeling is another popular clustering strategy, but it is computationally expensive, especially when the data are high-dimensional. We propose to first conduct a principal component analysis (PCA) or correspondence analysis (CA) for dimension reduction, and then to fit Gaussian mixtures to the data projected onto the leading PCA or CA directions. This approach raises two technical difficulties: (a) selecting the subset of PCA or CA factors that are informative for clustering, and (b) selecting a proper transformation for each factor. We propose a Bayesian formulation and Markov chain Monte Carlo strategies that overcome both difficulties, and we examine the performance of the new method through simulation studies and real applications in molecular imaging analysis and DNA microarray analysis.
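The two-stage idea in the summary (reduce dimension, then fit Gaussian mixtures factor by factor) can be illustrated with a minimal sketch. This is not the Bayesian MCMC procedure proposed here: as a simplified stand-in, it fits a two-component mixture to each PCA factor by EM and calls a factor "informative" when the mixture has a lower BIC than a single Gaussian. The per-factor transformation step is omitted. All function names, the synthetic data, and the BIC criterion are assumptions made for illustration.

```python
import numpy as np

def pca_scores(X, n_components):
    """Project centered data onto its leading principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def log_normal_pdf(x, mean, var):
    """Elementwise log density of N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def bic_one_gaussian(x):
    """BIC of a single Gaussian fitted by maximum likelihood (2 parameters)."""
    loglik = log_normal_pdf(x, x.mean(), x.var()).sum()
    return -2 * loglik + 2 * np.log(x.size)

def bic_two_gaussians(x, n_iter=200):
    """BIC of a two-component 1-D Gaussian mixture fitted by EM (5 parameters)."""
    n = x.size
    lower = x <= np.median(x)          # crude initialization: split at the median
    w = np.array([0.5, 0.5])
    mu = np.array([x[lower].mean(), x[~lower].mean()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = np.log(w) + log_normal_pdf(x[:, None], mu, var)
        log_norm = np.logaddexp(log_p[:, 0], log_p[:, 1])
        resp = np.exp(log_p - log_norm[:, None])
        # M-step: weighted updates; small floors guard against empty components.
        nk = resp.sum(axis=0) + 1e-10
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    log_p = np.log(w) + log_normal_pdf(x[:, None], mu, var)
    loglik = np.logaddexp(log_p[:, 0], log_p[:, 1]).sum()
    return -2 * loglik + 5 * np.log(n)

# Synthetic data (an assumption): two clusters separated along one coordinate only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:100, 0] += 6.0

scores = pca_scores(X, n_components=3)
# A factor is flagged when the two-component mixture beats a single Gaussian on BIC.
informative = [bic_two_gaussians(s) < bic_one_gaussian(s) for s in scores.T]
print(informative)
```

In this toy setting the cluster structure lives in a single direction, so the leading factor should be flagged. In real data the informative factors need not be the ones with the largest variance, which is why factor selection is treated as part of the model rather than fixed in advance.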
