Bayesian Inference for Gene Expression and Proteomics: Model-Based Clustering for Expression Data via a Dirichlet Process Mixture Model

This chapter describes a clustering procedure for microarray expression data based on a well-defined statistical model: a conjugate Dirichlet process mixture model. The procedure groups genes whose latent variables governing expression are equal, that is, genes that belong to the same mixture component. The model is fit with Markov chain Monte Carlo, and conjugacy is exploited to ease the computational burden. The chapter also introduces a method for obtaining a point estimate of the true clustering, based on least-squares distances from the posterior probabilities that two genes are clustered together. Unlike ad hoc clustering methods, the model provides measures of uncertainty about the clustering. It also estimates the number of clusters automatically and quantifies the uncertainty about this important parameter. The method is compared with other clustering methods in a simulation study and is then demonstrated on actual microarray data.
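
The least-squares point estimate can be illustrated with a small sketch. It is not the chapter's implementation: it assumes the candidate clusterings are the label vectors drawn by the sampler, that pairwise probabilities are estimated by simple averaging over those draws, and the function names and the toy draws are hypothetical.

```python
def pairwise_probabilities(samples):
    """Estimate, for every pair (i, j), the share of draws in which
    items i and j carry the same cluster label."""
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(samples) for c in row] for row in counts]


def squared_distance(labels, prob):
    """Sum of squared differences between a clustering's 0/1
    co-clustering indicators and the pairwise probabilities."""
    n = len(labels)
    return sum(
        ((1.0 if labels[i] == labels[j] else 0.0) - prob[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )


def least_squares_clustering(samples):
    """Return the index of the draw closest to the pairwise
    probabilities, together with the probabilities and all distances."""
    prob = pairwise_probabilities(samples)
    losses = [squared_distance(labels, prob) for labels in samples]
    best = min(range(len(samples)), key=losses.__getitem__)
    return best, prob, losses


# Toy draws for four genes (made-up labels, not real MCMC output).
# The last draw is the same partition as the first with labels swapped,
# which the pairwise comparison treats as identical.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
best, prob, losses = least_squares_clustering(samples)
estimate = samples[best]
```

Because the comparison uses only whether two genes share a label, the estimate does not depend on how the clusters are numbered, and the pairwise probabilities themselves serve as a measure of uncertainty about each pair.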
