Sparse Bayesian hierarchical modeling of high-dimensional clustering problems

Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well known that in such high-dimensional settings, the many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on a Dirichlet process with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that distinguish only a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering the samples and for regularizing the high-dimensional mean/variance structure. To address the computational challenge posed by this double usage of the DP, we propose a sequential sampling scheme embedded within Markov chain Monte Carlo (MCMC) updates, which improves on the naive implementation of existing algorithms for DP mixture models. Our method is demonstrated in a simulation study and illustrated with the leukemia gene expression dataset.
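
As background on how a DP induces a clustering of samples, the following minimal sketch draws cluster labels from the Chinese restaurant process, the standard partition representation of a DP prior. It is not the sequential sampling scheme proposed here, and it involves no data likelihood or sparsity prior; the function name `crp_assignments` and the parameters `alpha` and `seed` are hypothetical, chosen only for illustration.

```python
import random


def crp_assignments(n_samples, alpha, seed=0):
    """Draw cluster labels for n_samples items from a Chinese restaurant process.

    alpha is the DP concentration parameter (assumed > 0): larger values
    tend to produce more clusters. This samples from the prior only.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently assigned to cluster k
    for _ in range(n_samples):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = crp_assignments(n_samples=20, alpha=1.0, seed=1)
    print(labels)
    print("number of clusters:", len(set(labels)))
```

The number of clusters is not fixed in advance but grows with the number of samples, which is the property that makes the DP attractive for clustering when the number of subtypes is unknown.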
