Outcome-Driven Cluster Analysis with Application to Microarray Data

One goal of cluster analysis is to sort characteristics into groups (clusters) so that those in the same group are more highly correlated with each other than with those in other groups. An example is the search for groups of genes whose RNA expression is correlated in a population of patients. Such genes would be of greater interest if their common level of RNA expression were also predictive of the clinical outcome. This issue arose in a study of trauma patients for whom RNA samples were available. The question of interest was whether there were groups of genes that behaved similarly, and whether every gene in a cluster had a similar effect on which patients recovered. To address this, we develop an algorithm that simultaneously assigns characteristics (genes) to groups of highly correlated genes that have the same effect on the outcome (recovery). We propose a random effects model in which the expression of each gene within a group (cluster) equals the sum of a random effect, specific to the observation and cluster, and an independent error term. The outcome variable is a linear combination of the random effects of each cluster. To fit the model, we implement a Markov chain Monte Carlo algorithm based on the likelihood of the observed data. We evaluate the effect of including the outcome in the model through simulation studies and describe a strategy for prediction. These methods are applied to trauma data from the Inflammation and Host Response to Injury research program, revealing a clustering of the genes that is informed by the recovery outcome.
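To make the model structure concrete, the following is a minimal simulation sketch of the generative model described above, not the authors' implementation. It draws data only; it does not perform the MCMC fit. The Gaussian distributions, the additive noise on the outcome, the continuous outcome scale, and all names (`simulate`, `correlation`, `sd_effect`, `sd_error`, `sd_outcome`, `betas`) are assumptions made for illustration.

```python
import random


def simulate(n_obs, cluster_sizes, betas,
             sd_effect=1.0, sd_error=0.5, sd_outcome=0.5, seed=0):
    """Draw gene values and an outcome from a cluster random-effects model.

    Assumed for illustration: Gaussian random effects and errors, and
    Gaussian noise added to the outcome.
    """
    rng = random.Random(seed)
    # labels[j] is the cluster index of gene j.
    labels = [k for k, size in enumerate(cluster_sizes) for _ in range(size)]
    genes, outcome = [], []
    for _ in range(n_obs):
        # One random effect per (observation, cluster) pair.
        u = [rng.gauss(0.0, sd_effect) for _ in cluster_sizes]
        # Each gene equals its cluster's random effect plus independent error.
        genes.append([u[k] + rng.gauss(0.0, sd_error) for k in labels])
        # The outcome is a linear combination of the cluster random effects
        # (the extra noise term is an assumption of this sketch).
        linear = sum(b * uk for b, uk in zip(betas, u))
        outcome.append(linear + rng.gauss(0.0, sd_outcome))
    return genes, outcome, labels


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Usage: two clusters of three genes; only the first cluster drives the outcome.
genes, outcome, labels = simulate(500, [3, 3], betas=[1.0, 0.0])
column = lambda j: [row[j] for row in genes]
within = correlation(column(0), column(1))    # genes in the same cluster
between = correlation(column(0), column(3))   # genes in different clusters
```

Under these assumptions, genes that share a cluster are correlated through their common random effect, while genes in different clusters are not. The outcome depends on a gene only through the random effect of that gene's cluster.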
