Clusternomics: Integrative context-dependent clustering for heterogeneous datasets

Integrative clustering identifies groups of samples by jointly analysing multiple datasets that describe the same set of biological samples, such as gene expression, copy number and methylation. Most existing algorithms for integrative clustering assume that a single consistent set of clusters is shared across all datasets and that most samples follow this structure. In practice, however, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method that identifies groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters at the level of individual datasets, while also extracting the global structure that arises from the local cluster assignments. Clusters at both the local and the global level are modelled using a hierarchical Dirichlet mixture model. We evaluated the model on both simulated and real-world datasets. The simulated data exemplify datasets with varying degrees of common structure; in this setting, Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping from heterogeneous datasets. We applied the algorithm to the TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on high-dimensional lung and kidney cancer TCGA datasets, again obtaining clinically significant results and demonstrating the scalability of the algorithm.
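
The idea that global structure arises from local cluster assignments can be illustrated with a small sketch. It is not the Clusternomics inference procedure (the hierarchical Dirichlet mixture model is not reproduced here); it only assumes that each sample already has a cluster label in every dataset and treats each distinct combination of local labels as one global cluster. The function name `global_from_local` and the toy labels are hypothetical.

```python
def global_from_local(local_labels):
    """Assign a global cluster id to each sample from its per-dataset labels.

    local_labels: list of label sequences, one per dataset, all the same length.
    Returns (combos, global_ids), where combos maps each observed tuple of
    local labels to a global id, numbered in order of first appearance.
    """
    combos = {}
    global_ids = []
    for combo in zip(*local_labels):  # one tuple of local labels per sample
        if combo not in combos:
            combos[combo] = len(combos)
        global_ids.append(combos[combo])
    return combos, global_ids


# Toy labels (illustrative only): six samples, two datasets.
expression = [0, 0, 1, 1, 1, 1]   # samples 0-1 and 2-5 are separated here
methylation = [0, 0, 0, 0, 1, 1]  # samples 0-3 are joined here

combos, global_ids = global_from_local([expression, methylation])
print(combos)
print(global_ids)
```

In this toy case, samples 2 and 3 share an expression cluster with samples 4 and 5 but a methylation cluster with samples 0 and 1, so they form a global cluster of their own. This is the kind of partially shared structure that a single consistent clustering across datasets cannot represent.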
