A Bayesian two-way latent structure model for genomic data integration reveals few pan-genomic cluster subtypes in a breast cancer cohort

MOTIVATION Unsupervised clustering is important in disease subtyping, among other genomic applications. As genomic data has become more multifaceted, clustering across data sources for more precise subtyping has become an increasingly important area of research. Many of the methods proposed so far, including iCluster and Cluster of Cluster Assignments, make an unreasonable assumption of a common clustering across all data sources; those that do not are fewer and tend to be computationally intensive. RESULTS We propose a Bayesian parametric model for integrative, unsupervised clustering across data sources. In our two-way latent structure model, samples are clustered in relation to each specific data source, which distinguishes it from methods like Cluster of Cluster Assignments and iCluster, but cluster labels have across-dataset meaning, allowing cluster information to be shared between data sources. A common scaling across data sources is not required, and inference is obtained by a Gibbs sampler, which we improve with a warm-start strategy and modified density functions to make convergence faster and more robust. Posterior interpretation allows for inference on common clusterings occurring among subsets of data sources. The statistical formulation of the model results in sampling from closed-form posteriors despite the incorporation of a complex latent structure. We fit the model with Gaussian and more general densities, a choice that influences the degree of across-dataset cluster label sharing. Uniquely among integrative clustering models, our formulation makes no nestedness assumptions of samples across data sources, so a sample missing data from one genomic source can be clustered according to its existing data sources. We apply our model to a Norwegian breast cancer cohort of ductal carcinoma in situ and invasive tumors, comprising somatic copy-number alteration, methylation and expression datasets.
We find enrichment in the Her2 subtype and ductal carcinoma among those observations exhibiting greater cluster correspondence across expression and copy-number alteration (CNA) data. In general, there are few pan-genomic clusterings, suggesting that models assuming a common clustering across genomic data sources might yield misleading results. IMPLEMENTATION AND AVAILABILITY The model is implemented in an R package called twl ("two-way latent"), available on CRAN. Data for analysis is available within the R package. CONTACT ORCID: 0000-0003-3174-1656. SUPPLEMENTARY MATERIAL Appendices available online include additional breast cancer subtyping analysis and model runs, a comparison with leading integrative clustering methods, the general statistical formulation, a description of the Gibbs sampler improvements, and analyses of the METABRIC and TCGA cohorts.
