Directionally dependent multi-view clustering using copula model

Recent developments in high-throughput methods have resulted in the collection of high-dimensional data types from multiple sources and technologies that measure distinct yet complementary information. Integrated clustering of such data types, known as multi-view clustering, is critical for revealing pathological insights. However, multi-view clustering is challenging because of the complex dependence structure between the data types, including directional dependency. In particular, genomics data types have directional dependencies specified by the central dogma, which describes the flow of information from DNA to messenger RNA (mRNA) and then from mRNA to protein. Most existing multi-view clustering approaches assume either independence or pair-wise (non-directional) dependence between data types, thereby ignoring their directional relationship. Motivated by this, we propose a biology-inspired Bayesian integrated multi-view clustering model that uses an asymmetric copula to accommodate the directional dependencies between the data types. Through extensive simulation experiments, we demonstrate the negative impact of ignoring directional dependency on clustering performance. We also apply our model to a real-world dataset of breast cancer tumor samples collected from The Cancer Genome Atlas program and provide comparative results.
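To make the notion of an asymmetric copula concrete, the following minimal sketch builds one with Khoudraji's device applied to a Clayton copula. This is an illustrative assumption, not the copula family or the model used in this work: the function names, the choice of Clayton as the base copula, and the parameter values `a`, `b`, and `theta` are all hypothetical. The point is only that, when `a` differs from `b`, the resulting copula satisfies C(u, v) ≠ C(v, u), which is the property that lets a model treat the two variables differently by direction.

```python
def clayton(s, t, theta=2.0):
    """Symmetric Clayton copula C0(s, t) for theta > 0 and s, t in (0, 1]."""
    return (s ** -theta + t ** -theta - 1.0) ** (-1.0 / theta)


def khoudraji(u, v, a=0.3, b=0.9, theta=2.0):
    """Asymmetric copula from Khoudraji's device (illustrative parameters).

    C(u, v) = u^(1-a) * v^(1-b) * C0(u^a, v^b), with a, b in [0, 1].
    The copula is exchangeable only when a == b.
    """
    return u ** (1.0 - a) * v ** (1.0 - b) * clayton(u ** a, v ** b, theta)


u, v = 0.2, 0.7
c_uv = khoudraji(u, v)  # C(u, v)
c_vu = khoudraji(v, u)  # C(v, u): arguments swapped
print(f"C(u, v) = {c_uv:.4f}, C(v, u) = {c_vu:.4f}")
```

Swapping the arguments changes the value because the two margins enter the construction with different exponents. A symmetric copula such as the plain Clayton copula cannot express this, so it cannot by itself encode a direction such as DNA to mRNA to protein.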
