netMUG: a novel network-guided multi-view clustering workflow for dissecting genetic and facial heterogeneity

Multi-view data offer advantages over single-view data for characterizing individuals, which is crucial in precision medicine for personalized prevention, diagnosis, and treatment follow-up. Here, we develop a network-guided multi-view clustering framework named netMUG to identify actionable subgroups of individuals. The pipeline first applies sparse multiple canonical correlation analysis to select multi-view features, possibly informed by extraneous data; these features are then used to construct individual-specific networks (ISNs). Finally, individual subtypes are derived automatically by hierarchical clustering on these network representations. We applied netMUG to a dataset containing genomic data and facial images to obtain BMI-informed multi-view strata and showed how it could be used for a refined obesity characterization. Benchmark analysis of netMUG on synthetic data with known strata of individuals indicated superior performance compared with both baseline and benchmark methods for multi-view clustering. In addition, the real-data analysis revealed subgroups strongly linked to BMI, as well as genetic and facial determinants of these classes. netMUG provides a powerful strategy that exploits individual-specific networks to identify meaningful and actionable strata. Moreover, the implementation is easy to generalize to accommodate heterogeneous data sources or to highlight data structures.

Author summary

In recent years, it has become increasingly feasible to collect data from multiple modalities in various fields, which calls for novel methods that exploit the consensus among different data types. As exemplified in systems biology and epistasis analyses, the interactions between features may contain more information than the features themselves, which motivates the use of feature networks. Furthermore, in real-life scenarios, subjects such as patients or individuals may originate from diverse populations, which underscores the importance of subtyping or clustering these subjects to account for their heterogeneity. In this study, we present a novel pipeline for selecting the most relevant features from multiple data types, constructing a feature network for each subject, and obtaining a subgrouping of samples informed by a phenotype of interest. We validated our method on synthetic data and demonstrated its superiority over several state-of-the-art multi-view clustering approaches. Additionally, we applied our method to a real-life, large-scale dataset of genomic data and facial images, where it identified a meaningful BMI subtyping that complemented existing BMI categories and offered new biological insights. Our proposed method has wide applicability to complex multi-view or multi-omics datasets for tasks such as disease subtyping or personalized medicine.
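The last two steps of the pipeline (one feature network per subject, then hierarchical clustering on those networks) can be illustrated with a minimal Python sketch. It is not the netMUG implementation. It assumes that feature selection has already been done, swaps in a generic leave-one-out (LIONESS-style) Pearson correlation network as a stand-in for the ISN construction, and cuts a Ward tree at a fixed number of clusters instead of deriving the number automatically. The function names `individual_specific_networks` and `cluster_individuals` are hypothetical, and the data are random.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def individual_specific_networks(X):
    """Build one feature-feature network per individual.

    Assumption: a leave-one-out (LIONESS-style) perturbation of a Pearson
    correlation network, used here only as a stand-in for the ISN step.
    X has shape (n_individuals, n_features) and holds pre-selected features.
    """
    n, p = X.shape
    global_net = np.corrcoef(X, rowvar=False)
    isns = np.empty((n, p, p))
    for i in range(n):
        # Network estimated without individual i.
        loo_net = np.corrcoef(np.delete(X, i, axis=0), rowvar=False)
        # Scale the change caused by individual i back to a single-sample network.
        isns[i] = n * (global_net - loo_net) + loo_net
    return isns


def cluster_individuals(isns, n_clusters):
    """Ward hierarchical clustering on the vectorized edge weights.

    Assumption: a fixed number of clusters, a simplification of the
    automatic subtype derivation described in the text.
    """
    p = isns.shape[1]
    rows, cols = np.triu_indices(p, k=1)
    edges = isns[:, rows, cols]  # one row of edge weights per individual
    tree = linkage(edges, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data: 40 individuals, 5 selected features (random, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
isns = individual_specific_networks(X)
labels = cluster_individuals(isns, n_clusters=2)
print(isns.shape, np.bincount(labels)[1:])
```

Each individual is represented by the upper triangle of its own network, so the clustering operates on edge weights and not on the raw features, which is the idea behind clustering on network representations.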
