Multi-omic and multi-view clustering algorithms: review and cancer benchmark

High-throughput experimental methods developed in recent years have been used to collect large biomedical omics datasets. Clustering of such datasets has proven invaluable for biological and medical research and has helped reveal structure in data from several domains. Such analysis is often based on a single omic. The decreasing cost of high-throughput methods and the development of additional ones now enable the measurement of multi-omic data. Clustering multi-omic data has the potential to reveal further systems-level insights, but it raises computational and biological challenges. Here we review algorithms for multi-omics clustering and discuss key issues in applying them. Our review covers methods developed specifically for multi-omic data as well as generic multi-view methods developed in the machine learning community for joint clustering of multiple data types. In addition, using cancer data from TCGA, we perform an extensive benchmark spanning ten different cancer types, providing the first systematic comparison of leading multi-omics and multi-view clustering algorithms. The results highlight several key questions regarding the use of single- vs. multi-omics, the choice of clustering strategy, the power of generic multi-view methods, and the use of approximated p-values for gauging solution quality. Given the rapidly increasing use of multi-omics data, these issues may be important for future progress in the field.
