Bidimensional linked matrix factorization for pan-omics pan-cancer analysis

Several modern applications require the integration of multiple large data matrices that share rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer (pan-omics pan-cancer analysis) have extended our knowledge of molecular heterogeneity beyond what was observed in single-tumor and single-platform studies. However, these studies have been limited by the available statistical methodology. We propose BIDIFAC+, a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices. BIDIFAC+ decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). It builds on a growing literature on the factorization and decomposition of linked matrices, which has primarily focused on multiple matrices that are linked in only one dimension (rows or columns). Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives an identifiable decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across 4 different omics platforms and 29 different cancer types.
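To make the nuclear norm penalization mentioned above concrete, the following is a minimal sketch for a single matrix only; it is not the BIDIFAC+ algorithm, which handles many linked blocks. It relies on the standard result that minimizing 0.5 * ||X - A||_F^2 + lam * ||A||_* over A is solved by soft-thresholding the singular values of X. The function name `soft_threshold_svd`, the simulated data, and the penalty lam = sqrt(m) + sqrt(n) (a random-matrix-theory bound on the expected largest singular value of an m x n matrix of unit-variance Gaussian noise) are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def soft_threshold_svd(X, lam):
    """Return argmin_A 0.5*||X - A||_F^2 + lam*||A||_* (hypothetical helper)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s_shrunk = np.maximum(s - lam, 0.0)   # shrink each singular value toward zero
    return (U * s_shrunk) @ Vt            # rebuild the matrix from shrunken values

# Simulated example (assumption): rank-2 signal plus unit-variance Gaussian noise.
rng = np.random.default_rng(0)
m, n, r = 60, 40, 2
signal = 3.0 * rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
X = signal + rng.normal(size=(m, n))

# Assumed penalty: bound on the expected largest singular value of the noise.
lam = np.sqrt(m) + np.sqrt(n)

A_hat = soft_threshold_svd(X, lam)
est_rank = np.linalg.matrix_rank(A_hat)
print("estimated rank:", est_rank)
```

With a penalty of this size, singular values attributable to noise alone tend to be set to zero, so the estimate is low-rank without the rank being specified in advance.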
