Accurate identification of pathways associated with cancer phenotypes (e.g., cancer sub-types and treatment outcome) could lead to the discovery of reliable prognostic and/or predictive biomarkers for better patient stratification and treatment guidance. In our previous work, we showed that non-negative matrix tri-factorization (NMTF) can be successfully applied to identify pathways associated with specific cancer types or disease classes as prognostic and predictive biomarkers. However, one key limitation of non-negative factorization methods, including various non-negative bi-factorization methods, is their inability to handle negative input data. For example, many molecular data consist of real values, both positive and negative (e.g., normalized/log-transformed gene expression data, where a negative value represents down-regulated expression of a gene), and are therefore not suitable input for these algorithms. In addition, most previous methods provide only a single point estimate and hence cannot deal with uncertainty effectively. To address these limitations, we propose a Bayesian semi-nonnegative matrix tri-factorization method to identify pathways associated with cancer phenotypes from a real-valued input matrix, e.g., gene expression values. Motivated by semi-nonnegative factorization, we allow one of the factor matrices, the centroid matrix, to be real-valued so that each centroid can express either the up- or down-regulation of the member genes in a pathway. In addition, we place structured spike-and-slab priors (which are encoded with the pathways and a gene-gene interaction (GGI) network) on the centroid matrix so that even a set of genes that is not initially contained in the pathways (due to the incompleteness of the current pathway database) can be involved in the factorization in a stochastic way, specifically if those genes are connected to the member genes of the pathways on the GGI network.
We also present update rules for the posterior distributions in the framework of variational inference. As a fully Bayesian method, our proposed method has several advantages over current NMTF methods, which we demonstrate in experiments on synthetic datasets. Using The Cancer Genome Atlas (TCGA) gastric cancer and metastatic gastric cancer immunotherapy clinical-trial datasets, we show that our method could identify biologically and clinically relevant pathways associated with the molecular sub-types and immunotherapy response, respectively. Finally, we show that the pathways identified by the proposed method could be used as prognostic biomarkers to stratify patients with distinct survival outcomes in two independent validation datasets. Additional information and code can be found at https://github.com/parks-cs-ccf/BayesianSNMTF.
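
To make the semi-nonnegative constraint concrete, the following is a minimal point-estimate sketch, not the Bayesian method proposed here: it has no spike-and-slab priors, no pathway or GGI information, and no variational inference. It assumes a factorization of the form X ≈ C S Vᵀ in which the centroid matrix C is real-valued while S and V are constrained to be non-negative. The function name `semi_nmtf`, the matrix shapes, and the optimization scheme (exact least squares for C, projected gradient steps for S and V) are illustrative assumptions and are not taken from the paper or its repository.

```python
import numpy as np


def semi_nmtf(X, k, r, n_iter=200, seed=0):
    """Point-estimate sketch of X ~= C @ S @ V.T (hypothetical helper).

    C (n x k) is real-valued; S (k x r) and V (m x r) are non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    C = rng.standard_normal((n, k))   # real-valued "centroid" factor
    S = rng.random((k, r))            # non-negative
    V = rng.random((m, r))            # non-negative
    eps = 1e-12                       # guards against division by zero
    losses = []
    for _ in range(n_iter):
        # C is unconstrained, so it is updated by ordinary least squares.
        B = S @ V.T                                   # k x m
        C = np.linalg.lstsq(B.T, X.T, rcond=None)[0].T

        # S: one projected gradient step with step size 1 / Lipschitz bound.
        L_S = np.linalg.norm(C.T @ C, 2) * np.linalg.norm(V.T @ V, 2) + eps
        grad_S = C.T @ (C @ S @ V.T - X) @ V
        S = np.maximum(S - grad_S / L_S, 0.0)

        # V: one projected gradient step, same idea.
        A = C @ S                                     # n x r
        L_V = np.linalg.norm(A.T @ A, 2) + eps
        grad_V = (A @ V.T - X).T @ A
        V = np.maximum(V - grad_V / L_V, 0.0)

        losses.append(0.5 * np.linalg.norm(X - C @ S @ V.T) ** 2)
    return C, S, V, losses


# Synthetic real-valued input with both positive and negative entries,
# standing in for normalized/log-transformed expression values.
rng = np.random.default_rng(1)
C_true = rng.standard_normal((30, 4))
S_true = rng.random((4, 3))
V_true = rng.random((20, 3))
X = C_true @ S_true @ V_true.T + 0.01 * rng.standard_normal((30, 20))

C, S, V, losses = semi_nmtf(X, k=4, r=3)
print("input has negative entries:", bool((X < 0).any()))
print("S and V non-negative:", bool(S.min() >= 0 and V.min() >= 0))
print("first and last loss:", losses[0], losses[-1])
```

Because only S and V are projected onto the non-negative orthant, the sketch accepts input with negative entries, which a fully non-negative factorization cannot do. It returns a single point estimate and therefore conveys no uncertainty, which is the gap the Bayesian formulation is meant to address.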