Multi-omics facilitated variable selection in Cox-regression model for cancer prognosis prediction.

MOTIVATION New developments in high-throughput genomic technologies have enabled the measurement of diverse types of omics biomarkers in a cost-efficient and clinically feasible manner. Developing computational methods and tools to analyze such genomic data and translate it into clinically relevant information is an active area of investigation. For example, several studies have used an unsupervised learning framework to cluster patients by integrating omics data. Despite these advances, predicting cancer prognosis from integrated omics biomarkers remains a challenge, and there is a shortage of computational tools that do so with supervised learning methods. The current standard approach is to fit a Cox regression model on a linear concatenation of the different omics data types, optionally adding a penalty for feature selection. A more powerful approach, however, would be to integrate the data while accounting for relationships among omics data types.

METHODS We developed two methods, SKI-Cox and wLASSO-Cox, that incorporate the associations among different types of omics data. Both methods fit the Cox proportional hazards model and predict a risk score based on mRNA expression profiles. SKI-Cox borrows the information generated by additional types of omics data to guide variable selection, while wLASSO-Cox incorporates this information as a penalty factor during model fitting.

RESULTS In simulation studies, SKI-Cox and wLASSO-Cox select more true variables than a LASSO-Cox model. We assess the performance of both methods using TCGA glioblastoma multiforme and lung adenocarcinoma data. In each case, mRNA expression, methylation, and copy number variation data are integrated to predict the overall survival time of cancer patients. Our methods achieve better performance in predicting patient survival in both glioblastoma and lung adenocarcinoma.
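
The idea of using prior information as a penalty factor can be illustrated with a minimal Python sketch. This is not the authors' implementation: the solver (proximal gradient descent with backtracking), the function names `simulate_cox_data` and `fit_weighted_lasso_cox`, and the `weights` vector are all assumptions made for illustration. The sketch assumes survival times have no ties, that survival times are simulated from an exponential baseline hazard, and that the per-gene weights have already been derived from other omics data (a smaller weight means a gene is penalized less).

```python
import numpy as np

def simulate_cox_data(n, beta, rng, censor_rate=0.3):
    """Simulate survival data under a Cox model with a unit exponential
    baseline hazard: T = E / exp(x @ beta), with E ~ Exp(1)."""
    X = rng.standard_normal((n, len(beta)))
    t_event = rng.exponential(size=n) / np.exp(X @ beta)
    t_censor = rng.exponential(1.0 / censor_rate, size=n)
    time = np.minimum(t_event, t_censor)
    event = (t_event <= t_censor).astype(float)  # 1 = death observed
    return X, time, event

def _loss_grad(Xs, d, beta):
    """Negative log partial likelihood (assumes no tied times) and gradient.
    Rows must be sorted by decreasing time so that a running sum over rows
    gives the risk-set sum for each subject."""
    eta = Xs @ beta
    w = np.exp(eta - eta.max())          # shifted for numerical stability
    S = np.cumsum(w)                     # risk-set sums (same shift)
    loss = -np.sum(d * (eta - eta.max() - np.log(S))) / len(d)
    back = np.cumsum((d / S)[::-1])[::-1]  # sum of d_i / S_i over later rows
    grad = -Xs.T @ (d - w * back) / len(d)
    return loss, grad

def fit_weighted_lasso_cox(X, time, event, lam, weights, n_iter=500):
    """Minimize loss(beta) + lam * sum_j weights[j] * |beta[j]|.
    `weights` is a hypothetical per-feature penalty factor."""
    order = np.argsort(-np.asarray(time, float))
    Xs, d = X[order], np.asarray(event, float)[order]
    thresh = lam * np.asarray(weights, float)
    beta, step = np.zeros(X.shape[1]), 1.0
    for _ in range(n_iter):
        loss, grad = _loss_grad(Xs, d, beta)
        for _ in range(60):  # backtracking line search on the step size
            z = beta - step * grad
            # Soft-thresholding: each feature has its own threshold.
            z = np.sign(z) * np.maximum(np.abs(z) - step * thresh, 0.0)
            diff = z - beta
            bound = loss + grad @ diff + diff @ diff / (2 * step)
            if _loss_grad(Xs, d, z)[0] <= bound:
                break
            step *= 0.5
        beta = z
    return beta
```

Setting every entry of `weights` to 1 gives an ordinary LASSO-Cox fit under the same assumptions, so the weight vector is the only place where information from the other omics data types enters this sketch. The risk score for a new patient would then be `x @ beta` for that patient's expression vector `x`.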
