Integrative Analysis of Multi-Omics Data Based on Blockwise Sparse Principal Components

Recent developments in high-throughput technology have made it possible to accumulate vast amounts of multi-omics data. Because even a single omics dataset contains a large number of variables, integrated analysis of multi-omics data suffers from problems such as computational instability and variable redundancy. Most multi-omics analyses repeatedly apply a single supervised analysis for dimension reduction and variable selection. However, these approaches cannot avoid redundancy and collinearity among variables. In this study, we propose a novel approach based on blockwise component analysis, which addresses the limitations of current methods by combining variable clustering with sparse principal component (sPC) analysis. Our approach consists of two stages. The first stage identifies homogeneous variable blocks within each omics dataset and then extracts sPCs from those blocks. The second stage merges the sPCs from all omics datasets and then constructs a prediction model. We also propose a graphical method that displays the results of sparse PCA and model fitting simultaneously. We applied the proposed methodology to glioblastoma multiforme data from The Cancer Genome Atlas. Comparison with existing approaches showed that the proposed methodology is easier to interpret and achieves comparable predictive power with a much smaller number of variables.
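The two-stage workflow described above can be illustrated with a minimal Python sketch on simulated data. It is not the authors' implementation. Every specific choice here is an assumption made for illustration: the greedy correlation-threshold clustering in `correlation_blocks`, the soft-thresholded power iteration in `first_sparse_pc`, the ordinary least-squares model in the second stage, and all function names, thresholds, and penalties. The abstract does not specify the clustering algorithm, the sparsity penalty, or the type of prediction model.

```python
import numpy as np


def correlation_blocks(X, threshold=0.5):
    """Greedily group columns whose |correlation| with a seed column is >= threshold.

    Assumption: a simple stand-in for the variable clustering step.
    """
    corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
    unassigned = list(range(X.shape[1]))
    blocks = []
    while unassigned:
        seed = unassigned[0]
        # The seed always joins its own block, so the loop always terminates.
        block = [seed] + [j for j in unassigned[1:] if corr[seed, j] >= threshold]
        blocks.append(block)
        unassigned = [j for j in unassigned if j not in block]
    return blocks


def first_sparse_pc(X, penalty=0.2, n_iter=50):
    """Return (loadings, scores) of one sparse PC via soft-thresholded power iteration.

    Assumption: `penalty` in [0, 1) is a relative soft threshold, chosen for illustration.
    """
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)
    v = np.linalg.eigh(cov)[1][:, -1]  # start from the dense leading eigenvector
    for _ in range(n_iter):
        w = cov @ v
        cutoff = penalty * np.abs(w).max()
        w = np.sign(w) * np.maximum(np.abs(w) - cutoff, 0.0)  # shrink small loadings to zero
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break
        v = w / norm
    return v, Xc @ v


def blockwise_spc_scores(X, threshold=0.5, penalty=0.2):
    """Stage 1 for one omics dataset: find blocks, then one sparse PC score per block."""
    blocks = correlation_blocks(X, threshold)
    scores = [first_sparse_pc(X[:, block], penalty)[1] for block in blocks]
    return blocks, np.column_stack(scores)


# Simulated data: each latent factor drives one block of noisy, correlated variables.
rng = np.random.default_rng(0)
n = 100


def toy_omics(n_blocks, per_block):
    latent = rng.normal(size=(n, n_blocks))
    noise = 0.3 * rng.normal(size=(n, n_blocks * per_block))
    return latent, np.repeat(latent, per_block, axis=1) + noise


latent_a, expr = toy_omics(n_blocks=2, per_block=3)  # hypothetical "expression" data
latent_b, meth = toy_omics(n_blocks=3, per_block=2)  # hypothetical "methylation" data
y = latent_a[:, 0] - latent_b[:, 1] + 0.5 * rng.normal(size=n)

# Stage 1: blockwise sparse PCs within each omics dataset.
blocks_expr, scores_expr = blockwise_spc_scores(expr)
blocks_meth, scores_meth = blockwise_spc_scores(meth)

# Stage 2: merge the sPC scores and fit a prediction model (OLS as a stand-in).
merged = np.column_stack([scores_expr, scores_meth])
design = np.column_stack([np.ones(n), merged])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("blocks per dataset:", len(blocks_expr), len(blocks_meth))
print("model coefficients:", np.round(coef, 2))
```

In this sketch each block contributes a single component, so the final model uses one predictor per block rather than one per original variable, which is the sense in which a blockwise approach can reduce the number of variables entering the prediction model.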
