Min-Redundancy and Max-Relevance Multi-view Feature Selection for Predicting Ovarian Cancer Survival using Multi-omics Data

Background: Large-scale collaborative precision medicine initiatives (e.g., The Cancer Genome Atlas (TCGA)) are yielding rich multi-omics data. Integrative analyses of these data, which include somatic mutation, copy number alteration (CNA), DNA methylation, miRNA, gene expression, and protein expression, could help realize the potential of precision medicine in cancer prevention, diagnosis, and treatment by improving our understanding of the underlying mechanisms and by enabling the discovery of novel biomarkers for different types of cancer. However, such analyses present a number of challenges, including the heterogeneity of the data types and the extremely high dimensionality of omics data.

Methods: In this study, we propose a novel framework for integrating multi-omics data based on multi-view feature selection, an emerging problem in machine learning research. We also present a novel multi-view feature selection algorithm, MRMR-mv, which adapts the well-known Min-Redundancy and Max-Relevance (MRMR) single-view feature selection algorithm to the multi-view setting.

Results: We report results of experiments on the task of building a predictive model of cancer survival from an ovarian cancer multi-omics dataset derived from the TCGA database. Our results suggest that multi-view models for predicting ovarian cancer survival outperform both view-specific models (i.e., models trained and tested using a single omics data source) and models based on two baseline data fusion methods.

Conclusions: Our results demonstrate the potential of multi-view feature selection in integrative analyses and predictive modeling from multi-omics data.
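For readers unfamiliar with the single-view MRMR criterion that MRMR-mv adapts, the following is a minimal sketch of greedy MRMR selection on discrete features. It is not the MRMR-mv algorithm, whose multi-view adaptation is not specified in this abstract. The function names (`mutual_information`, `mrmr_select`), the difference form of the criterion (relevance minus mean redundancy), and the toy feature names and values are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_select(columns, labels, k):
    """Greedy single-view MRMR: at each step pick the feature that maximizes
    relevance I(f; labels) minus mean redundancy I(f; s) over selected s."""
    relevance = {name: mutual_information(col, labels) for name, col in columns.items()}
    selected = []
    remaining = sorted(columns)  # sorted for deterministic tie-breaking

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: binary survival labels and three discretized features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "expr_a": [0, 0, 0, 1, 1, 1, 1, 1],      # most relevant feature
    "expr_a_dup": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate, fully redundant
    "meth_b": [1, 1, 0, 0, 1, 1, 1, 1],      # less relevant, but complementary
}
selected = mrmr_select(columns, labels, k=2)
print(selected)
```

In this toy example the duplicate feature is as relevant as `expr_a` but is penalized for redundancy once `expr_a` has been chosen, so the less relevant but complementary `meth_b` is picked second.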
