Kernel Fusion Method for Detecting Cancer Subtypes via Selecting Relevant Expression Data

Recently, cancer has been characterized as a heterogeneous disease composed of many different subtypes. Early diagnosis of cancer subtypes is an important topic in cancer research and can be of tremendous help to patients after treatment. In this paper, we first extract a novel dataset containing gene expression, miRNA expression, and isoform expression for five cancers from The Cancer Genome Atlas (TCGA). Next, to limit the effect of noise among the 60,483 genes, we select a small number of genes using LASSO, applied to gene expression and patient survival time. Then, we construct one similarity kernel for each type of expression data using the Chebyshev distance. We use similarity kernel fusion (SKF) to fuse the three similarity matrices (gene, isoform, and miRNA) and finally cluster the fused similarity matrix with spectral clustering. In the experiments, our method achieves better P-values in the Cox model than other methods on 10 cancer datasets from the Jiang dataset and the novel dataset. We draw survival curves for the different cancers and find that some genes play a key role in cancer. For breast cancer, we find that HSPA2A, RNASE1, CLIC6, and IFITM1 are highly expressed in some specific groups. For lung cancer, we find that C4BPA, SESN3, and IRS1 are highly expressed in some specific groups. The code and all supporting data files are available from https://github.com/guofei-tju/Uncovering-Cancer-Subtypes-via-LASSO.
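The kernel construction, fusion, and clustering steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the mapping from Chebyshev distance to similarity (one minus the scaled distance) is an assumption, SKF is replaced by a plain element-wise average of the kernels as a stand-in, the small k-means routine is a simplified substitute for a library clusterer, and the LASSO gene-selection step is omitted. All function names (`chebyshev_kernel`, `fuse_average`, `cluster_patients`, and so on) are hypothetical.

```python
import numpy as np

def chebyshev_kernel(X):
    """Patient-by-patient similarity from Chebyshev (L-infinity) distances."""
    # D[i, j] = max over features of |X[i, f] - X[j, f]|
    D = np.abs(X[:, None, :] - X[None, :, :]).max(axis=2)
    scale = D.max()
    if scale == 0:
        return np.ones_like(D)
    return 1.0 - D / scale  # assumed distance-to-similarity mapping

def fuse_average(kernels):
    """Stand-in for SKF: element-wise mean of the per-view kernels."""
    return np.mean(kernels, axis=0)

def spectral_embedding(S, k):
    """Rows of the k eigenvectors with smallest eigenvalues of the normalized Laplacian."""
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    U = vecs[:, :k].copy()
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return U

def kmeans(U, k, n_iter=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def cluster_patients(views, k):
    """One kernel per expression view, fuse, then spectral clustering."""
    fused = fuse_average([chebyshev_kernel(X) for X in views])
    return kmeans(spectral_embedding(fused, k), k)

# Synthetic example: 20 patients in 2 groups, three views standing in for
# gene, isoform, and miRNA expression (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 10)
views = [groups[:, None] * 5.0 + rng.normal(0, 0.5, size=(20, d)) for d in (30, 20, 10)]
labels = cluster_patients(views, k=2)
print(labels)
```

Averaging keeps the sketch short; the fusion step in the paper is SKF, which would replace `fuse_average` here without changing the rest of the pipeline.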
