Classification of Kidney Cancer Data Using Cost-Sensitive Hybrid Deep Learning Approach

Advanced biotechnology methods now generate large-scale bioinformatics and genomic data, which increases the importance of analyzing such data, and numerous data mining methods have been developed to process genomic data in bioinformatics. Using gene expression data from 1157 patients with kidney cancer, we extracted genes that are significant for prognosis prediction. We then propose an end-to-end, cost-sensitive hybrid deep learning (COST-HDL) approach that uses a cost-sensitive loss function for classification tasks on imbalanced kidney cancer data. The approach combines two components: a deep symmetric autoencoder (the decoder mirrors the encoder in layer structure) trained with a reconstruction loss for non-linear feature extraction, and a neural network trained with a balanced classification loss for prognosis prediction, which addresses the data imbalance problem. Clinical data from patients with kidney cancer were combined with the gene data to determine the optimal classification model and to estimate classification accuracy for four risk factors that represent the state of a patient: sample type, primary diagnosis, tumor stage, and vital status. Experimental results showed that, on gene expression data for kidney cancer prognosis, the COST-HDL approach was more efficient than conventional machine learning and data mining techniques. These results could be applied to extract features from gene biomarkers for the prognosis prediction, prevention, and early diagnosis of kidney cancer.
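
The abstract does not give the exact form of the loss, so the following is only a minimal sketch of how a reconstruction loss and a class-balanced classification loss might be combined into one objective. Everything specific here is an assumption for illustration: the inverse-frequency class weights, the mixing coefficient `alpha`, the single-layer encoder and decoder, the layer sizes, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def class_weights(labels, n_classes):
    """Inverse-frequency weights: n_samples / (n_classes * count_c). (Assumed scheme.)"""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0  # avoid division by zero for absent classes
    return labels.size / (n_classes * counts)


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def hybrid_loss(x, x_hat, logits, labels, weights, alpha=1.0):
    """Return (total, reconstruction, balanced cross-entropy).

    total = MSE(x, x_hat) + alpha * weighted cross-entropy, where each
    sample's cross-entropy is weighted by the weight of its true class.
    """
    recon = np.mean((x - x_hat) ** 2)
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(labels.size), labels] + 1e-12)
    w = weights[labels]
    balanced_ce = np.sum(w * nll) / np.sum(w)
    return recon + alpha * balanced_ce, recon, balanced_ce


# Toy data: 100 samples, 20 "genes", imbalanced 90/10 binary labels.
rng = np.random.default_rng(0)
n, d, h, k = 100, 20, 5, 2
x = rng.normal(size=(n, d))
labels = np.array([0] * 90 + [1] * 10)

# Symmetric autoencoder in the simplest sense: the decoder shape mirrors the encoder.
W_enc = rng.normal(scale=0.1, size=(d, h))
W_dec = rng.normal(scale=0.1, size=(h, d))
W_clf = rng.normal(scale=0.1, size=(h, k))

z = np.tanh(x @ W_enc)   # non-linear features from the encoder
x_hat = z @ W_dec        # reconstruction from the decoder
logits = z @ W_clf       # classifier head on the encoded features

weights = class_weights(labels, k)
total, recon, ce = hybrid_loss(x, x_hat, logits, labels, weights)
print(weights, total, recon, ce)
```

With a 90/10 split, the minority class receives a weight nine times that of the majority class, so misclassified minority samples contribute more to the objective. In an end-to-end setting, both terms would be minimized jointly with respect to the encoder, decoder, and classifier parameters.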
