Convolutional neural network models for cancer type prediction based on gene expression

Background
Precise prediction of cancer types is vital for cancer diagnosis and therapy. Through a predictive model, important cancer marker genes can also be inferred. Several studies have attempted to build machine learning models for this task; however, none has taken into consideration the effects of tissue of origin, which can potentially bias the identification of cancer markers.

Results
In this paper, we introduce several Convolutional Neural Network (CNN) models that take unstructured gene expression inputs and classify tumor and non-tumor samples into their designated cancer types or as normal. Based on different designs of gene embeddings and convolution schemes, we implemented three CNN models: 1D-CNN, 2D-Vanilla-CNN, and 2D-Hybrid-CNN. The models were trained and tested on gene expression profiles from a combined set of 10,340 samples of 33 cancer types and 713 matched normal tissues from The Cancer Genome Atlas (TCGA). Our models achieved prediction accuracies of 93.9–95.0% among 34 classes (33 cancers and normal). Furthermore, we interpreted one of the models, the 1D-CNN model, with a guided saliency technique and identified a total of 2090 cancer markers (108 per class on average). We confirmed the concordance of differential expression of these markers between the cancer type they represent and the other cancer types. In breast cancer, for instance, our model identified well-known markers such as GATA3 and ESR1. Finally, we extended the 1D-CNN model to the prediction of breast cancer subtypes and achieved an average accuracy of 88.42% among 5 subtypes. The code is available at https://github.com/chenlabgccri/CancerTypePrediction.

Conclusions
Here we present novel CNN designs for accurate and simultaneous prediction of cancer/normal status and cancer type based on gene expression profiles, together with a unique model interpretation scheme that elucidates the biological relevance of cancer marker genes after eliminating the effects of tissue of origin. The proposed model is light in hyperparameters and thus can be easily adapted to facilitate cancer diagnosis in the future.
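To make the two ideas in the abstract concrete (a 1D convolution over an unstructured gene expression vector, and a saliency score that ranks genes by their influence on the predicted class), the following is a minimal toy sketch in plain Python. It is not the authors' implementation: the layer sizes, the random weights, the choice of stride equal to kernel size, and the finite-difference gradient are all assumptions made for illustration, and the finite-difference estimate stands in for the guided saliency technique, which the abstract does not spell out.

```python
import math
import random

def conv1d(x, kernel, stride):
    """Valid 1D cross-correlation followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(0, len(x) - k + 1, stride)]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, kernels, dense, stride):
    """Toy 1D-CNN: conv + ReLU per kernel, flatten, dense layer, softmax."""
    feats = [f for k in kernels for f in conv1d(x, k, stride)]
    logits = [sum(w * f for w, f in zip(row, feats)) for row in dense]
    return softmax(logits)

def saliency(x, cls, kernels, dense, stride, eps=1e-4):
    """Central-difference estimate of |d p(cls) / d x_g| for each gene g."""
    scores = []
    for g in range(len(x)):
        up = x[:g] + [x[g] + eps] + x[g + 1:]
        down = x[:g] + [x[g] - eps] + x[g + 1:]
        grad = (forward(up, kernels, dense, stride)[cls]
                - forward(down, kernels, dense, stride)[cls]) / (2 * eps)
        scores.append(abs(grad))
    return scores

# Hypothetical toy dimensions; the real models use thousands of genes
# and 34 output classes.
rng = random.Random(0)
n_genes, n_classes, k_size, n_kernels = 12, 3, 4, 2
stride = k_size  # assumption: non-overlapping windows
kernels = [[rng.uniform(-1, 1) for _ in range(k_size)] for _ in range(n_kernels)]
n_feats = n_kernels * ((n_genes - k_size) // stride + 1)
dense = [[rng.uniform(-1, 1) for _ in range(n_feats)] for _ in range(n_classes)]
x = [rng.uniform(0, 5) for _ in range(n_genes)]  # made-up expression values

probs = forward(x, kernels, dense, stride)
pred = probs.index(max(probs))
scores = saliency(x, pred, kernels, dense, stride)
# Indices of the three genes with the largest saliency for the predicted class
top_genes = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:3]
```

In a trained network the gradient would normally come from automatic differentiation rather than finite differences, and marker genes would be chosen by aggregating saliency over many samples of a class rather than from a single profile.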
