Deep Learning-based Identification of Cancer or Normal Tissue using Gene Expression Data

Background: Deep learning has proven to show outstanding performance in resolving recognition and classification problems. As increasing amounts of cancer and normal gene expression data become publicly available, deep learning may become an integral component of efficiently finding specific patterns within massive datasets. Thus, we aim to address the extent to which the machine can learn to recognize cancer. We integrated cancer and normal tissue data from the Gene Expression Omnibus (GEO), The Cancer Gene Atlas (TCGA), Therapeutically Applicable Research To Generate Effective Treatments (TARGET), and Genotype-Tissue Expression (GTEx) databases, including 13,406 cancer and 12,842 normal gene expression data from 24 different tissues. We first trained the deep neural network (DNN) to discriminate between cancer and normal samples using various gene selection strategies and therapeutic target genes from commercial cancer panels and genes in NCI-curated cancer pathways. We also suggest systemic analyzation method to interpret trained deep neural network. We applied the method to find genes mostly contribute to classify cancer in an individual sample. Result: The best trained DNN could classify cancer and normal data with accuracy of 0.997 in the training data set of 13,123 (cancer: 6,703, normal: 6,402) samples. In the independent test set comprising 13,125 (cancer: 6,703, normal: 6,422) samples, the DNN model achieved 0.979 accuracy. Using the same training and test data, our DNN showed better performance than other conventional prediction methods, followed by the support vector machine approach. For interpretation, we propose a method that can extract a gene’s contribution to an individual sample’s cancer probability from the trained DNN. This method distinguished samples dependent on one or a few genes suggesting these samples are possibly}}{{\it “oncogene addicted”. Conclusion: A deep learning approach in conjunction with our interpretation method is not only a useful tool to identify cancer from gene expression data but can also contribute toward understanding the complex nature of cancer based on large public data.

[1]  Lana X. Garmire,et al.  Deep Learning based multi-omics integration robustly predicts survival in liver cancer , 2017, bioRxiv.

[2]  K. Tomczak,et al.  The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge , 2015, Contemporary oncology.

[3]  Alex E. Lash,et al.  Gene Expression Omnibus: NCBI gene expression and hybridization array data repository , 2002, Nucleic Acids Res..

[4]  Reza Ghaeini,et al.  A Deep Learning Approach for Cancer Detection and Relevant Gene Identification , 2017, PSB.

[5]  May D. Wang,et al.  Comparison of RNA-seq and microarray-based models for clinical endpoint prediction , 2015, Genome Biology.

[6]  Kenneth H. Buetow,et al.  PID: the Pathway Interaction Database , 2008, Nucleic Acids Res..

[7]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[8]  Jun S. Liu,et al.  The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans , 2015, Science.

[9]  Guang-Zhong Yang,et al.  Deep Learning for Health Informatics , 2017, IEEE Journal of Biomedical and Health Informatics.

[10]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[11]  Xiaolin Li,et al.  DeepCancer: Detecting Cancer via Deep Generative Learning Through Gene Expressions , 2016, 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech).

[12]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[13]  I. Weinstein Addiction to Oncogenes--the Achilles Heal of Cancer , 2002, Science.

[14]  R. Tibshirani,et al.  Diagnosis of multiple cancer types by shrunken centroids of gene expression , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[15]  Evgeny Putin,et al.  Deep biomarkers of human aging: Application of deep neural networks to biomarker development , 2016, Aging.

[16]  Wojciech Samek,et al.  Methods for interpreting and understanding deep neural networks , 2017, Digit. Signal Process..

[17]  A. Joe,et al.  Oncogene addiction. , 2008, Cancer research.

[18]  Alex M. Fichtenholtz,et al.  Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing , 2013, Nature Biotechnology.

[19]  Khalid Auhmani,et al.  Gene-expression-based cancer classification through feature selection with KNN and SVM classifiers , 2015, 2015 Intelligent Systems and Computer Vision (ISCV).

[20]  Pablo Guillen,et al.  Cancer Classification Based on Microarray Gene Expression Data Using Deep Learning , 2016, 2016 International Conference on Computational Science and Computational Intelligence (CSCI).