Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data

Precision medicine in oncology aims to gather data from heterogeneous sources to obtain a precise estimate of a given patient’s state and prognosis. Within a personalized medicine framework, accurate diagnoses allow the prescription of more effective treatments adapted to the specificities of each individual case. In recent years, next-generation sequencing has propelled cancer research by providing physicians with an overwhelming amount of gene-expression data from high-throughput RNA-seq platforms. In this scenario, data mining and machine learning techniques have contributed widely to gene-expression data analysis by supplying computational models that support decision-making on real-world data. Nevertheless, existing public gene-expression databases are characterized by an unfavorable imbalance between the huge number of genes (on the order of tens of thousands) and the small number of samples (on the order of a few hundred) available. Although diverse feature selection and extraction strategies have traditionally been applied to overcome the resulting over-fitting issues, the efficacy of standard machine learning pipelines is far from satisfactory for predicting relevant clinical outcomes such as follow-up end-points or patient survival. In this study, we use the public Pan-Cancer dataset to pre-train convolutional neural network architectures for survival prediction on a subset composed of thousands of gene-expression samples from thirty-one tumor types. The resulting architectures are subsequently fine-tuned to predict lung cancer progression-free interval. The application of convolutional networks to gene-expression data has many limitations, which derive from the unstructured nature of these data. In this work, we propose a methodology to rearrange RNA-seq data by transforming RNA-seq samples into gene-expression images, from which convolutional networks can extract high-level features.
As an additional objective, we investigate whether leveraging the information extracted from samples of other tumor types helps extract high-level features that improve lung cancer progression prediction compared to other machine learning approaches.
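
To make the idea of a gene-expression image concrete, the following is a minimal sketch of how a single RNA-seq sample could be laid out as a 2-D array that a convolutional network accepts as input. It is an illustration under stated assumptions, not the methodology proposed in this work: the abstract does not specify how genes are mapped to pixels, so the sketch simply places genes in row-major order in the order given, which carries no biological structure. The function name `expression_to_image`, the per-sample min-max scaling, the zero padding, and the log2 transform in the usage lines are all hypothetical choices made for the example.

```python
import math

import numpy as np


def expression_to_image(sample, side=None):
    """Lay out a 1-D expression vector as a square 2-D image.

    Hypothetical helper: genes are placed in row-major order as given,
    which is an assumption of this sketch and not the paper's layout.
    """
    values = np.asarray(sample, dtype=np.float32)
    if values.ndim != 1 or values.size == 0:
        raise ValueError("expected a non-empty 1-D vector of expression values")

    # Smallest square that holds every gene: side = ceil(sqrt(n_genes)).
    if side is None:
        side = math.isqrt(values.size - 1) + 1
    if side * side < values.size:
        raise ValueError("image is too small to hold all genes")

    # Assumed preprocessing: per-sample min-max scaling to [0, 1].
    lo, hi = values.min(), values.max()
    scaled = (values - lo) / (hi - lo) if hi > lo else np.zeros_like(values)

    # Pixels beyond the last gene are left at zero (padding).
    image = np.zeros(side * side, dtype=np.float32)
    image[: values.size] = scaled
    return image.reshape(side, side)


# Synthetic stand-in for expression counts: 3 samples, 10 genes each.
rng = np.random.default_rng(0)
batch = rng.gamma(shape=2.0, scale=50.0, size=(3, 10))

# log2(x + 1) is a common transform for expression counts (assumed here).
images = np.stack([expression_to_image(np.log2(s + 1.0)) for s in batch])
print(images.shape)  # (3, 4, 4)
```

With an arbitrary gene order, neighboring pixels are unrelated, so convolutional filters have little local structure to exploit; the choice of where each gene is placed in the image is what any concrete rearrangement methodology has to settle.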
