Improved prediction of smoking status via isoform-aware RNA-seq deep learning models

Most predictive models based on gene expression data do not leverage information about gene splicing, even though splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as the prediction target, we developed deep neural network models using gene-, exon-, and isoform-level quantifications from RNA sequencing data from 2,557 subjects in the COPDGene Study. Models using exon and isoform quantifications clearly outperformed gene-level models when restricted to data from the 5 genes of a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models that include an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic curve) of 0.88, which improved to 0.94 when using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
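The abstract does not describe how the exon-to-isoform mapping layer is built. One plausible reading, offered here only as a minimal sketch, is a masked dense layer in which each isoform unit receives input solely from the exons annotated to that isoform. The toy annotation, the function names (`exon_to_isoform_mask`, `masked_layer`, `auc`), the ReLU activation, and the random weights below are all illustrative assumptions, not the authors' implementation. The AUC helper uses the standard rank-based (Mann-Whitney) definition.

```python
import numpy as np

def exon_to_isoform_mask(isoform_exons, n_exons):
    """Binary (n_exons, n_isoforms) matrix: 1 where an exon belongs to an isoform."""
    mask = np.zeros((n_exons, len(isoform_exons)))
    for j, exons in enumerate(isoform_exons):
        mask[list(exons), j] = 1.0
    return mask

def masked_layer(x, weights, mask, bias):
    """Forward pass of a masked dense layer with ReLU (activation is an assumption).

    Multiplying the weights by the mask zeroes every exon-to-isoform
    connection that the annotation does not support.
    """
    return np.maximum(0.0, x @ (weights * mask) + bias)

def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Hypothetical toy annotation: 4 exons, 2 isoforms of one gene.
isoform_exons = [(0, 1, 2), (0, 3)]
mask = exon_to_isoform_mask(isoform_exons, n_exons=4)

rng = np.random.default_rng(0)
weights = rng.normal(size=mask.shape)   # untrained weights, for illustration only
bias = np.zeros(mask.shape[1])
x = rng.normal(size=(5, 4))             # 5 subjects x 4 exon quantifications

isoform_activations = masked_layer(x, weights, mask, bias)  # shape (5, 2)
```

In a full model, the isoform activations would feed further layers ending in a smoking-status output, and the weights would be learned by gradient descent with the mask held fixed. The mask encodes annotation knowledge directly into the architecture, which is one way a model could exploit isoform structure without observing isoform quantifications directly.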
