DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning

Abstract

Motivation: Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: (i) unlike gene-level annotations, isoform-level functional annotations are scarce; and (ii) information about isoform functions is concealed in various types of data, including isoform sequences and co-expression relationships among isoforms, among others.

Results: In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate these types of data, our approach adopts a hybrid framework: a deep neural network (DNN) first predicts the functions of isoforms from their genomic sequences, and a conditional random field (CRF) then refines the predictions based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm that trains the DNN and the CRF together. Our extensive computational experiments demonstrate that DIFFUSE can effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristic curve of 0.840 and an average area under the precision–recall curve of 0.581 over 4184 GO functional categories, both significantly higher than those of state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences.

Availability and implementation: https://github.com/haochenucr/DIFFUSE.

Supplementary information: Supplementary data are available at Bioinformatics online.
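
To make the two-stage idea in the abstract concrete, the following is a minimal sketch of refining per-isoform scores with a co-expression graph. It is not the DIFFUSE implementation: the function name `refine_with_coexpression`, the `weight` parameter, the toy scores and the toy co-expression matrix are all hypothetical, and the update rule is a generic mean-field-style step for a binary pairwise model, assumed here only for illustration. The unary scores stand in for the output of a sequence-based DNN for a single GO term.

```python
import math


def logit(p, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def refine_with_coexpression(unary_probs, coexpr, weight=1.0, n_iters=10):
    """Refine per-isoform probabilities for one GO term (illustrative only).

    unary_probs: initial probabilities, e.g. from a sequence-based model.
    coexpr:      symmetric, non-negative co-expression similarity matrix.
    weight:      strength of the pairwise term (hypothetical hyperparameter).

    Each synchronous update pulls an isoform's belief toward the beliefs of
    the isoforms it is co-expressed with, while keeping its own unary score.
    """
    unary = [logit(p) for p in unary_probs]
    q = list(unary_probs)
    n = len(q)
    for _ in range(n_iters):
        q = [
            sigmoid(
                unary[i]
                + weight * sum(
                    coexpr[i][j] * (2.0 * q[j] - 1.0)
                    for j in range(n)
                    if j != i
                )
            )
            for i in range(n)
        ]
    return q


if __name__ == "__main__":
    # Toy data (made up): isoforms 0 and 1 are co-expressed, isoform 2 is isolated.
    scores = [0.9, 0.5, 0.2]
    graph = [
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    print(refine_with_coexpression(scores, graph))
```

In this toy example the uncertain isoform 1 is pulled toward its confidently positive neighbour, while isoform 2, which has no co-expression partners, keeps its original score. The paper additionally trains the DNN and the CRF jointly with a semi-supervised procedure, which this sketch does not attempt to reproduce.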
