DeepIsoFun: a deep domain adaptation approach to predict isoform functions

Motivation: Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from desirable, primarily because of the lack of labeled training data. To improve performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to the different expression distributions associated with different gene ontology terms.

Results: We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristic curve (AUC), our method achieved an improvement of at least 26%, and in terms of area under the precision-recall curve (AUPRC), an improvement of at least 10%, over the state-of-the-art methods. In addition, we studied the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions.

Availability: https://github.com/dls03/DeepIsoFun/
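As an illustration of the multiple instance learning setting and the AUC metric mentioned in the abstract, the following is a minimal sketch and not the DeepIsoFun implementation. It assumes the standard multiple instance assumption (a gene, treated as a bag, has a function if at least one of its isoforms, the instances, has it), which the abstract does not spell out. The function names, gene names, scores, and labels are all hypothetical toy values.

```python
from typing import Dict, List


def gene_scores(isoform_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate isoform (instance) scores into one gene (bag) score.

    Assumption: max pooling, i.e. a gene is as likely to carry a GO term
    as its highest-scoring isoform.
    """
    return {gene: max(scores) for gene, scores in isoform_scores.items()}


def auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predicted isoform scores for a single GO term.
toy_isoform_scores = {
    "geneA": [0.9, 0.2],
    "geneB": [0.1, 0.3, 0.4],
    "geneC": [0.8],
    "geneD": [0.05, 0.6],
}
# Hypothetical gene-level annotations (1 = gene annotated with the term).
toy_gene_labels = {"geneA": 1, "geneB": 0, "geneC": 1, "geneD": 0}

bag = gene_scores(toy_isoform_scores)
genes = sorted(bag)
toy_auc = auc([toy_gene_labels[g] for g in genes], [bag[g] for g in genes])
```

Only gene-level labels are needed to score the aggregated predictions, which is why this setting suits the problem: isoform-level annotations are scarce, while gene-level annotations are available.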
