Deep Learning Models Based on Distributed Feature Representations for Alternative Splicing Prediction

Alternative splicing (AS) is a fundamental step in mRNA maturation and gene expression. Advances in RNA sequencing technologies have shed light on the role of AS in increasing protein isoform diversity. AS is recognized to be involved in the regulation of both physiological and pathological functions, which makes it an essential part of the study of gene regulation, development, and disease. With recent advances in machine learning, there is growing interest in developing accurate deep learning based computational models for AS prediction. In this paper, we propose convolutional neural network and multilayer perceptron models that tackle AS prediction as both a classification task and a regression task. These models use feature representations learned from genomic data and cellular context. Unlike previous works, which rely on hand-crafted feature extraction, we propose an automatic feature learning approach that avoids explicit, predefined feature extraction. The approach adapts two widely used natural language processing techniques, namely word2vec and doc2vec. To understand the effects of the different representation learning techniques, we conducted a range of experiments predicting AS from cassette exons and cell type. Overall, experimental results on a data set covering five tissues show that learning features from the genome sequence significantly improves AS outcome prediction in both the classification and regression tasks.
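
As a rough illustration of how word2vec-style techniques can be adapted to genomic data, the sketch below turns a nucleotide sequence into overlapping k-mer "words" and builds the (center, context) pairs that a skip-gram model would train on. The abstract does not state how sequences are tokenized, so the k-mer scheme, the function names `kmer_tokens` and `skipgram_pairs`, the values of `k` and `window`, and the toy sequence are all assumptions made for illustration only.

```python
def kmer_tokens(sequence, k=3, stride=1):
    """Split a nucleotide sequence into overlapping k-mer 'words'.

    Assumption: k-mer tokenization is a common convention for applying
    word2vec to DNA; the abstract does not confirm this exact scheme.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs as used by skip-gram word2vec training."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sequence, not taken from the paper's data.
exon = "ATGGTAAGT"
tokens = kmer_tokens(exon, k=3)
pairs = skipgram_pairs(tokens, window=1)

print(tokens)
print(pairs[:4])
```

In a full pipeline, token lists like these would be passed to a word2vec or doc2vec implementation to learn the embeddings, which would then serve as input features for the convolutional and multilayer perceptron models.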
