LncRNAnet: long non‐coding RNA identification using deep learning

Motivation: Long non‐coding RNAs (lncRNAs) are important regulatory elements in biological processes. LncRNAs share sequence characteristics with messenger RNAs but play entirely different roles, which makes them a source of novel insights for biological studies. The development of next‐generation sequencing has aided the discovery of lncRNA transcripts. However, experimentally verifying the large number of transcripts in a transcriptome is time consuming and costly. A computational approach is therefore needed to distinguish lncRNAs from other transcripts.

Results: We present lncRNAnet, a deep learning‐based approach to lncRNA identification. It incorporates recurrent neural networks for RNA sequence modeling and convolutional neural networks that detect stop codons to obtain an open reading frame indicator. lncRNAnet performed clearly better than the other tools on short sequences, the length range into which most lncRNAs fall. In addition, lncRNAnet successfully learned features and showed improvements of 7.83%, 5.76%, 5.30% and 3.78% over the alternatives on a human test set in terms of specificity, accuracy, F1‐score and area under the curve, respectively.

Availability and implementation: Data and code are available at http://data.snu.ac.kr/pub/lncRNAnet.
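The abstract mentions an open reading frame (ORF) indicator derived from stop codon detection. As a rough illustration of what such an indicator could look like, the sketch below marks the positions covered by the longest ATG-to-stop ORF on the forward strand. It is a rule-based stand-in and not the learned convolutional detector used in lncRNAnet; the function names `longest_orf` and `orf_indicator`, the choice of the longest ORF, and the per-position 0/1 encoding are all assumptions made for illustration.

```python
# Illustrative, rule-based ORF indicator (NOT the lncRNAnet CNN).
# Assumptions: forward strand only, ORF = ATG ... first in-frame stop codon,
# and the longest such ORF is the one that gets marked.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def longest_orf(seq):
    """Return (start, end) of the longest complete ORF, or None.

    `end` is exclusive and includes the stop codon. RNA input (U) is
    converted to DNA letters (T) before scanning.
    """
    seq = seq.upper().replace("U", "T")
    best = None
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


def orf_indicator(seq):
    """Return a 0/1 list with 1 at every position inside the longest ORF."""
    indicator = [0] * len(seq)
    span = longest_orf(seq)
    if span is not None:
        for i in range(span[0], span[1]):
            indicator[i] = 1
    return indicator


if __name__ == "__main__":
    example = "CCATGAAATAGGG"
    print(longest_orf(example))
    print(orf_indicator(example))
```

In a model of the kind the abstract describes, a vector like this could be supplied alongside the nucleotide encoding as an extra input channel; how lncRNAnet actually combines the two is not stated in the abstract.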
