A deep learning method for lincRNA detection using auto-encoder algorithm

Background: RNA sequencing (RNA-seq) enables scientists to develop novel data-driven methods for discovering previously unidentified lincRNAs. Meanwhile, knowledge-based technologies are undergoing a potential revolution ignited by new deep learning methods. By scanning newly acquired RNA-seq data sets, scientists have found that (1) the expression of lincRNAs appears to be regulated, that is, relevance exists along the DNA sequences; and (2) lincRNAs contain conserved patterns/motifs tethered together by non-conserved regions. These two observations provide the rationale for adopting knowledge-based deep learning methods in lincRNA detection. Similar to coding region transcription, non-coding regions are split at transcriptional sites. However, regulatory RNAs rather than messenger RNAs are generated; that is, the transcribed RNAs participate in biological processes as regulatory units instead of generating proteins. Identifying these transcriptional regions within non-coding regions is the first step towards lincRNA recognition.

Results: The auto-encoder method achieves 100% and 92.4% prediction accuracy on transcription sites over the putative data sets. The experimental results also show that the predictive deep neural network performs well on the lincRNA data sets compared with a support vector machine and a traditional neural network. In addition, the method is validated on the newly discovered lincRNA data set, and one unreported transcription site is found by feeding the whole annotated sequences through the deep learning machine, which indicates that the deep learning method has broad capability for lincRNA prediction.

Conclusions: The transcriptional sequences of lincRNAs are collected from the annotated human DNA genome data. Subsequently, a two-layer deep neural network is developed for lincRNA detection, which adopts the auto-encoder algorithm and uses different encoding schemes to obtain the best performance over intergenic DNA sequence data. Driven by the newly annotated lincRNA data, deep learning methods based on the auto-encoder algorithm can exert their capability in knowledge learning to capture the useful features and the information correlation along DNA genome sequences for lincRNA detection. To our knowledge, this is the first application of deep learning techniques to identifying lincRNA transcription sequences.
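
To make the general idea of encoding DNA sequences and learning features with an auto-encoder concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-hot encoding (only one of many possible encoding schemes), a single hidden layer, and randomly generated toy sequences; the function names `one_hot` and `train_autoencoder` and all hyperparameters are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat one-hot vector (4 values per base)."""
    idx = [BASES.index(b) for b in seq.upper()]
    out = np.zeros((len(idx), 4))
    out[np.arange(len(idx)), idx] = 1.0
    return out.ravel()


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_autoencoder(X, hidden=8, epochs=1000, lr=0.2, seed=0):
    """Train a one-hidden-layer auto-encoder with plain gradient descent.

    Returns the learned parameters and the reconstruction MSE per epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)   # encoder: input -> hidden code
        R = sigmoid(H @ W2 + b2)   # decoder: hidden code -> reconstruction
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared reconstruction error (scaled by 1/n).
        dR = 2.0 * err * R * (1.0 - R) / n
        dH = (dR @ W2.T) * H * (1.0 - H)
        W2 -= lr * (H.T @ dR)
        b2 -= lr * dR.sum(axis=0)
        W1 -= lr * (X.T @ dH)
        b1 -= lr * dH.sum(axis=0)
    return (W1, b1, W2, b2), losses


# Toy input: 20 random length-10 sequences (synthetic, not real lincRNA data).
rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list(BASES), size=10)) for _ in range(20)]
X = np.stack([one_hot(s) for s in seqs])

params, losses = train_autoencoder(X)
print(f"reconstruction MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In a detection setting, the hidden code `H` would typically be passed to a supervised classifier; that step, and any stacking into a deeper network, is omitted here.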
