lncRNA-MFDL: identification of human long non-coding RNAs by fusing multiple features and using deep learning.

Long noncoding RNAs (lncRNAs) are emerging as a novel class of noncoding RNAs and potent gene regulators that play important and varied roles in cellular functions. lncRNAs are closely associated with the occurrence and development of some diseases. High-throughput RNA-sequencing techniques combined with de novo assembly have identified a large number of novel transcripts. The discovery of these large and 'hidden' transcriptomes urgently requires effective computational methods that can rapidly distinguish coding RNAs from long noncoding RNAs. In this study, we developed a predictor, named lncRNA-MFDL, that identifies lncRNAs by fusing multiple features (the open reading frame, k-mer, the secondary structure and the most-like coding domain sequence) and classifying them with deep learning algorithms. Using the same human training dataset and a 10-fold cross-validation test, lncRNA-MFDL achieves 97.1% prediction accuracy, which is 5.7%, 3.7% and 3.4% higher than that of the CPC, CNCI and lncRNA-FMFSVM predictors, respectively. On testing datasets from other species (e.g., anole lizard, zebrafish, chicken, gorilla, macaque, mouse, lamprey, orangutan, xenopus and C. elegans), lncRNA-MFDL is also more effective and robust than the CPC and CNCI predictors. These results show that lncRNA-MFDL is a powerful tool for identifying lncRNAs. The lncRNA-MFDL software package is freely available for academic users.
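To make the idea of feature fusion concrete, the following minimal sketch computes two of the feature families named above (k-mer composition and a simple open-reading-frame descriptor) and concatenates them into one vector. It is an illustration only, not the lncRNA-MFDL implementation: the function names, the choice of k = 3, the ORF definition (ATG to the first in-frame stop codon, forward strand only) and the omission of the secondary-structure and most-like coding domain sequence features are all assumptions made here for brevity.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies in a fixed lexicographic order."""
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def longest_orf_length(seq):
    """Length in nucleotides of the longest ATG...stop ORF on the forward strand."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best


def fuse_features(seq, k=3):
    """Concatenate k-mer frequencies with ORF length and ORF coverage."""
    orf_len = longest_orf_length(seq)
    coverage = orf_len / len(seq) if seq else 0.0
    return kmer_frequencies(seq, k) + [float(orf_len), coverage]


if __name__ == "__main__":
    features = fuse_features("CCATGAAATTTTAGGG")
    print(len(features), features[-2:])
```

A fused vector of this kind would then be passed to a classifier; the abstract states that a deep learning classifier is used, but gives no architectural details, so none are assumed here.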
