DeepCPP: a deep neural network based on nucleotide bias information and minimum distribution similarity feature selection for RNA coding potential prediction

The development of deep sequencing technologies has led to the discovery of novel transcripts. Many in silico methods have been developed to assess the coding potential of these transcripts as a step toward investigating their functions. Existing methods distinguish most long noncoding RNAs (lncRNAs) from coding RNAs (mRNAs) well but perform poorly on RNAs with small open reading frames (sORFs). Here, we present DeepCPP (deep neural network for coding potential prediction), a deep learning method for RNA coding potential prediction. Extensive evaluations on four previous datasets and six new datasets constructed from different species show that DeepCPP outperforms other state-of-the-art methods, especially on sORF-type data. It overcomes the bottleneck of sORF mRNA identification, improving accuracy by more than 4.31%, 37.24% and 5.89% on newly discovered human, vertebrate and insect data, respectively. We also show that discontinuous k-mers, together with our newly proposed nucleotide bias feature and minimum distribution similarity feature selection method, play crucial roles in this classification problem. Taken together, DeepCPP is an effective method for RNA coding potential prediction.
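
The abstract does not define the discontinuous k-mer feature. As a minimal sketch, assuming it refers to k-mer frequencies computed with some positions skipped (a gapped pattern), the idea could look like the Python below. The function name `gapped_kmer_freqs` and the `pattern` mask are hypothetical and are not taken from DeepCPP itself.

```python
from collections import Counter
from itertools import product


def gapped_kmer_freqs(seq, pattern="101"):
    """Return frequencies of gapped k-mers in an RNA/DNA sequence.

    `pattern` is a hypothetical mask: '1' keeps a position, '0' skips it.
    For example, "101" pairs each nucleotide with the one two positions later.
    """
    seq = seq.upper().replace("T", "U")  # treat DNA input as RNA
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)

    counts = Counter()
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        key = "".join(window[i] for i in keep)
        # Ignore windows containing ambiguous symbols such as N.
        if set(key) <= set("ACGU"):
            counts[key] += 1

    total = sum(counts.values())
    # Emit every possible k-mer so the feature vector has a fixed length.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGU", repeat=len(keep))
    }


if __name__ == "__main__":
    freqs = gapped_kmer_freqs("AUGGCC", pattern="101")
    print({k: v for k, v in freqs.items() if v > 0})
```

Emitting a value for every possible k-mer, including zeros, keeps the vector length constant across transcripts, which is what a downstream classifier or feature selection step would typically expect.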
