Characterization and identification of long non-coding RNAs based on feature relationship

Motivation: The significance of long non-coding RNAs (lncRNAs) in many biological processes and diseases has attracted intense interest over the past several years. However, computational identification of lncRNAs across a wide range of species remains challenging: it requires either prior knowledge of well-established sequences and annotations or species-specific training data, yet only a limited number of species have high-quality sequences and annotations.

Results: Here we first characterize lncRNAs in contrast to protein-coding RNAs based on feature relationship, and find that the relationship between ORF (open reading frame) length and GC content diverges substantially and universally between lncRNAs and protein-coding RNAs, as observed in a broad variety of species. Based on this feature relationship, we further present LGC, a novel algorithm that accurately distinguishes lncRNAs from protein-coding RNAs in a cross-species manner without any prior knowledge. Validated on large-scale empirical datasets, LGC outperforms existing algorithms, achieving higher accuracy with well-balanced sensitivity and specificity, and is robustly effective (>90% accuracy) in discriminating lncRNAs from protein-coding RNAs across diverse species ranging from plants to mammals. To our knowledge, this study is the first to differentially characterize lncRNAs and protein-coding RNAs based on feature relationship and to apply that characterization to the computational identification of lncRNAs. Taken together, our study represents a significant advance in the characterization and identification of lncRNAs, and LGC thus has broad potential utility for computational analysis of lncRNAs in a wide range of species.

Availability: The LGC web server is publicly available at http://bigd.big.ac.cn/lgc/calculator. The scripts and data can be downloaded at http://bigd.big.ac.cn/biocode/tools/BT000004.

Supplementary information: Supplementary data are available at Bioinformatics online.
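
As an illustration of the two features named in the Results, the following Python sketch computes the longest ORF length and the GC content of a single transcript. It is not the LGC algorithm: the abstract does not specify how LGC defines an ORF or how it combines the two features into a score. The function names and the ORF definition used here (an ATG start followed by the first in-frame stop codon, forward strand only) are assumptions made for this example.

```python
# Illustrative sketch only; NOT the LGC scoring method.
# Assumption: an ORF runs from an ATG to the first in-frame stop codon
# on the forward strand, and its length includes the stop codon.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in the sequence (0.0 if empty)."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_orf_length(seq: str) -> int:
    """Return the length in nucleotides of the longest forward-strand ORF."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None  # close this ORF and look for the next one
    return best


if __name__ == "__main__":
    transcript = "CCATGAAATAGCC"  # hypothetical toy sequence
    print(longest_orf_length(transcript), round(gc_content(transcript), 3))
```

Computing this pair of values for every transcript in an annotated set of lncRNAs and protein-coding RNAs would give the raw data needed to examine how the two features relate within each class.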
