Flnc: Machine Learning Improves the Identification of Novel Long Noncoding RNAs from Stand-Alone RNA-Seq Data

Long noncoding RNAs (lncRNAs) play critical regulatory roles in human development and disease. Although RNA sequencing (RNA-seq) data are available for over 100,000 samples, many lncRNAs have yet to be annotated. The conventional approach to identifying novel lncRNAs from RNA-seq data is to find transcripts without coding potential, but this approach has a false discovery rate of 30–75%. Other existing methods either identify only multi-exon lncRNAs, missing single-exon lncRNAs, or require transcriptional initiation profiling data (such as H3K4me3 ChIP-seq data), which are unavailable for many samples with RNA-seq data. Because of these limitations, current methods cannot accurately identify novel lncRNAs from existing RNA-seq data. To address this problem, we have developed a software tool, Flnc, that accurately identifies both novel and annotated full-length lncRNAs, including single-exon lncRNAs, directly from RNA-seq data without requiring transcriptional initiation profiles. Flnc integrates machine learning models built on four types of features: transcript length, promoter signature, multiple exons, and genomic location. Flnc achieves state-of-the-art prediction power, with an AUROC score over 0.92, and raises prediction accuracy from less than 50% with the conventional approach to over 85%. Flnc is available on GitHub.
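
To make the four feature types and the AUROC metric concrete, the following is a minimal illustrative sketch, not Flnc's actual implementation. The `Transcript` fields, the `promoter_score` value, the `is_intergenic` flag used as a stand-in for genomic location, and the function names are all hypothetical; the abstract does not specify how Flnc encodes its features. The `auroc` function uses the standard rank-based definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcript:
    """Hypothetical record for one assembled transcript (field names are assumptions)."""
    length: int            # transcript length in nucleotides
    promoter_score: float  # assumed promoter-signature score in [0, 1]
    exon_count: int        # number of exons in the transcript
    is_intergenic: bool    # simplified stand-in for genomic location


def features(t: Transcript) -> List[float]:
    """Encode the four feature types named in the abstract as a numeric vector."""
    return [
        float(t.length),
        t.promoter_score,
        1.0 if t.exon_count > 1 else 0.0,  # multi-exon indicator
        1.0 if t.is_intergenic else 0.0,
    ]


def auroc(labels: List[int], scores: List[float]) -> float:
    """Rank-based AUROC: P(score of positive > score of negative), ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In this sketch, any classifier that outputs a score per transcript could be evaluated with `auroc`; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation of true lncRNAs from false candidates.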
