UClncR: Ultrafast and comprehensive long non-coding RNA detection from RNA-seq

Long non-coding RNAs (lncRNAs) are a large class of gene transcripts with regulatory functions discovered in recent years. Many more are expected to be revealed as RNA-seq data accumulate from diverse types of normal and diseased tissues. However, discovering novel lncRNAs and accurately quantifying known lncRNAs from massive RNA-seq data is not trivial. Herein we describe UClncR, an Ultrafast and Comprehensive lncRNA detection pipeline, to tackle this challenge. UClncR takes a standard RNA-seq alignment file, performs transcript assembly, predicts lncRNA candidates, quantifies and annotates both known and novel lncRNA candidates, and generates a convenient report for downstream analysis. The pipeline accommodates both unstranded and stranded RNA-seq so that lncRNAs overlapping with other genes can be predicted and quantified. UClncR is fully parallelized in a cluster environment yet allows users to run samples sequentially without a cluster. The pipeline can process a typical RNA-seq sample in a matter of minutes and complete hundreds of samples in a matter of hours. Analysis of predicted lncRNAs from two test datasets demonstrated UClncR’s accuracy and the relevance of the predicted lncRNAs to sample clinical phenotypes. UClncR should significantly facilitate researchers’ discovery of novel lncRNAs and is publicly available at http://bioinformaticstools.mayo.edu/research/UClncR.
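The candidate-prediction step described above can be illustrated with a minimal sketch. This is not UClncR's code, and the abstract does not state its filtering criteria. The sketch assumes two commonly used conventions: a minimum transcript length of 200 nt and a low coding-potential score. The `Transcript` fields, the thresholds, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    transcript_id: str
    length: int          # spliced length in nucleotides
    coding_prob: float   # hypothetical coding-potential score in [0, 1]
    known: bool          # True if it matches an annotated lncRNA


MIN_LENGTH = 200         # assumption: conventional lncRNA length cutoff
MAX_CODING_PROB = 0.5    # assumption: illustrative threshold, not from the paper


def is_novel_candidate(t: Transcript) -> bool:
    """Keep unannotated transcripts that are long enough and look non-coding."""
    return (not t.known
            and t.length >= MIN_LENGTH
            and t.coding_prob < MAX_CODING_PROB)


def classify(transcripts):
    """Split assembled transcripts into known lncRNAs and novel candidates."""
    report = {"known": [], "novel": []}
    for t in transcripts:
        if t.known:
            report["known"].append(t.transcript_id)
        elif is_novel_candidate(t):
            report["novel"].append(t.transcript_id)
    return report


if __name__ == "__main__":
    assembled = [
        Transcript("t1", 1500, 0.10, True),   # annotated lncRNA
        Transcript("t2", 150, 0.05, False),   # too short
        Transcript("t3", 900, 0.90, False),   # likely protein-coding
        Transcript("t4", 400, 0.20, False),   # novel candidate
    ]
    print(classify(assembled))
```

In a real pipeline the length and coding-potential values would come from the assembler's output and a coding-potential tool; here they are hard-coded so the sketch stays self-contained.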
