ExUTR: a novel pipeline for large-scale prediction of 3′-UTR sequences from NGS data

Background: The three prime untranslated region (3′-UTR) plays a pivotal role in modulating gene expression by determining the fate of mRNA. Many crucial developmental events, such as mammalian spermatogenesis, tissue patterning, sex determination and neurogenesis, rely heavily on post-transcriptional regulation by the 3′-UTR. However, 3′-UTR biology remains a relatively untapped field, with only limited tools and 3′-UTR resources available. To elucidate how the 3′-UTR regulates gene expression, the 3′-UTR sequences must first be identified. Current 3′-UTR mining tools, such as GETUTR, 3USS and UTRscan, all depend on a well-annotated reference genome or curated 3′-UTR sequences, which hinders their application to the many non-model organisms for which genomes are not available. To address these issues, an NGS-based, automated pipeline is needed for genome-wide 3′-UTR prediction in the absence of reference genomes.

Results: Here, we propose ExUTR, a novel NGS-based pipeline to predict and retrieve 3′-UTR sequences from RNA-Seq experiments, designed in particular for non-model species lacking well-annotated genomes. This pipeline integrates existing bioinformatics tools, databases (Uniprot and UTRdb) and novel in-house Perl scripts into a fully automated workflow. Taking transcriptome assemblies as input, the pipeline identifies 3′-UTR signals based primarily on the intrinsic features of transcripts, and outputs predicted 3′-UTR candidates together with associated annotations. In addition, ExUTR requires only minimal computational resources, so it can run on a standard desktop computer with reasonable runtime, making it affordable for most laboratories. We also demonstrate the functionality and extensibility of this pipeline using publicly available RNA-Seq data from both model and non-model species, and further validate the accuracy of the predicted 3′-UTRs using both well-characterized 3′-UTR resources and 3P-Seq data.

Conclusions: ExUTR is a practical and powerful workflow that enables rapid genome-wide 3′-UTR discovery from NGS data. The candidates predicted through this pipeline will advance the study of miRNA target prediction, cis elements in 3′-UTRs, and the evolution and biology of 3′-UTRs. Because it does not depend on a well-annotated reference genome, the pipeline can be applied to a much broader research area, encompassing all species for which RNA-Seq data are available.
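The idea of locating a 3′-UTR from the intrinsic features of an assembled transcript can be illustrated with a minimal Python sketch. It is not ExUTR's implementation (which combines Perl scripts with Uniprot and UTRdb); it assumes, purely for illustration, that the coding region is the longest forward-strand ATG-to-stop open reading frame and that everything downstream of the stop codon is the 3′-UTR candidate. The function names `longest_orf` and `candidate_utr3` are hypothetical.

```python
# Illustrative sketch only: assumes the longest forward-strand ORF marks the
# coding region. This is a simplification, not the method used by ExUTR.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    `end` is the index just past the stop codon. Returns None if no complete
    ORF (start codon followed by an in-frame stop codon) is found.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def candidate_utr3(seq, min_len=1):
    """Return the sequence downstream of the longest ORF, or None."""
    orf = longest_orf(seq)
    if orf is None:
        return None
    utr = seq[orf[1]:]
    return utr if len(utr) >= min_len else None


# Toy transcript: 5' leader + ORF + downstream region with an AATAAA motif.
transcript = "GGC" + "ATGAAACCCGGGTAA" + "TTTGCAATAAAGCC"
print(candidate_utr3(transcript))
```

A real pipeline would also need to handle strand orientation, partial ORFs and homology evidence, all of which this sketch deliberately leaves out.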
