Termin(A)ntor: Polyadenylation Site Prediction Using Deep Learning Models

As a widespread RNA processing mechanism, alternative polyadenylation plays a crucial role in gene regulation. To help decipher its underlying mechanism and understand its impact, it is desirable to comprehensively profile 3’-untranslated region cleavage and associated polyadenylation sites. State-of-the-art polyadenylation site detection tools are known to be influenced by library preparation artefacts or manually selected features. Moreover, recently published machine learning methods have only been tested on pre-constructed datasets, thus lacking validation on experimental data. Here we present Terminitor, the first deep neural network-based profiling pipeline to make predictions from RNA-seq data. We show that Terminitor outperforms competing tools in sensitivity and precision on experimental transcriptome sequencing data, and demonstrate its use with data from short- and long-read sequencing technologies. For species without a good reference transcriptome annotation, Terminitor is still able to transfer the information learnt from a related species and make reasonable predictions. We used Terminitor to showcase how single nucleotide variations can create or destroy polyadenylated cleavage sites in human RNA-seq samples.

Author Summary

3’ cleavage and polyadenylation of pre-mRNA is part of the RNA maturation process. One gene can be cleaved at different positions at its 3’ end, a process known as alternative polyadenylation. Identifying the correct polyadenylated cleavage site (poly(A) CS) is therefore essential to unveil its role in gene regulation under different physiological and pathological conditions. The current poly(A) CS prediction tools are either heavily influenced by RNA-seq library preparation artefacts or have only been designed and tested on ad hoc datasets, lacking association with real-world applications.
In this study, we present a deep learning model, Terminitor, that predicts the probability of a nucleotide sequence containing a poly(A) CS, and we validate its performance on human and mouse data. Along with the model, we propose a poly(A) CS profiling pipeline for RNA-seq data. We benchmarked our pipeline against competing tools and achieved higher sensitivity and precision on experimental data. The usage of Terminitor is not limited to genome and transcriptome annotation; we expect it to facilitate the identification of novel isoforms, improve the accuracy of transcript quantification and differential expression analysis, and contribute to the repertoire of reference transcriptome annotation.
