Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of the poly(A) signal (PAS) in the DNA sequence is key to understanding the mechanisms of translation regulation and mRNA metabolism. Although machine learning methods have been widely used for computational PAS identification, their need for large amounts of annotated data hinders their application to species that lack experimental PAS data. Cross-species PAS identification, which makes it possible to predict PAS in species absent from the training data, is therefore a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolutional Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species, build cross-species training sets from two of them, and evaluate performance on the remaining ones. We also test our method under insufficient-data and imbalanced-data conditions and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when trained on a smaller or imbalanced training set.
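The cross-species evaluation protocol described above (train on two of four species, evaluate on the remaining ones) can be illustrated with a minimal sketch. This is not the authors' code: the species labels, the helper names `cross_species_splits` and `one_hot`, and the one-hot encoding of input sequences are all assumptions made for illustration, and the CNN-MLP network and the domain generalization step are deliberately omitted.

```python
from itertools import combinations

# Hypothetical placeholder labels; the abstract does not name the four species.
SPECIES = ["species_A", "species_B", "species_C", "species_D"]


def cross_species_splits(species, n_train=2):
    """Yield (train_species, target_species) pairs.

    Each split trains on n_train species and holds out all the others,
    so the target species contribute no data to training.
    """
    for train in combinations(species, n_train):
        target = tuple(s for s in species if s not in train)
        yield train, target


def one_hot(seq):
    """Encode a DNA string as a list of 4-tuples in A, C, G, T order.

    Assumed input encoding for a CNN; unknown bases map to all zeros.
    """
    table = {
        "A": (1, 0, 0, 0),
        "C": (0, 1, 0, 0),
        "G": (0, 0, 1, 0),
        "T": (0, 0, 0, 1),
    }
    return [table.get(base, (0, 0, 0, 0)) for base in seq.upper()]


splits = list(cross_species_splits(SPECIES))
for train, target in splits:
    print("train on", train, "-> evaluate on", target)
```

With four species and two used for training, this enumerates six train/target combinations; whether the paper evaluates all of them is not stated in the abstract.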
