Accurate Prediction of Transposon-Derived piRNAs by Integrating Various Sequential and Physicochemical Features

Background: Piwi-interacting RNA (piRNA) is the largest class of small non-coding RNA molecules. Predicting transposon-derived piRNAs can enrich research on small ncRNAs and help to further elucidate the mechanisms of gamete generation.

Methods: In this paper, we attempt to differentiate transposon-derived piRNAs from non-piRNAs based on their sequential and physicochemical features by using machine learning methods. We explore six sequence-derived features, i.e. spectrum profile, mismatch profile, subsequence profile, position-specific scoring matrix, pseudo dinucleotide composition and local structure-sequence triplet elements, and systematically evaluate their performance for transposon-derived piRNA prediction. Finally, we consider two approaches, direct combination and ensemble learning, to integrate the useful features into high-accuracy prediction models.

Results: We construct three datasets covering three species (Human, Mouse and Drosophila) and evaluate the prediction models by 10-fold cross-validation. In the computational experiments, direct combination models achieve AUCs of 0.917, 0.922 and 0.992 on Human, Mouse and Drosophila, respectively; ensemble learning models achieve AUCs of 0.922, 0.926 and 0.994 on the three datasets.

Conclusions: Compared with other state-of-the-art methods, our methods achieve better performance and are promising for transposon-derived piRNA prediction. The source code and datasets are available in S1 File.
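To make the simplest of the six features concrete, the spectrum profile of a sequence can be read as the vector of its k-mer occurrence counts. The following is a minimal sketch under stated assumptions, not the implementation from S1 File: the function name `spectrum_profile`, the default `k=3`, the RNA alphabet `ACGU`, and the optional normalization to frequencies are all illustrative choices that the abstract does not specify.

```python
from itertools import product

# Assumed RNA alphabet; DNA input is handled by mapping T to U.
ALPHABET = "ACGU"


def spectrum_profile(seq, k=3, normalize=True):
    """Return the k-mer count (or frequency) vector of a sequence.

    The vector has 4**k entries, ordered lexicographically over ALPHABET.
    Windows containing characters outside ALPHABET (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0.0] * len(kmers)
    # Slide a window of width k over the sequence and count each k-mer.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:
            counts[position] += 1

    total = sum(counts)
    if normalize and total > 0:
        counts = [c / total for c in counts]
    return counts


if __name__ == "__main__":
    # Hypothetical example sequence, chosen only for illustration.
    profile = spectrum_profile("UGAGGUAGUAGGUUGUAUAGUU", k=2)
    print(len(profile))
```

A vector of this kind can be passed directly to a standard classifier. Under the direct combination approach named in the abstract, it would be concatenated with the vectors of the other features, although the exact combination scheme is not described in this text.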
