MicroRNA Transcription Start Site Prediction with Multi-objective Feature Selection

MicroRNAs (miRNAs) are short (21-23 nt), non-coding regulators of protein-coding genes. They are generally transcribed first into primary miRNA (pri-miR), which is processed into precursor miRNA (pre-miR) and finally into the mature miRNA. A large amount of information is available on pre- and mature miRNAs. However, very little is known about pri-miRs, because their transcription start sites (TSSs) are largely unknown. Based on their genomic loci, miRNAs fall into two types: intragenic (intra-miR) and intergenic (inter-miR). While it is well established that intra-miRs are commonly transcribed in conjunction with their host genes, the transcription machinery of inter-miRs is poorly understood. Although miRNA promoters are assumed to be similar in structure to gene promoters, since both are transcribed by RNA polymerase II (Pol II), computational validation shows that gene promoter prediction methods perform poorly on miRNAs. In this paper, we concentrate on the problem of TSS prediction for miRNAs. The study begins with the identification of positive and negative promoter samples from recently published RNA-sequencing data. From these samples of experimentally validated miRNA TSSs, a number of standard sequence features are extracted. Furthermore, to account for potential footprints of promoter regulation by CpG dinucleotide-targeted DNA methylation, a number of novel features are defined. We develop a support vector machine (SVM) with an RBF kernel, trained on human miRNA promoters, for the prediction of miRNA TSSs. A novel feature reduction technique based on archived multi-objective simulated annealing (AMOSA) identifies the final set of features. The resulting model trained on miRNA promoters outperforms the one trained on protein-coding gene promoters in terms of classification accuracy, sensitivity and specificity. Results are also reported for a completely independent, biologically validated test set. As part of the investigation, the proposed approach is also used to predict protein-coding gene TSSs, where it shows significantly improved performance compared with previously published gene TSS prediction methods.
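
The abstract does not spell out the sequence features or the novel CpG-related ones. As a minimal sketch of the kind of standard feature such a pipeline might compute on a candidate promoter window, the following assumes two widely used quantities, GC content and the CpG observed/expected ratio (number of CpG × window length / (number of C × number of G)). The function name `cpg_features` and the returned keys are hypothetical and are not taken from the paper.

```python
def cpg_features(seq):
    """Return GC content and CpG observed/expected ratio for a DNA window.

    Illustrative only: these are common promoter features, assumed here
    for the sketch; they are not claimed to be the paper's feature set.
    """
    s = seq.upper()
    n = len(s)
    if n == 0:
        return {"gc_content": 0.0, "cpg_obs_exp": 0.0}

    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact

    gc_content = (c + g) / n
    # Observed/expected CpG ratio; defined as 0.0 when C or G is absent.
    cpg_obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"gc_content": gc_content, "cpg_obs_exp": cpg_obs_exp}


if __name__ == "__main__":
    print(cpg_features("ACGTCGCGAATTGGCC"))
```

Feature vectors of this kind could then be passed to an RBF-kernel SVM, with a multi-objective search such as AMOSA choosing which features to keep; both steps are outside the scope of this sketch.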
