Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences

Motivation: Recognition of poly(A) signals in mRNA is relatively straightforward due to the presence of an easily recognizable polyadenylic acid tail. However, identifying the poly(A) motifs in the primary genomic DNA sequence that correspond to poly(A) signals in mRNA is a far more challenging problem. Recognition of poly(A) signals is important for better gene annotation and for understanding gene regulation mechanisms. In this work, we present a poly(A) motif prediction method based on properties of the human genomic DNA sequence surrounding a poly(A) motif. These properties include thermodynamic, physico-chemical and statistical characteristics. For predictions, we developed Artificial Neural Network and Random Forest models. These models are trained to recognize the 12 most common poly(A) motifs in human DNA. Our predictors are available as a free web-based tool accessible at http://cbrc.kaust.edu.sa/dps. Compared with other reported predictors, our models achieve higher sensitivity and specificity and, furthermore, provide a consistent level of accuracy across the 12 poly(A) motif variants.

Contact: vladimir.bajic@kaust.edu.sa

Supplementary information: Supplementary data are available at Bioinformatics online.
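The first step implied by the abstract, locating candidate motifs in genomic DNA and describing the sequence around each one, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the 100 nt flank width and the choice of mononucleotide frequencies as the "statistical" feature are all hypothetical, and the motif list holds just the two best-known human hexamers because the abstract does not enumerate the 12 variants.

```python
from collections import Counter

# Illustrative subset only: the two best-known human poly(A) signal hexamers.
# The 12 variants used by the tool are not listed in the abstract.
CANDIDATE_MOTIFS = ("AATAAA", "ATTAAA")


def find_candidates(seq, motifs=CANDIDATE_MOTIFS):
    """Return sorted (position, motif) pairs for every motif occurrence."""
    seq = seq.upper()
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif))
            # Advance by one base so overlapping occurrences are also found.
            start = seq.find(motif, start + 1)
    return sorted(hits)


def flank_features(seq, pos, motif_len=6, flank=100):
    """Mononucleotide frequencies of the flanks around a motif at `pos`.

    The flank width is an assumed value; flanks are truncated at sequence ends.
    """
    seq = seq.upper()
    upstream = seq[max(0, pos - flank):pos]
    downstream = seq[pos + motif_len:pos + motif_len + flank]
    features = {}
    for name, region in (("up", upstream), ("down", downstream)):
        counts = Counter(region)
        for base in "ACGT":
            # An empty flank yields 0.0 rather than a division error.
            features[f"{name}_{base}"] = counts[base] / len(region) if region else 0.0
    return features
```

Each candidate's feature dictionary could then be passed to a classifier such as a random forest or a neural network, which would decide whether the motif is a likely true poly(A) signal. The thermodynamic and physico-chemical descriptors mentioned in the abstract would be added as further features and are not attempted here.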
